Hi,
I am having problems with RTX PRO 5000 cards.
But with exactly the same setup, RTX 5090 cards do not have any problems.
Both systems are on Ubuntu 24.04.
Both use exactly the same Docker image.
Both use exactly the same vLLM serve command.
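For reference, the serve command is roughly equivalent to the sketch below, reconstructed from the engine config in the dump further down (these are standard vLLM CLI flags; anything not visible in the dump is omitted, so treat this as an approximation rather than the literal command):

```
# Approximate reconstruction from the dumped engine config; not the literal command.
vllm serve RedHatAI/gemma-3-27b-it-FP8-dynamic \
  --served-model-name gemma3 \
  --max-model-len 5000 \
  --kv-cache-dtype fp8 \
  --seed 0
```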
But during inference, the RTX PRO crashes after a while:
(EngineCore pid=72) ERROR 03-19 12:12:38 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.17.1rc1.dev177+gd4c57863f) with config: model='RedHatAI/gemma-3-27b-it-FP8-dynamic', speculative_config=None, tokenizer='RedHatAI/gemma-3-27b-it-FP8-dynamic', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=5000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=gemma3, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '/root/.cache/vllm/torch_compile_cache/5cb1066345', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': , 'compile_ranges_endpoints': [15000], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 104, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': '/root/.cache/vllm/torch_compile_cache/5cb1066345/rank_0_0/backbone', 'fast_moe_cold_start': True, 'static_all_moe_layers': },
(EngineCore pid=72) ERROR 03-19 12:12:38 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=chatcmpl-8cfc43b538a480c1-a6951e2d,prompt_token_ids_len=1249,prefill_token_ids_len=None,mm_features=,sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=['[/INST]', 'User Input:', '<user_input>'], stop_token_ids=[106], bad_words=, include_stop_str_in_output=False, ignore_eos=False, max_tokens=250, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, structured_outputs=None, extra_args=None),block_ids
EngineCore encountered a fatal error.
(EngineCore pid=72) ERROR 03-19 12:12:38 [core.py:1101] Traceback (most recent call last):
(EngineCore pid=72) ERROR 03-19 12:12:38 [core.py:1101] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1092, in run_engine_core
(EngineCore pid=72) ERROR 03-19 12:12:38 [core.py:1101] engine_core.run_busy_loop()
torch.AcceleratorError: CUDA error: unspecified launch failure
(EngineCore pid=72) ERROR 03-19 12:12:38 [core.py:1101] Search for `cudaErrorLaunchFailure' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(EngineCore pid=72) ERROR 03-19 12:12:38 [core.py:1101] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore pid=72) ERROR 03-19 12:12:38 [core.py:1101] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore pid=72) ERROR 03-19 12:12:38 [core.py:1101] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(EngineCore pid=72) ERROR 03-19 12:12:38 [core.py:1101]
(EngineCore pid=72) Process EngineCore:
(EngineCore pid=72) Traceback (most recent call last):
(APIServer pid=1) ERROR 03-19 12:12:38 [async_llm.py:707] AsyncLLM output_handler failed.
(APIServer pid=1) ERROR 03-19 12:12:38 [async_llm.py:707] Traceback (most recent call last):
(APIServer pid=1) ERROR 03-19 12:12:38 [async_llm.py:707] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 663, in output_handler
(APIServer pid=1) ERROR 03-19 12:12:38 [async_llm.py:707] outputs = await engine_core.get_output_async()
(APIServer pid=1) ERROR 03-19 12:12:38 [async_llm.py:707] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) ERROR 03-19 12:12:38 [async_llm.py:707] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 1022, in get_output_async
(APIServer pid=1) ERROR 03-19 12:12:38 [async_llm.py:707] raise self._format_exception(outputs) from None
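If it helps with debugging, I can rerun the container with CUDA_LAUNCH_BLOCKING=1 (as the traceback suggests) to get a synchronous, more accurate stack trace. A minimal sketch, with the image name and the rest of the command as placeholders for my actual setup:

```
# Placeholder image and command; only the added environment variable matters here.
docker run --gpus all \
  -e CUDA_LAUNCH_BLOCKING=1 \
  <my-vllm-image> \
  vllm serve RedHatAI/gemma-3-27b-it-FP8-dynamic ...
```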