Do the RTX 5090 and RTX PRO 5000 have differences that should be taken into account?

Hi,

I am having problems with RTX PRO 5000 cards.
But with exactly the same setup, the RTX 5090 does not have any problems.
Both systems are on Ubuntu 24.04.
Both use exactly the same Docker image.
Both use exactly the same vLLM serve command.
But during inference the RTX PRO crashes after a while:

(EngineCore pid=72) ERROR 03-19 12:12:38 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.17.1rc1.dev177+gd4c57863f) with config: model='RedHatAI/gemma-3-27b-it-FP8-dynamic', speculative_config=None, tokenizer='RedHatAI/gemma-3-27b-it-FP8-dynamic', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=5000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=gemma3, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '/root/.cache/vllm/torch_compile_cache/5cb1066345', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_endpoints': [15000], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 104, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': '/root/.cache/vllm/torch_compile_cache/5cb1066345/rank_0_0/backbone', 'fast_moe_cold_start': True, 'static_all_moe_layers': []},
(EngineCore pid=72) ERROR 03-19 12:12:38 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=chatcmpl-8cfc43b538a480c1-a6951e2d,prompt_token_ids_len=1249,prefill_token_ids_len=None,mm_features=[],sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=['[/INST]', 'User Input:', '<user_input>'], stop_token_ids=[106], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=250, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, structured_outputs=None, extra_args=None),block_ids
EngineCore encountered a fatal error.
(EngineCore pid=72) ERROR 03-19 12:12:38 [core.py:1101] Traceback (most recent call last):
(EngineCore pid=72) ERROR 03-19 12:12:38 [core.py:1101] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1092, in run_engine_core
(EngineCore pid=72) ERROR 03-19 12:12:38 [core.py:1101] engine_core.run_busy_loop()
torch.AcceleratorError: CUDA error: unspecified launch failure
(EngineCore pid=72) ERROR 03-19 12:12:38 [core.py:1101] Search for `cudaErrorLaunchFailure' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(EngineCore pid=72) ERROR 03-19 12:12:38 [core.py:1101] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore pid=72) ERROR 03-19 12:12:38 [core.py:1101] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore pid=72) ERROR 03-19 12:12:38 [core.py:1101] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(EngineCore pid=72) ERROR 03-19 12:12:38 [core.py:1101]
(EngineCore pid=72) Process EngineCore:
(EngineCore pid=72) Traceback (most recent call last):
(APIServer pid=1) ERROR 03-19 12:12:38 [async_llm.py:707] AsyncLLM output_handler failed.
(APIServer pid=1) ERROR 03-19 12:12:38 [async_llm.py:707] Traceback (most recent call last):
(APIServer pid=1) ERROR 03-19 12:12:38 [async_llm.py:707] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 663, in output_handler
(APIServer pid=1) ERROR 03-19 12:12:38 [async_llm.py:707] outputs = await engine_core.get_output_async()
(APIServer pid=1) ERROR 03-19 12:12:38 [async_llm.py:707] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) ERROR 03-19 12:12:38 [async_llm.py:707] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 1022, in get_output_async
(APIServer pid=1) ERROR 03-19 12:12:38 [async_llm.py:707] raise self._format_exception(outputs) from None

Your RTX PRO 5000 crash with “CUDA error: unspecified launch failure” is likely due to a CUDA kernel or driver incompatibility, or a hardware-specific issue. The fact that the RTX 5090 works fine with the same setup suggests the problem is specific to the RTX PRO 5000, not your vLLM or Docker configuration. This type of error is often seen with unsupported CUDA architectures, driver mismatches, or missing kernel images for the device, as documented in similar vLLM issues for Blackwell/RTX PRO cards and in the CUDA Runtime API documentation.

To troubleshoot, ensure your NVIDIA driver and CUDA toolkit are fully up to date and compatible with the RTX PRO 5000. Also, try setting CUDA_LAUNCH_BLOCKING=1 for more precise error reporting, and consider rebuilding vLLM and its dependencies with the correct CUDA architecture flags for your GPU. If the problem persists, it may be a known issue with vLLM and certain new GPU architectures, as seen in related vLLM issues and troubleshooting docs. Would you like more detailed step-by-step debugging instructions?
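In the meantime, a quick way to separate a hardware or driver fault from a vLLM problem is a small soak test in plain PyTorch, run inside the same Docker image on both hosts. This is a minimal sketch, not an official tool; the file name soak_test.py and the matrix size are arbitrary, and it assumes a CUDA-enabled PyTorch build:

```python
# soak_test.py - minimal GPU soak test, independent of vLLM (a sketch).
import os
# Must be set before CUDA initializes so launch errors surface synchronously.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")

import torch

def main(iters: int = 2000, size: int = 8192) -> None:
    assert torch.cuda.is_available(), "CUDA is not available"
    dev = torch.device("cuda:0")
    print("device:", torch.cuda.get_device_name(dev))
    a = torch.randn(size, size, device=dev, dtype=torch.bfloat16)
    b = torch.randn(size, size, device=dev, dtype=torch.bfloat16)
    for i in range(iters):
        a @ b                          # keep the SMs busy
        if i % 200 == 0:
            torch.cuda.synchronize()   # flush pending work, surface errors
            print(f"iteration {i} ok")
    torch.cuda.synchronize()
    print("soak test finished without launch failures")

if __name__ == "__main__":
    main()
```

If this crashes on the RTX PRO 5000 but not on the 5090, you are looking at a driver, VBIOS, or hardware problem rather than anything in vLLM.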

Hint: Mention @RunLLM in the post for followups.

Yes, could you give debugging instructions, and which versions should I compare between these systems? The hosts' nvidia-smi shows for both:
NVIDIA-SMI 580.126.09 Driver Version: 580.126.09 CUDA Version: 13.0
Persistence-M shows Off for the 5090 but On for the RTX PRO.

To debug this issue, first ensure both systems have identical NVIDIA driver and CUDA versions (which you confirmed: 580.126.09, CUDA 13.0). Next, compare the output of nvidia-smi for both cards, focusing on the “CUDA Version,” “Driver Version,” and “Persistence-M” settings. Persistence mode differences are generally not the cause of kernel launch failures, but you can set it to “On” for both with nvidia-smi -pm 1 for consistency (NVIDIA docs).
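To make that comparison mechanical, you can dump a short fingerprint on each host and diff the two outputs. A sketch; the --query-gpu field names are assumptions based on common nvidia-smi builds, so check nvidia-smi --help-query-gpu if one is rejected:

```python
# gpu_fingerprint.py - print a per-GPU fingerprint to diff between hosts.
import subprocess

# Assumed field names; verify with `nvidia-smi --help-query-gpu`.
FIELDS = "name,driver_version,persistence_mode,compute_cap,vbios_version"

out = subprocess.run(
    ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout

print(out.strip())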

For step-by-step debugging (a sample script covering steps 1-3 follows the list):

  1. Set CUDA_LAUNCH_BLOCKING=1 in your environment before running vLLM to get accurate error traces.
  2. Run a minimal CUDA program (e.g., python -c "import torch; print(torch.cuda.is_available())") on both systems to confirm CUDA is working.
  3. Check that vLLM and all dependencies (PyTorch, CUDA libraries) are built for the correct compute capability (12.0 for Blackwell/RTX 5090/PRO 5000).
  4. If the error persists only on the RTX PRO 5000, try running with a different model or with --enforce-eager to disable CUDA graphs, as some kernels may not be compatible.
  5. Review the vLLM troubleshooting guide for additional environment variables and debugging tips.
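Here is the kind of sample script meant above, covering steps 1-3. It is a sketch, not part of vLLM; note that a missing sm_120 entry in torch.cuda.get_arch_list() is only a warning sign, since PTX JIT compilation can still cover newer GPUs:

```python
# vllm_env_check.py - sanity checks for steps 1-3 (a sketch).
import os
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")      # step 1

import torch

print("torch:", torch.__version__, "| cuda:", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())     # step 2

if torch.cuda.is_available():
    cap = torch.cuda.get_device_capability(0)           # (12, 0) on Blackwell
    archs = torch.cuda.get_arch_list()                  # arches in this build
    print("device capability:", cap)
    print("built for:", archs)
    if f"sm_{cap[0]}{cap[1]}" not in archs:             # step 3
        print("WARNING: no native kernels for this GPU; PTX JIT may still work")
```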

Would you like more detail on any of these steps?

With enforce-eager the inference went through on the RTX PRO, but it is much slower and does not use the card's full power. The card temperature goes to 85 °C with the fan blowing at 58% and the card drawing 230 W.
Does this tell you something?

Running with --enforce-eager disables CUDA Graphs and torch.compile optimizations, which significantly reduces inference speed and GPU utilization—this is expected behavior (vLLM debug docs). High temperature and moderate fan/power usage indicate the GPU is working, but not at full throughput, likely due to less efficient kernel launches and higher CPU overhead in eager mode.

This suggests the root issue is with CUDA Graphs or kernel compatibility on the RTX PRO 5000. The fact that inference works in eager mode but not with CUDA Graphs points to a kernel or driver mismatch, or a missing/unsupported kernel for your GPU architecture. Would you like guidance on how to further diagnose or enable CUDA Graphs safely?
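One way to test that hypothesis outside vLLM is to capture and replay a CUDA Graph in plain PyTorch on the RTX PRO 5000. The sketch below follows PyTorch's documented warm-up-then-capture pattern; if replay fails here as well, the fault sits in the driver or CUDA runtime rather than in vLLM:

```python
# graph_test.py - capture and replay a CUDA Graph outside vLLM (a sketch).
import torch

dev = torch.device("cuda:0")
a = torch.randn(4096, 4096, device=dev, dtype=torch.bfloat16)
b = torch.randn(4096, 4096, device=dev, dtype=torch.bfloat16)
out = torch.empty_like(a)

# Warm up on a side stream, as PyTorch recommends before graph capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    torch.matmul(a, b, out=out)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    torch.matmul(a, b, out=out)

for _ in range(10_000):        # replay many times to provoke the failure
    g.replay()
torch.cuda.synchronize()
print("CUDA Graph replay completed without errors")
```

If plain capture and replay succeed, you could also try an intermediate vLLM setting (for example a piecewise cudagraph mode via --compilation-config, if your vLLM version exposes it) before going back to the full mode.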

Okay, the RTX PRO host's Ubuntu kernel is 6.14.0-37-generic,
while the working 5090's is 6.17.0-14-generic.
Could this be the problem?
On the other hand, I was able to run the full inference task once with the RTX PRO using the current kernel.

@RunLLM

A kernel version mismatch (6.14 vs 6.17) is unlikely to be the root cause, since you were able to run a full inference once on the RTX PRO and both kernels are recent and supported by NVIDIA's 580.126.09 driver (NVIDIA driver compatibility). Kernel differences can surface stability issues or edge-case driver bugs, but an intermittent CUDA launch failure like yours usually points elsewhere.

Since inference succeeded once, the issue is more likely related to CUDA Graphs, kernel caching, or a race condition, not the Linux kernel version. If you want to rule out the kernel, you can try upgrading the RTX PRO host to match the 5090’s kernel, but this is not strictly necessary unless you see consistent kernel-level errors in dmesg or the NVIDIA driver logs. Would you like more detail on how to check for kernel/driver errors or further isolate the problem?
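Concretely, the thing to look for is NVIDIA Xid events, which the driver writes to the kernel log ("NVRM: Xid ...") when a GPU fault such as an unspecified launch failure occurs. A minimal sketch, assuming dmesg access (usually root); adapt it to journalctl if your system does not keep a dmesg buffer:

```python
# xid_scan.py - scan the kernel log for NVIDIA driver error reports (a sketch).
import subprocess

log = subprocess.run(
    ["dmesg", "--time-format", "iso"],   # usually needs root
    capture_output=True, text=True, check=True,
).stdout

hits = [line for line in log.splitlines() if "NVRM" in line or "Xid" in line]
print("\n".join(hits) if hits else "no NVRM/Xid entries found")
```

If an Xid number shows up around the time of the crash, it helps narrow down whether the fault originated in the application, the driver, or the hardware.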
