Segfault in cublasLt/cuLaunchKernel on RTX 5080 using v0.21.0 (V1 Engine)

  1. Describe the bug I am experiencing a hard Segmentation Fault when running an OCR task (chandra-ocr-2) using the V1 Engine in vLLM v0.21.0. The crash happens during the forward pass of a multi-modal request, specifically at the matrix multiplication stage (cublasLtTSTMatmul).

  2. Environment

vLLM Image: vllm/vllm-openai:v0.21.0-ubuntu2404
Engine: V1 Engine (enabled by default in this tag’s configuration)
Model: chandra-ocr-2 (Multi-modal/OCR)
GPU: NVIDIA GeForce RTX 5080 (Blackwell Architecture, Compute Capability 10.0)
Driver Version: (NVIDIA-SMI 580.159.03)
CUDA Version: (cuda_13.0.r13.0/compiler.36424714_0)

  1. Log Output The engine crashes immediately when processing a multi-modal request:

(APIServer pid=1) INFO: Running: 1 reqs, MM cache hit rate: 33.3%
!!! Segfault encountered !!!
File “”, line 0, in cuLaunchKernel
File “”, line 0, in cublasLtTSTMatmul
File “”, line 0, in cublasGemmEx

(APIServer pid=1) ERROR 05-29 12:02:09 [async_llm.py:704] AsyncLLM output_handler failed.
(APIServer pid=1) ERROR 05-29 12:02:09 [async_llm.py:704] Traceback (most recent call last):
(APIServer pid=1) ERROR 05-29 12:02:09 [async_llm.py:704] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py”, line 660, in output_handler
(APIServer pid=1) ERROR 05-29 12:02:09 [async_llm.py:704] outputs = await engine_core.get_output_async()
(APIServer pid=1) ERROR 05-29 12:02:09 [async_llm.py:704] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) ERROR 05-29 12:02:09 [async_llm.py:704] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py”, line 998, in get_output_async
(APIServer pid=1) ERROR 05-29 12:02:09 [async_llm.py:704] raise self._format_exception(outputs) from None
(APIServer pid=1) ERROR 05-29 12:02:09 [async_llm.py:704] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.

  1. Steps to Reproduce

Start the container using the v0.21.0-ubuntu2404 image on an RTX 5080.
Load chandra-ocr-2 model.
Send a chat completion request containing an image (OCR task).
The server hits a Segfault as soon as the Vision Encoder/Linear layers are triggered.

  1. Additional Context I suspect the issue is a mismatch between the Blackwell GPU architecture and the CUDA kernels/cublasLt versions bundled in the v0.21.0 image. The crash occurs exactly during bfloat16 gemm operations.

Your segmentation fault during multi-modal (OCR) inference with chandra-ocr-2 on an RTX 5080 (Blackwell, compute 10.0) in vLLM v0.21.0 is likely due to CUDA kernel or cuBLAS incompatibility with the new Blackwell architecture. The crash at cublasLtTSTMatmul (bfloat16 GEMM) matches known issues where CUDA libraries or precompiled kernels do not support the latest GPU compute capabilities, resulting in hard faults during matrix multiplication in the vision encoder stage. This is consistent with similar failures reported for Blackwell GPUs and BF16 GEMM in vLLM and other LLM serving stacks.

There is no explicit fix in the vLLM documentation for this specific hardware/software combination. The recommended actions are: (1) ensure you are using the latest NVIDIA driver and CUDA toolkit that fully support Blackwell (compute 10.0), (2) rebuild vLLM and all dependencies (especially PyTorch and CUDA extensions) from source with CUDA 13.x and Blackwell support, and (3) monitor for upstream vLLM releases that add official Blackwell compatibility. This is not a model or user error, but a platform support gap in the current vLLM release and its CUDA stack. See similar root causes in vLLM issue #40661 and troubleshooting docs.

Would you like a step-by-step guide for rebuilding vLLM and dependencies for Blackwell GPUs?

Sources:

Hint: Mention @RunLLM in the post for followups.