-
Describe the bug I am experiencing a hard Segmentation Fault when running an OCR task (chandra-ocr-2) using the V1 Engine in vLLM v0.21.0. The crash happens during the forward pass of a multi-modal request, specifically at the matrix multiplication stage (cublasLtTSTMatmul).
-
Environment
vLLM Image: vllm/vllm-openai:v0.21.0-ubuntu2404
Engine: V1 Engine (enabled by default in this tag’s configuration)
Model: chandra-ocr-2 (Multi-modal/OCR)
GPU: NVIDIA GeForce RTX 5080 (Blackwell Architecture, Compute Capability 10.0)
Driver Version: (NVIDIA-SMI 580.159.03)
CUDA Version: (cuda_13.0.r13.0/compiler.36424714_0)
- Log Output The engine crashes immediately when processing a multi-modal request:
(APIServer pid=1) INFO: Running: 1 reqs, MM cache hit rate: 33.3%
!!! Segfault encountered !!!
File “”, line 0, in cuLaunchKernel
File “”, line 0, in cublasLtTSTMatmul
File “”, line 0, in cublasGemmEx
…
(APIServer pid=1) ERROR 05-29 12:02:09 [async_llm.py:704] AsyncLLM output_handler failed.
(APIServer pid=1) ERROR 05-29 12:02:09 [async_llm.py:704] Traceback (most recent call last):
(APIServer pid=1) ERROR 05-29 12:02:09 [async_llm.py:704] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py”, line 660, in output_handler
(APIServer pid=1) ERROR 05-29 12:02:09 [async_llm.py:704] outputs = await engine_core.get_output_async()
(APIServer pid=1) ERROR 05-29 12:02:09 [async_llm.py:704] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) ERROR 05-29 12:02:09 [async_llm.py:704] File “/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py”, line 998, in get_output_async
(APIServer pid=1) ERROR 05-29 12:02:09 [async_llm.py:704] raise self._format_exception(outputs) from None
(APIServer pid=1) ERROR 05-29 12:02:09 [async_llm.py:704] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
- Steps to Reproduce
Start the container using the v0.21.0-ubuntu2404 image on an RTX 5080.
Load chandra-ocr-2 model.
Send a chat completion request containing an image (OCR task).
The server hits a Segfault as soon as the Vision Encoder/Linear layers are triggered.
- Additional Context I suspect the issue is a mismatch between the Blackwell GPU architecture and the CUDA kernels/cublasLt versions bundled in the v0.21.0 image. The crash occurs exactly during bfloat16 gemm operations.