1. Describe the bug I am experiencing a hard Segmentation Fault when running an OCR task (chandra-ocr-2) using the V1 Engine in vLLM v0.21.0. The crash happens during the forward pass of a multi-modal request, specifically at the matrix multiplication stage (cublasLtTSTMatmul).
2. Environment
-
vLLM Image:
vllm/vllm-openai:v0.21.0-ubuntu2404 -
Engine: V1 Engine (enabled by default in this tag’s configuration)
-
Model:
chandra-ocr-2(Multi-modal/OCR) -
GPU: NVIDIA GeForce RTX 5080 (Blackwell Architecture, Compute Capability 10.0)
-
Driver Version: (Please insert your
nvidia-smiversion here) -
CUDA Version: (Inside Docker:
nvcc --version)
3. Log Output The engine crashes immediately when processing a multi-modal request:
Plaintext
(APIServer pid=1) INFO: Running: 1 reqs, MM cache hit rate: 33.3%
!!!!!!! Segfault encountered !!!!!!!
File "<unknown>", line 0, in cuLaunchKernel
File "<unknown>", line 0, in cublasLtTSTMatmul
File "<unknown>", line 0, in cublasGemmEx
...
(APIServer pid=1) ERROR: vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue.
4. Steps to Reproduce
-
Start the container using the
v0.21.0-ubuntu2404image on an RTX 5080. -
Load
chandra-ocr-2model. -
Send a chat completion request containing an image (OCR task).
-
The server hits a Segfault as soon as the Vision Encoder/Linear layers are triggered.
5. Additional Context I suspect the issue is a mismatch between the Blackwell GPU architecture and the CUDA kernels/cublasLt versions bundled in the v0.21.0 image. The crash occurs exactly during bfloat16 gemm operations.