[Bug] Segfault in cublasLt/cuLaunchKernel on RTX 5080 using v0.21.0 (V1 Engine)

JiYong · May 31, 2026, 3:59pm

1. Describe the bug I am experiencing a hard Segmentation Fault when running an OCR task (chandra-ocr-2) using the V1 Engine in vLLM v0.21.0. The crash happens during the forward pass of a multi-modal request, specifically at the matrix multiplication stage (cublasLtTSTMatmul).

2. Environment

vLLM Image: vllm/vllm-openai:v0.21.0-ubuntu2404
Engine: V1 Engine (enabled by default in this tag’s configuration)
Model: chandra-ocr-2 (Multi-modal/OCR)
GPU: NVIDIA GeForce RTX 5080 (Blackwell Architecture, Compute Capability 10.0)
Driver Version: (Please insert your nvidia-smi version here)
CUDA Version: (Inside Docker: nvcc --version)

3. Log Output The engine crashes immediately when processing a multi-modal request:

Plaintext

(APIServer pid=1) INFO: Running: 1 reqs, MM cache hit rate: 33.3%
!!!!!!! Segfault encountered !!!!!!!
  File "<unknown>", line 0, in cuLaunchKernel
  File "<unknown>", line 0, in cublasLtTSTMatmul
  File "<unknown>", line 0, in cublasGemmEx
  ...
(APIServer pid=1) ERROR: vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue.

4. Steps to Reproduce

Start the container using the v0.21.0-ubuntu2404 image on an RTX 5080.
Load chandra-ocr-2 model.
Send a chat completion request containing an image (OCR task).
The server hits a Segfault as soon as the Vision Encoder/Linear layers are triggered.

5. Additional Context I suspect the issue is a mismatch between the Blackwell GPU architecture and the CUDA kernels/cublasLt versions bundled in the v0.21.0 image. The crash occurs exactly during bfloat16 gemm operations.

RunLLM · May 31, 2026, 4:00pm

This segmentation fault is likely due to missing or incompatible CUDA kernel support for the Blackwell (RTX 5080, compute capability 10.0) architecture in the vLLM v0.21.0 Docker image. The official vLLM wheels and Docker images may not yet include precompiled kernels or cublasLt support for Blackwell GPUs, leading to crashes during matrix multiplication in multi-modal (OCR) models like chandra-ocr-2, especially at bfloat16 GEMM stages. This is a known issue for new GPU architectures until explicit support is added in vLLM and its dependencies (see related issues, 18916, 22325).

Workarounds include: (1) building vLLM from source with the correct CUDA arch flags (e.g., torch_cuda_arch_list="10.0"), (2) ensuring your CUDA, cuBLAS, and PyTorch versions are the latest and support Blackwell, and (3) waiting for an official vLLM release with Blackwell support in wheels and Docker images. This issue is not specific to your model, but to the platform compatibility (details, 22325). Would you like detailed build instructions or troubleshooting steps?

Sources:

Hint: Mention @RunLLM in the post for followups.

Topic		Replies	Views
Segfault in cublasLt/cuLaunchKernel on RTX 5080 using v0.21.0 (V1 Engine) General	1	23	May 31, 2026
[Bug] Segfault in PythonSymNodeImpl and Deadlock on RTX 5090 (Blackwell) with vLLM 0.11.2 General	1	38	June 1, 2026
Is the reason for my vllm 0.20.0 failing to start because of nixl? General	9	258	June 3, 2026
Why is this not working? I corrected it but still NVIDIA GPU Support	1	926	May 8, 2025
Docker image `vllm/vllm-openai:v0.9.0` doesn't work on 5090 General	3	925	June 10, 2025

[Bug] Segfault in cublasLt/cuLaunchKernel on RTX 5080 using v0.21.0 (V1 Engine)

Related topics