[Bug] Segfault in cublasLt/cuLaunchKernel on RTX 5080 using v0.21.0 (V1 Engine)

1. Describe the bug I am experiencing a hard Segmentation Fault when running an OCR task (chandra-ocr-2) using the V1 Engine in vLLM v0.21.0. The crash happens during the forward pass of a multi-modal request, specifically at the matrix multiplication stage (cublasLtTSTMatmul).

2. Environment

  • vLLM Image: vllm/vllm-openai:v0.21.0-ubuntu2404

  • Engine: V1 Engine (enabled by default in this tag’s configuration)

  • Model: chandra-ocr-2 (Multi-modal/OCR)

  • GPU: NVIDIA GeForce RTX 5080 (Blackwell Architecture, Compute Capability 10.0)

  • Driver Version: (Please insert your nvidia-smi version here)

  • CUDA Version: (Inside Docker: nvcc --version)

3. Log Output The engine crashes immediately when processing a multi-modal request:

Plaintext

(APIServer pid=1) INFO: Running: 1 reqs, MM cache hit rate: 33.3%
!!!!!!! Segfault encountered !!!!!!!
  File "<unknown>", line 0, in cuLaunchKernel
  File "<unknown>", line 0, in cublasLtTSTMatmul
  File "<unknown>", line 0, in cublasGemmEx
  ...
(APIServer pid=1) ERROR: vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue.

4. Steps to Reproduce

  1. Start the container using the v0.21.0-ubuntu2404 image on an RTX 5080.

  2. Load chandra-ocr-2 model.

  3. Send a chat completion request containing an image (OCR task).

  4. The server hits a Segfault as soon as the Vision Encoder/Linear layers are triggered.

5. Additional Context I suspect the issue is a mismatch between the Blackwell GPU architecture and the CUDA kernels/cublasLt versions bundled in the v0.21.0 image. The crash occurs exactly during bfloat16 gemm operations.

This segmentation fault is likely due to missing or incompatible CUDA kernel support for the Blackwell (RTX 5080, compute capability 10.0) architecture in the vLLM v0.21.0 Docker image. The official vLLM wheels and Docker images may not yet include precompiled kernels or cublasLt support for Blackwell GPUs, leading to crashes during matrix multiplication in multi-modal (OCR) models like chandra-ocr-2, especially at bfloat16 GEMM stages. This is a known issue for new GPU architectures until explicit support is added in vLLM and its dependencies (see related issues, 18916, 22325).

Workarounds include: (1) building vLLM from source with the correct CUDA arch flags (e.g., torch_cuda_arch_list="10.0"), (2) ensuring your CUDA, cuBLAS, and PyTorch versions are the latest and support Blackwell, and (3) waiting for an official vLLM release with Blackwell support in wheels and Docker images. This issue is not specific to your model, but to the platform compatibility (details, 22325). Would you like detailed build instructions or troubleshooting steps?

Sources:

Hint: Mention @RunLLM in the post for followups.