CUDA 12.8 not working with vLLM

Hi everyone, I am relatively new to this stuff so please pardon any mistakes.

I am trying to run the qwen3.5 32b and glm 4.7 flash models on an NVIDIA H100, which has CUDA 12.8.

However, the vLLM version I have (0.22) seems to be compiled against CUDA 12.9. The docs mention this and also explain how to get binaries compiled against other versions, but those either don’t work for CUDA 12.8 or return a 404 error. So I am wondering: does vLLM no longer support 12.8, or are there other ways to achieve this?

Command I am trying -

vllm serve Qwen/Qwen3.5-35B-A3B \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --served-model-name qwen3.5-35b \
  --safetensors-load-strategy prefetch \
  --enforce-eager

Error - “CUDA driver version is insufficient for CUDA runtime version”.

Please let me know in case any other information is needed to help out. Thanks!

vLLM pre-built wheels are now compiled mainly for CUDA 12.9 and 13.0; official CUDA 12.8 wheels are often missing or return 404, as you’ve seen. The H100 itself is not the constraint (Hopper GPUs have been supported since CUDA 11.8); the limit here is your installed driver. If you need CUDA 12.8 support and can’t upgrade your driver, you must build vLLM from source with your specific CUDA/PyTorch versions. The error “CUDA driver version is insufficient for CUDA runtime version” means your driver is too old for the CUDA runtime bundled with the vLLM wheel (and its PyTorch build) that you installed. This is a common issue when the installed driver does not meet the requirements of the pre-built binaries.
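
A quick way to see which side of that mismatch you are on is to compare the CUDA version reported in the `nvidia-smi` header (the newest runtime the driver supports) with the CUDA version your PyTorch build was compiled for. The sketch below is a minimal illustration, not an official vLLM tool; the helper names are hypothetical, and it assumes `nvidia-smi` prints its usual `CUDA Version: X.Y` header field.

```python
import re
import shutil
import subprocess


def parse_version(text):
    """Turn '12.8' into (12, 8) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_cuda_from_smi(smi_output):
    """Return the 'CUDA Version: X.Y' value from nvidia-smi output, or None."""
    match = re.search(r"CUDA Version:\s*(\d+\.\d+)", smi_output)
    return match.group(1) if match else None


def runtime_fits_driver(driver_cuda, runtime_cuda):
    """Conservative check: driver's CUDA version >= runtime's CUDA version."""
    return parse_version(driver_cuda) >= parse_version(runtime_cuda)


if __name__ == "__main__":
    try:
        import torch  # optional; only used to read the build's CUDA version
        runtime = torch.version.cuda  # None on CPU-only builds
    except ImportError:
        runtime = None

    smi = shutil.which("nvidia-smi")
    output = subprocess.run([smi], capture_output=True, text=True).stdout if smi else ""
    driver = driver_cuda_from_smi(output)

    print("driver supports CUDA up to:", driver)
    print("PyTorch built for CUDA:", runtime)
    if driver and runtime:
        print("runtime fits driver:", runtime_fits_driver(driver, runtime))
```

The check is deliberately strict. CUDA's minor-version compatibility may still let a 12.9 runtime work on a driver that reports 12.8, so a `False` within the same major version is a warning rather than a verdict. A major-version gap, such as a 13.0 runtime on a driver that reports 12.8, is the case that typically produces this error.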

To resolve this, either (1) upgrade your driver and CUDA to match the pre-built wheel (CUDA 12.9+), or (2) build vLLM from source on your system with CUDA 12.8 and your installed PyTorch. See the official installation docs and related GitHub issues for step-by-step instructions. Would you like detailed build-from-source steps?

Hint: Mention @RunLLM in the post for followups.

@RunLLM can’t I downgrade the vLLM version and find pre-built wheels for CUDA 12.8?

You can try downgrading vLLM to an earlier version that provides pre-built wheels for CUDA 12.8, but recent releases (e.g., v0.18.0 and v0.17.0) often only have wheels for CUDA 13.0, not 12.8; attempts to download 12.8 wheels usually result in 404 errors. The only reliable way to get CUDA 12.8 support is to build vLLM from source, as confirmed by users and maintainers in this GitHub issue.

Would you like detailed instructions for building vLLM from source for CUDA 12.8?

@RunLLM what if I downgrade even lower?

I just need support to run qwen3.5 32b and glm 4.7 flash.

You can downgrade and install vllm 0.21.0

Correct me if I am wrong, but that also doesn’t have 12.8 pre-built wheels, right?

Pardon me, I just checked my env. I have vLLM: 0.21, PyTorch: 2.11.0+cu130, CUDA: 13.0.

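vLLM 0.21/0.22 wheels are compiled against CUDA 12.9+/13.0, which needs a newer driver than one that tops out at CUDA 12.8. Your environment has the cu130 build of PyTorch, so the CUDA 13.0 requirement is the one that applies.
Fix: Update your NVIDIA driver to 575+ (for CUDA 12.9) or 580+ (for CUDA 13.0).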
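
To check an installed driver against a given CUDA runtime, a small lookup is enough. This is only a sketch: the minimum Linux driver versions in the table are assumptions recalled from NVIDIA's CUDA release notes and should be verified there, and the function names are hypothetical.

```python
# Minimum Linux driver per CUDA toolkit release.
# ASSUMPTION: values recalled from NVIDIA's CUDA release notes; verify before relying on them.
MIN_LINUX_DRIVER = {
    "12.8": "570.26",
    "12.9": "575.51.03",
    "13.0": "580.65.06",
}


def as_tuple(version):
    """Turn '575.51.03' into (575, 51, 3) for numeric comparison."""
    return tuple(int(part) for part in version.split("."))


def driver_is_new_enough(installed_driver, cuda_runtime):
    """True if the installed driver meets the table's minimum for that runtime."""
    required = MIN_LINUX_DRIVER.get(cuda_runtime)
    if required is None:
        raise KeyError(f"no minimum driver recorded for CUDA {cuda_runtime}")
    return as_tuple(installed_driver) >= as_tuple(required)


if __name__ == "__main__":
    # Example: a 570-series driver against the three runtimes in the table.
    for runtime in MIN_LINUX_DRIVER:
        print(runtime, driver_is_new_enough("570.86.10", runtime))
```

The installed driver version can be read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader`. Without root access the driver cannot be changed, so the lookup mainly tells you which CUDA runtime to target when picking a PyTorch build or building vLLM from source.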

yeah unfortunately, I don’t have root access. I’ll try building my own vllm from source. Thanks for the help though!