CUDA 12.8 not working with vLLM

Hello123 · June 4, 2026, 3:57pm

Hi everyone, I am relatively new to this stuff so please pardon any mistakes.

I am trying to run the Qwen3.5 32B and GLM 4.7 Flash models on an NVIDIA H100, which has CUDA 12.8.

However, the vLLM version I have (0.22) seems to be compiled against CUDA 12.9. The docs mention this and also explain how to get binaries compiled against other CUDA versions, but those either don’t work for CUDA 12.8 or return a 404 error. So I am wondering: does vLLM no longer support 12.8, or are there other ways to achieve this?

Command I am trying:

vllm serve Qwen/Qwen3.5-35B-A3B \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --served-model-name qwen3.5-35b \
  --safetensors-load-strategy prefetch \
  --enforce-eager

Error - “CUDA driver version is insufficient for CUDA runtime version”.

Please let me know in case any other information is needed to help out. Thanks!

RunLLM · June 4, 2026, 3:58pm

vLLM pre-built wheels are now compiled mainly for CUDA 12.9 and 13.0; official CUDA 12.8 wheels are often missing or return 404, as you’ve seen. The H100 itself does not require such a recent toolkit (Hopper GPUs have been supported since CUDA 11.8), so the limiting factor is your driver: if you need CUDA 12.8 support and can’t upgrade the driver, you must build vLLM from source against your specific CUDA and PyTorch versions. The error “CUDA driver version is insufficient for CUDA runtime version” means the installed driver is older than what the CUDA runtime shipped with the vLLM wheel requires. This is a common issue when the driver does not meet the requirements of the pre-built binary.

To resolve this, either (1) upgrade your driver and CUDA to match the pre-built wheel (CUDA 12.9+), or (2) build vLLM from source on your system with CUDA 12.8 and your installed PyTorch. See the official installation docs and related GitHub issues for step-by-step instructions. Would you like detailed build-from-source steps?
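The driver/runtime mismatch described above can be checked before installing anything. The following is a minimal sketch, not an official vLLM or NVIDIA tool: the helper names are hypothetical, and it assumes the `nvidia-smi` header contains a `CUDA Version: X.Y` field (the highest CUDA version the driver supports). The comparison is deliberately conservative; NVIDIA's minor-version compatibility can let a newer runtime of the same major version run on an older driver in some cases, so a `False` result here is a warning rather than a guarantee of failure.

```python
import re
import subprocess


def parse_version(text):
    """Turn a string such as '12.8' into a comparable tuple (12, 8)."""
    major, minor = text.strip().split(".")[:2]
    return int(major), int(minor)


def driver_cuda_from_smi(smi_output):
    """Extract the 'CUDA Version: X.Y' field from nvidia-smi output.

    Returns None when the field is not present.
    """
    match = re.search(r"CUDA Version:\s*(\d+\.\d+)", smi_output)
    return match.group(1) if match else None


def driver_is_sufficient(driver_cuda, runtime_cuda):
    """Conservative check: driver's CUDA version >= the wheel's runtime version."""
    return parse_version(driver_cuda) >= parse_version(runtime_cuda)


def report(runtime_cuda):
    """Hypothetical helper: compare the local driver with a target runtime."""
    smi = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
    driver_cuda = driver_cuda_from_smi(smi)
    if driver_cuda is None:
        print("Could not read a CUDA version from nvidia-smi")
        return
    ok = driver_is_sufficient(driver_cuda, runtime_cuda)
    print(f"driver supports CUDA {driver_cuda}, wheel needs {runtime_cuda}: "
          f"{'OK' if ok else 'driver too old'}")
```

Calling `report("12.9")` on the affected machine would print which side of the comparison the driver falls on; the runtime version to pass in is whatever the wheel you plan to install was built against.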


Hello123 · June 4, 2026, 4:28pm

@RunLLM can’t I downgrade the vLLM version and find pre-built wheels for CUDA 12.8?

RunLLM · June 4, 2026, 4:28pm

You can try downgrading vLLM to an earlier version that provides pre-built wheels for CUDA 12.8, but recent releases (e.g., v0.18.0 and v0.17.0) often only have wheels for CUDA 13.0, not 12.8, and attempts to download 12.8 wheels usually result in 404 errors. The only reliable way to target CUDA 12.8 is to build vLLM from source, as confirmed by users and maintainers in the GitHub issue listed under Sources.

Would you like detailed instructions for building vLLM from source for CUDA 12.8?

Sources:

GitHub Issue: v0.18.0 cu128 release wheel URL returns 404


Hello123 · June 4, 2026, 4:29pm

@RunLLM what if I downgrade even lower?

I just need support to run Qwen3.5 32B and GLM 4.7 Flash.

ManikandanB · June 4, 2026, 4:51pm

You can downgrade and install vLLM 0.21.0.

Hello123 · June 4, 2026, 4:55pm

Correct me if I am wrong, but that also doesn’t have CUDA 12.8 pre-built wheels, right?

ManikandanB · June 4, 2026, 5:36pm

Pardon me, I just checked my env: I have vLLM 0.21, PyTorch 2.11.0+cu130, and CUDA 13.0.

vLLM 0.21/0.22 wheels are compiled against CUDA 12.9 or 13.0, which need a newer driver than the one that ships with CUDA 12.8.
Fix: Update your NVIDIA driver to 575+ (for CUDA 12.9) or 580+ (for CUDA 13.0).
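A PyTorch version string such as `2.11.0+cu130` already encodes the CUDA version the build targets, which is how the environment check above reveals a CUDA 13.0 build. As a small illustration, the hypothetical helper below decodes that local version tag; it assumes the usual `+cuXYZ` convention, where the last digit is the minor version.

```python
import re


def cuda_from_torch_version(version):
    """Decode the CUDA version from a PyTorch version string.

    '2.11.0+cu130' -> '13.0'; returns None for CPU-only or untagged builds.
    """
    match = re.search(r"\+cu(\d{2,})$", version)
    if match is None:
        return None
    digits = match.group(1)
    # All digits but the last form the major version; the last is the minor.
    return f"{int(digits[:-1])}.{digits[-1]}"


print(cuda_from_torch_version("2.11.0+cu130"))
```

In a live environment the same information is available directly from `torch.__version__` and `torch.version.cuda`, so the parser is mainly useful when inspecting wheel names or `pip list` output without importing PyTorch.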

Hello123 · June 4, 2026, 5:38pm

Yeah, unfortunately I don’t have root access. I’ll try building my own vLLM from source. Thanks for the help though!
