When trying to run Qwen3-Next on B200 and GB200, I got the error below with the following command. Does vLLM support running Qwen3-Next on Blackwell GPUs? Thank you.
Yes, vLLM supports running Qwen3-Next on Blackwell GPUs (B200/GB200), but your error points to an unsupported attention kernel configuration, not a lack of model support. The official vLLM recipes document successful Qwen3-Next launches on H200, H20, and A100/A800; Blackwell support is implied for similar architectures, but you must ensure your CUDA, NCCL, and vLLM versions are up to date and that your configuration matches the recommended settings. The error you see (regarding headDim and numTokensPerPage) suggests an incompatibility between the current attention backend or kernel and your chosen model and hardware setup. Try reducing the tensor parallel size, adjusting --max-model-len, or updating vLLM and its dependencies per the Qwen3-Next usage guide.
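As a rough sketch of those adjustments (the model tag, tensor parallel size, and context length here are illustrative assumptions, not values confirmed for your hardware):

```shell
# Illustrative vLLM launch for Qwen3-Next with reduced settings.
# The checkpoint name, --tensor-parallel-size, and --max-model-len
# are assumptions -- tune them for your B200/GB200 node and memory.
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct \
  --tensor-parallel-size 4 \
  --max-model-len 32768
```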
If you need optimal performance on Blackwell, also check for any required environment variables or kernel tuning (e.g., MoE kernel tuning) described in the guide. If the error persists, it may be a kernel bug or a limitation of the current vLLM/FlashAttention/FlashInfer stack on Blackwell, in which case you may need to wait for an upstream fix or switch to a different attention backend as a workaround.
Would you like a step-by-step troubleshooting guide for this setup?
The Qwen3-Next usage guide doesn't say that it can run on Blackwell, no? I know it's fine on H100/H200. Based on the logs of Qwen3-Next on Hopper, it uses FlashInfer, just like in the error from B200. Also, I didn't set an attention backend in my command (shown in the original question). Can you please give me a link/command for Qwen3-Next on Blackwell?