When trying to run Qwen3-Next on B200 and GB200, I got the error below with the following command. Does vLLM support running Qwen3-Next on Blackwell GPUs? Thank you.
Yes, vLLM supports running Qwen3-Next on Blackwell GPUs (B200/GB200), but your error points to an unsupported attention kernel configuration, not a lack of model support. The official vLLM recipes document successful Qwen3-Next launches on H200, H20, and A100/A800; Blackwell support is implied for similar architectures, but you must ensure your CUDA, NCCL, and vLLM versions are up to date and that your configuration matches the recommended settings. The error you see (regarding headDim and numTokensPerPage) suggests an incompatibility between the current attention backend or kernel and your chosen model and hardware setup. Try reducing the tensor parallel size, adjusting --max-model-len, or updating vLLM and its dependencies per the Qwen3-Next usage guide.
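As a rough sketch of those adjustments (the model tag, tensor parallel size, and context length here are illustrative assumptions, not values confirmed for your hardware):

```shell
# Illustrative vLLM launch for Qwen3-Next with reduced settings.
# The checkpoint name, --tensor-parallel-size, and --max-model-len
# are assumptions -- tune them for your B200/GB200 node and memory.
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct \
  --tensor-parallel-size 4 \
  --max-model-len 32768
```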
If you need optimal performance on Blackwell, also check for any required environment variables or kernel tuning (e.g., MoE kernel tuning) described in the guide. If the error persists, it may be a kernel bug or a limitation of the current vLLM/FlashAttention/FlashInfer stack on Blackwell, in which case you may need to wait for an upstream fix or switch to a different attention backend as a workaround.
Would you like a step-by-step troubleshooting guide for this setup?
The Qwen3-Next usage guide doesn't say that it can run on Blackwell, no? I know it's fine on H100/H200. Based on the logs of Qwen3-Next on Hopper, it uses FlashInfer, just like in the error from B200. Also, I didn't set an attention backend in my command (shown in the original question). Can you please give me a link/command for Qwen3-Next on Blackwell?