Running NVFP4 Nemotron model on Win11/WSL RTX 5080 + 5070 Ti

The error "ValueError: NvFp4 MoE backend 'FLASHINFER_CUTLASS' does not support the deployment configuration since kernel does not support current device" means that the NVFP4 MoE kernel in your vLLM build does not support your Blackwell (SM120) GPUs out of the box. This is a known issue for Nemotron-3-Nano-30B-A3B-NVFP4 and similar models on the RTX 5080 and 5070 Ti: the FlashInfer/CUTLASS kernels required for SM120 are not included in the official vLLM Docker images or wheels. The error is not caused by WSL or Docker; it comes from missing SM120 kernel support in the backend you are using (source, source).
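To confirm that your GPUs really report the SM120 family, you can query their compute capability from Python. The following is a minimal sketch, assuming PyTorch with CUDA support is installed in the environment; the helper names (is_sm120_family, list_gpu_capabilities) are made up for illustration, and treating any 12.x capability as "SM120 family" is an assumption based on the "12.0 12.1" values mentioned in this thread.

```python
from typing import List, Tuple


def is_sm120_family(capability: Tuple[int, int]) -> bool:
    """Return True for compute capability 12.x (assumed here to mean 'SM120 family')."""
    major, _minor = capability
    return major == 12


def list_gpu_capabilities() -> List[Tuple[str, Tuple[int, int]]]:
    """Return (device name, (major, minor)) for each visible CUDA device.

    Returns an empty list if PyTorch is not installed or CUDA is unavailable.
    """
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [
        (torch.cuda.get_device_name(i), torch.cuda.get_device_capability(i))
        for i in range(torch.cuda.device_count())
    ]


if __name__ == "__main__":
    for name, cap in list_gpu_capabilities():
        tag = "SM120 family" if is_sm120_family(cap) else "other architecture"
        print(f"{name}: compute capability {cap[0]}.{cap[1]} ({tag})")
```

If both cards are listed with a 12.x capability, the error above is most likely the missing-kernel problem rather than a driver or WSL passthrough issue.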

To resolve this, build vLLM from source with the correct CUDA architecture setting (e.g., torch_cuda_arch_list="12.0 12.1") and use a vLLM version that includes the merged PR adding SM120 support to the FlashInfer CUTLASS NVFP4 MoE kernels (source). Official Docker images and prebuilt wheels do not yet include these kernels for SM120, so vllm/vllm-openai:latest will not work directly. See this build example for the correct Docker build command. After building, your original command should work.
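As a rough illustration of the architecture setting, the sketch below turns a list of compute capabilities into the space-separated value used for torch_cuda_arch_list. It assumes the value is passed to the image build as a Docker --build-arg, which is how I read the "Docker build command" mentioned above; the function names are hypothetical, and the rest of the build command (Dockerfile path, target, tag) depends on your vLLM checkout and is deliberately not shown.

```python
from typing import Iterable, Tuple


def arch_list_value(capabilities: Iterable[Tuple[int, int]]) -> str:
    """Format (major, minor) pairs as a sorted, de-duplicated, space-separated list."""
    unique = sorted(set(capabilities))
    return " ".join(f"{major}.{minor}" for major, minor in unique)


def build_arg(capabilities: Iterable[Tuple[int, int]]) -> str:
    """Return the NAME=VALUE string to place after --build-arg (assumed usage)."""
    return f"torch_cuda_arch_list={arch_list_value(capabilities)}"


if __name__ == "__main__":
    # RTX 5080 and RTX 5070 Ti both report compute capability 12.0.
    gpus = [(12, 0), (12, 0)]
    print(build_arg(gpus))
```

Building only for the architectures you actually have keeps compile time down; add 12.1 only if you also need that target.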

Would you like a step-by-step guide for building vLLM with SM120 support for your GPUs?
