Running NVFP4 Nemotron model on Win11/WSL RTX 5080 + 5070 Ti

RunLLM · February 2, 2026, 12:41pm

The error "ValueError: NvFp4 MoE backend 'FLASHINFER_CUTLASS' does not support the deployment configuration since kernel does not support current device" means that the NVFP4 MoE kernels in your vLLM build do not support your Blackwell (SM120) GPUs. This is a known issue for Nemotron-3-Nano-30B-A3B-NVFP4 and similar models on the RTX 5080 and 5070 Ti: the FlashInfer/CUTLASS kernels required for SM120 are not included in the official vLLM Docker images or wheels. The error is not caused by WSL or Docker; it comes from missing SM120 kernel support in the backend you are using (source, source).
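To confirm that your cards really report SM120 before rebuilding anything, a minimal sketch along these lines may help. It assumes PyTorch with CUDA support is installed in the environment where vLLM runs; the helper names sm_tag and visible_gpus are made up for illustration and are not part of vLLM.

```python
from typing import List, Tuple

Capability = Tuple[int, int]


def sm_tag(capability: Capability) -> str:
    """Format a (major, minor) compute capability as an SM tag, e.g. (12, 0) -> 'SM120'."""
    major, minor = capability
    return f"SM{major}{minor}"


def visible_gpus() -> List[Tuple[str, Capability]]:
    """Return (name, capability) per visible CUDA device, or [] if PyTorch/CUDA is unavailable."""
    try:
        import torch  # assumption: PyTorch is installed alongside vLLM
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [
        (torch.cuda.get_device_name(i), torch.cuda.get_device_capability(i))
        for i in range(torch.cuda.device_count())
    ]


if __name__ == "__main__":
    # Print one line per GPU so you can compare against the SM120 requirement.
    for name, cap in visible_gpus():
        print(f"{name}: {sm_tag(cap)}")
```

If both GPUs print SM120, the error above is consistent with the missing-kernel explanation rather than a driver or WSL problem.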

To resolve this, build vLLM from source with the correct CUDA architecture flags (e.g., torch_cuda_arch_list="12.0 12.1"), using a vLLM version that includes the merged PR adding SM120 support to the FlashInfer CUTLASS NVFP4 MoE kernels (source). Because the official Docker images and prebuilt wheels do not yet include these kernels for SM120, vllm/vllm-openai:latest cannot be used directly. See this build example for the correct Docker build command. After building, your original command should work.
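As a rough sketch of what such a build invocation could look like, the snippet below assembles a docker build command with the architecture list passed as a build argument. The Dockerfile path, the build target, and the image tag are assumptions made for illustration; check them against the vLLM checkout you are building from. The script only prints the command and does not run it.

```python
import shlex
from typing import Iterable, List, Tuple


def arch_list(capabilities: Iterable[Tuple[int, int]]) -> str:
    """Turn (major, minor) pairs into a space-separated arch list, deduplicated and sorted."""
    return " ".join(f"{major}.{minor}" for major, minor in sorted(set(capabilities)))


def docker_build_argv(
    archs: str,
    dockerfile: str = "docker/Dockerfile",  # assumption: path inside the vLLM repo
    target: str = "vllm-openai",            # assumption: build stage name
    tag: str = "vllm-sm120:local",          # hypothetical local image tag
) -> List[str]:
    """Build the argv for a docker build that passes the arch list as a build argument."""
    return [
        "docker", "build",
        "-f", dockerfile,
        "--target", target,
        "--build-arg", f"torch_cuda_arch_list={archs}",
        "-t", tag,
        ".",
    ]


if __name__ == "__main__":
    # RTX 5080 and RTX 5070 Ti both report compute capability 12.0.
    archs = arch_list([(12, 0), (12, 0)])
    print(shlex.join(docker_build_argv(archs)))
```

Run the printed command from the root of the vLLM source tree, then start the server from the resulting image instead of vllm/vllm-openai:latest.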

Would you like a step-by-step guide for building vLLM with SM120 support for your GPUs?

