Running NVFP4 Nemotron model on Win11/WSL RTX 5080 + 5070 Ti

Hello, I am trying to run the NVIDIA Nemotron 3 Nano NVFP4 model across the 2x Blackwell GPUs installed in my system. The model weights are only about 20 GB, so they should fit within my dual-GPU configuration's 32 GB of combined VRAM, even allowing 2-4 GB for OS overhead.

System Configuration

  • Motherboard: Gigabyte Aorus B650E Elite ICE AX (firmware version F41, latest, January 2026)
  • CPU: AMD Ryzen 9 9950X
  • Memory: 128 GB (4x32 GB) TeamGroup DDR5
  • GPU1: NVIDIA GeForce RTX 5080 (16 GB VRAM)
  • GPU0: NVIDIA GeForce RTX 5070 Ti (16 GB VRAM)
  • NVIDIA driver 591.86
  • Windows 11 25H2 fully patched
  • Docker Desktop v4.58.0 (latest)
  • WSL version: 2.5.9.0
  • Kernel version: 6.6.87.2-1
  • WSLg version: 1.0.66
  • MSRDC version: 1.2.6074
  • Direct3D version: 1.611.1-81528511
  • DXCore version: 10.0.26100.1-240331-1435.ge-release
  • Windows version: 10.0.26200.7623

Here’s the nvidia-smi output from inside the Docker Desktop WSL environment:

Actual Result

According to Google AI Studio, the following vLLM container command should launch NVIDIA Nemotron 3 Nano NVFP4 successfully:

docker run --gpus all `
    --rm `
    -v "C:\git\NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4:/model" `
    -p 8000:8000 `
    --env VLLM_USE_FLASHINFER_MOE_FP4=1 `
    --env VLLM_FLASHINFER_MOE_BACKEND=throughput `
    --env CUDA_DEVICE_ORDER=PCI_BUS_ID `
    --ipc=host `
    vllm/vllm-openai:latest `
    --model /model `
    --served-model-name nemotron `
    --max-model-len 131072 `
    --max-num-seqs 8 `
    --kv-cache-dtype fp8 `
    --trust-remote-code `
    --reasoning-parser-plugin "/model/nano_v3_reasoning_parser.py" `
    --reasoning-parser nano_v3

When I run this command, I get the following errors:

Starting to load model /model...
Using flashinfer-cutlass for NVFP4 GEMM
Process EngineCore_DP0:
EngineCore failed to start.
ValueError: NvFp4 MoE backend 'FLASHINFER_CUTLASS' does not support the deployment configuration since kernel does not support current device.

Expected Result

The NVFP4 model should run on both NVIDIA Blackwell GPUs installed in my local system, under the Windows 11 WSL environment, via Docker Desktop.

Does anyone have any ideas on how to resolve this and correctly run NVIDIA Nemotron 3 Nano NVFP4?

The error ValueError: NvFp4 MoE backend 'FLASHINFER_CUTLASS' does not support the deployment configuration since kernel does not support current device means the vLLM NVFP4 MoE kernel does not yet support your Blackwell (SM120) GPUs out of the box. This is a known issue for Nemotron-3-Nano-30B-A3B-NVFP4 and similar models on the RTX 5080/5070 Ti: the FlashInfer/CUTLASS NVFP4 kernels for SM120 are not included in the official vLLM Docker images or wheels. The error is not specific to WSL or Docker; it reflects missing SM120 kernel support in the backend you are using.
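To make the failure concrete: prebuilt GPU kernels are compiled for a fixed list of CUDA architectures, and at load time the backend checks whether the current device's compute capability is in that list. The sketch below illustrates this with a made-up helper name (the real dispatch logic lives inside vLLM/FlashInfer; the exact arch lists in the official wheels are an assumption here):

```python
# Toy illustration of the arch check behind the error.
# The function name and arch sets are hypothetical, not vLLM's actual code.

def kernel_supports_device(compiled_archs: set[str], device_cap: tuple[int, int]) -> bool:
    """Return True if a kernel compiled for `compiled_archs` can run on a
    device with the given compute capability, e.g. (12, 0) for SM120."""
    return f"{device_cap[0]}.{device_cap[1]}" in compiled_archs

# Suppose the prebuilt image targets datacenter archs but not consumer SM120:
prebuilt = {"8.0", "8.9", "9.0", "10.0"}

print(kernel_supports_device(prebuilt, (12, 0)))            # False -> the ValueError
print(kernel_supports_device(prebuilt | {"12.0"}, (12, 0))) # True after a source build
```

The RTX 5080 and 5070 Ti both report compute capability 12.0, so a backend built without 12.0 in its arch list refuses to load its NVFP4 MoE kernels on them.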

To resolve this, build vLLM from source with the correct CUDA architecture flags (e.g., torch_cuda_arch_list="12.0 12.1") using a vLLM version that includes the merged PR adding SM120 support to the FlashInfer CUTLASS NVFP4 MoE kernels (source). The official Docker images and prebuilt wheels do not yet ship these kernels for SM120, so vllm/vllm-openai:latest will not work as-is. See this build example for the correct Docker build command. After building, your command should work as expected.
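For reference, a source build along those lines might look like the following. This is an untested sketch: the image tag is arbitrary, and the `torch_cuda_arch_list` / `max_jobs` build-arg names follow the build arguments the vLLM Dockerfile exposes — verify them against the Dockerfile in your checkout.

```shell
# From a checkout of the vLLM repo at a commit that includes SM120 MoE support:
git clone https://github.com/vllm-project/vllm.git
cd vllm

# Build the OpenAI-server image with consumer Blackwell included.
# "12.0" covers the RTX 5080 / 5070 Ti (SM120).
DOCKER_BUILDKIT=1 docker build . \
    --target vllm-openai \
    --tag vllm-openai:sm120 \
    --build-arg torch_cuda_arch_list="12.0" \
    --build-arg max_jobs=8
```

On Windows PowerShell, replace the `\` line continuations with backticks. Expect the build to take a long time and a lot of RAM; lowering `max_jobs` trades build speed for memory.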

Would you like a step-by-step guide for building vLLM with SM120 support for your GPUs?


I got it working by switching to NVIDIA’s own vLLM container image (nvcr.io/nvidia/vllm) instead of the upstream vllm/vllm-openai one:

docker run `
    --gpus all `
    -v "C:\git\NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4:/model" `
    -p 8000:8000 `
    --env "VLLM_USE_FLASHINFER_MOE_FP4=1" `
    --env "VLLM_FLASHINFER_MOE_BACKEND=throughput" `
    --env CUDA_DEVICE_ORDER=PCI_BUS_ID `
    --env "CUDA_VISIBLE_DEVICES=0,1" `
    --ipc=host `
    nvcr.io/nvidia/vllm:26.01-py3 `
    vllm serve `
    --model /model `
    --served-model-name nemotron `
    --max-model-len 30000 `
    --max-num-seqs 8 `
    --kv-cache-dtype fp8 `
    --trust-remote-code `
    --reasoning-parser-plugin "/model/nano_v3_reasoning_parser.py" `
    --reasoning-parser nano_v3 `
    --tensor-parallel-size 2
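
Once the server is up, you can sanity-check it against the OpenAI-compatible chat endpoint. A minimal Python sketch (the model name "nemotron" matches --served-model-name above; the network call is kept in a separate function so the request construction reads on its own):

```python
import json
import urllib.request

# Build an OpenAI-style chat completion request for the served model.
payload = {
    "model": "nemotron",  # must match --served-model-name
    "messages": [{"role": "user", "content": "Say hello in five words."}],
    "max_tokens": 64,
}
body = json.dumps(payload).encode()

def query(url: str = "http://localhost:8000/v1/chat/completions") -> dict:
    """POST the request to the local vLLM server and return the parsed JSON."""
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    reply = query()
    print(reply["choices"][0]["message"]["content"])
```

An equivalent quick check from the shell is a curl POST of the same JSON body to the same URL.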