Running NVFP4 Nemotron model on Win11/WSL RTX 5080 + 5070 Ti

Hello, I am trying to run the NVIDIA Nemotron 3 Nano NVFP4 model across the 2x Blackwell GPUs installed in my system. The model weights are only about 20 GB, so they should fit within my dual-GPU configuration's 32 GB of combined VRAM, even allowing 2-4 GB for OS overhead.

System Configuration

  • Motherboard: Gigabyte Aorus B650E Elite ICE AX (firmware version F41, latest, January 2026)
  • CPU: AMD Ryzen 9 9950X
  • Memory: 128 GB (4x32 GB) TeamGroup DDR5
  • GPU1: NVIDIA GeForce RTX 5080 (16 GB VRAM)
  • GPU0: NVIDIA GeForce RTX 5070 Ti (16 GB VRAM)
  • NVIDIA driver 591.86
  • Windows 11 25H2 fully patched
  • Docker Desktop v4.58.0 (latest)
  • WSL version: 2.5.9.0
  • Kernel version: 6.6.87.2-1
  • WSLg version: 1.0.66
  • MSRDC version: 1.2.6074
  • Direct3D version: 1.611.1-81528511
  • DXCore version: 10.0.26100.1-240331-1435.ge-release
  • Windows version: 10.0.26200.7623

Here’s the nvidia-smi output from inside the Docker Desktop WSL environment:

Actual Result

According to Google AI Studio, the following vLLM container command should launch NVIDIA Nemotron 3 Nano NVFP4 successfully:

docker run --gpus all `
    --rm `
    -v "C:\git\NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4:/model" `
    -p 8000:8000 `
    --env VLLM_USE_FLASHINFER_MOE_FP4=1 `
    --env VLLM_FLASHINFER_MOE_BACKEND=throughput `
    --env CUDA_DEVICE_ORDER=PCI_BUS_ID `
    --ipc=host `
    vllm/vllm-openai:latest `
    --model /model `
    --served-model-name nemotron `
    --max-model-len 131072 `
    --max-num-seqs 8 `
    --kv-cache-dtype fp8 `
    --trust-remote-code `
    --reasoning-parser-plugin "/model/nano_v3_reasoning_parser.py" `
    --reasoning-parser nano_v3

When I run this command, I get the following errors:

Starting to load model /model...
Using flashinfer-cutlass for NVFP4 GEMM
Process EngineCore_DP0:
EngineCore failed to start.
ValueError: NvFp4 MoE backend 'FLASHINFER_CUTLASS' does not support the deployment configuration since kernel does not support current device.

Expected Result

The NVFP4 model should run on both NVIDIA Blackwell GPUs installed in my local system, under the Windows 11 WSL environment, via Docker Desktop.

Does anyone have any ideas on how to resolve this and correctly run NVIDIA Nemotron 3 Nano NVFP4?

The error ValueError: NvFp4 MoE backend 'FLASHINFER_CUTLASS' does not support the deployment configuration since kernel does not support current device means the vLLM NVFP4 MoE kernel does not yet support your Blackwell (SM120) GPUs out of the box. This is a known issue for Nemotron-3-Nano-30B-A3B-NVFP4 and similar models on the RTX 5080/5070 Ti: the FlashInfer/CUTLASS NVFP4 kernels for SM120 are not included in the official vLLM Docker images or wheels. The error is not specific to WSL or Docker; it reflects missing SM120 kernel support in the backend you are using.
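To make the failure concrete: prebuilt GPU kernels are compiled for a fixed list of CUDA architectures, and at load time the backend checks whether the current device's compute capability is in that list. The sketch below illustrates this with a made-up helper name (the real dispatch logic lives inside vLLM/FlashInfer; the exact arch lists in the official wheels are an assumption here):

```python
# Toy illustration of the arch check behind the error.
# The function name and arch sets are hypothetical, not vLLM's actual code.

def kernel_supports_device(compiled_archs: set[str], device_cap: tuple[int, int]) -> bool:
    """Return True if a kernel compiled for `compiled_archs` can run on a
    device with the given compute capability, e.g. (12, 0) for SM120."""
    return f"{device_cap[0]}.{device_cap[1]}" in compiled_archs

# Suppose the prebuilt image targets datacenter archs but not consumer SM120:
prebuilt = {"8.0", "8.9", "9.0", "10.0"}

print(kernel_supports_device(prebuilt, (12, 0)))            # False -> the ValueError
print(kernel_supports_device(prebuilt | {"12.0"}, (12, 0))) # True after a source build
```

The RTX 5080 and 5070 Ti both report compute capability 12.0, so a backend built without 12.0 in its arch list refuses to load its NVFP4 MoE kernels on them.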

To resolve this, build vLLM from source with the correct CUDA architecture flags (e.g., torch_cuda_arch_list="12.0 12.1") using a vLLM version that includes the merged PR adding SM120 support to the FlashInfer CUTLASS NVFP4 MoE kernels (source). The official Docker images and prebuilt wheels do not yet ship these kernels for SM120, so vllm/vllm-openai:latest will not work as-is. See this build example for the correct Docker build command. After building, your command should work as expected.
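For reference, a source build along those lines might look like the following. This is an untested sketch: the image tag is arbitrary, and the `torch_cuda_arch_list` / `max_jobs` build-arg names follow the build arguments the vLLM Dockerfile exposes — verify them against the Dockerfile in your checkout.

```shell
# From a checkout of the vLLM repo at a commit that includes SM120 MoE support:
git clone https://github.com/vllm-project/vllm.git
cd vllm

# Build the OpenAI-server image with consumer Blackwell included.
# "12.0" covers the RTX 5080 / 5070 Ti (SM120).
DOCKER_BUILDKIT=1 docker build . \
    --target vllm-openai \
    --tag vllm-openai:sm120 \
    --build-arg torch_cuda_arch_list="12.0" \
    --build-arg max_jobs=8
```

On Windows PowerShell, replace the `\` line continuations with backticks. Expect the build to take a long time and a lot of RAM; lowering `max_jobs` trades build speed for memory.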

Would you like a step-by-step guide for building vLLM with SM120 support for your GPUs?


I got it working by switching to NVIDIA’s own vLLM container image (nvcr.io/nvidia/vllm) instead of the upstream vllm/vllm-openai one:

docker run `
    --gpus all `
    -v "C:\git\NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4:/model" `
    -p 8000:8000 `
    --env "VLLM_USE_FLASHINFER_MOE_FP4=1" `
    --env "VLLM_FLASHINFER_MOE_BACKEND=throughput" `
    --env CUDA_DEVICE_ORDER=PCI_BUS_ID `
    --env "CUDA_VISIBLE_DEVICES=0,1" `
    --ipc=host `
    nvcr.io/nvidia/vllm:26.01-py3 `
    vllm serve `
    --model /model `
    --served-model-name nemotron `
    --max-model-len 30000 `
    --max-num-seqs 8 `
    --kv-cache-dtype fp8 `
    --trust-remote-code `
    --reasoning-parser-plugin "/model/nano_v3_reasoning_parser.py" `
    --reasoning-parser nano_v3 `
    --tensor-parallel-size 2
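
Once the server is up, you can sanity-check it against the OpenAI-compatible chat endpoint. A minimal Python sketch (the model name "nemotron" matches --served-model-name above; the network call is kept in a separate function so the request construction reads on its own):

```python
import json
import urllib.request

# Build an OpenAI-style chat completion request for the served model.
payload = {
    "model": "nemotron",  # must match --served-model-name
    "messages": [{"role": "user", "content": "Say hello in five words."}],
    "max_tokens": 64,
}
body = json.dumps(payload).encode()

def query(url: str = "http://localhost:8000/v1/chat/completions") -> dict:
    """POST the request to the local vLLM server and return the parsed JSON."""
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    reply = query()
    print(reply["choices"][0]["message"]["content"])
```

An equivalent quick check from the shell is a curl POST of the same JSON body to the same URL.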