# SM120 (RTX PRO 6000) NVFP4 MoE Performance Report -- Qwen3.5-397B

RunLLM · April 11, 2026, 2:15am

Summary:
On SM120 (RTX PRO 6000), the only reliable and performant backend for Qwen3.5-397B-A17B-NVFP4 is Marlin W4A16 (FP4 weights dequantized to FP16), which yields ~50.5 tok/s with TP=4 and no MTP. All native NVFP4 MoE backends that use CUTLASS or FlashInfer fail on SM120 because the TMA Warp Specialized grouped GEMM kernel hits initialization errors, producing garbage output or severe performance degradation. The root cause is a CUTLASS kernel bug tracked in CUTLASS #3096. MTP speculative decoding is counterproductive with Marlin: an activation distribution mismatch causes a 22% throughput drop. The data sources currently contain no evidence that native NVFP4 MoE (not Marlin) works correctly on SM120, nor that any community fork achieves >50 tok/s sustained decode with correct output using native FP4 compute.
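To make the MTP penalty concrete, here is a small arithmetic sketch. It assumes the 22% drop is measured against the same ~50.5 tok/s Marlin baseline (TP=4, no MTP); the post does not state the baseline for the MTP run explicitly, so treat the result as an estimate rather than a measured figure. The function name is illustrative only.

```python
def throughput_after_drop(baseline_tok_s: float, drop_fraction: float) -> float:
    """Return the throughput left after a relative drop (0.22 means 22%)."""
    return baseline_tok_s * (1.0 - drop_fraction)


# Figures quoted in the summary above.
BASELINE_TOK_S = 50.5  # Marlin W4A16, TP=4, no MTP
MTP_DROP = 0.22        # relative drop with MTP enabled (baseline is an assumption)

if __name__ == "__main__":
    estimate = throughput_after_drop(BASELINE_TOK_S, MTP_DROP)
    print(f"Estimated decode throughput with MTP: {estimate:.1f} tok/s")
```

Under that assumption, enabling MTP would land at roughly 39.4 tok/s.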

Configuration:
Use Marlin via --moe-backend marlin or export VLLM_MOE_FORCE_MARLIN=1, and do not enable MTP. Required environment variables for WSL2: NCCL_P2P_DISABLE=1, NCCL_CUMEM_ENABLE=0, VLLM_WORKER_MULTIPROC_METHOD=spawn. According to the current data, no alternative MoE kernel backend (cuDNN, Triton, etc.) is available for NVFP4 on SM120. The is_device_capability_family(120) patch is necessary for backend selection, but it does not fix the underlying kernel issue. For further details or step-by-step reasoning, let me know.
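The configuration above can be sketched as a small launcher helper. This is a minimal sketch, not a verified recipe: the model path is a placeholder, `build_launch` is a hypothetical helper name, and the `--moe-backend marlin` flag and `VLLM_MOE_FORCE_MARLIN` variable are taken from this post as written rather than checked against a particular vLLM release. `vllm serve` and `--tensor-parallel-size` are the standard vLLM CLI entry point and flag.

```python
import os
import shlex

# Environment variables the post lists as required under WSL2.
WSL2_ENV = {
    "NCCL_P2P_DISABLE": "1",
    "NCCL_CUMEM_ENABLE": "0",
    "VLLM_WORKER_MULTIPROC_METHOD": "spawn",
}


def build_launch(model, tensor_parallel_size=4, wsl2=True, use_env_override=False):
    """Build the command and environment for a Marlin-backed launch.

    No speculative decoding (MTP) options are added, matching the advice above.
    """
    env = dict(os.environ)
    if wsl2:
        env.update(WSL2_ENV)

    cmd = ["vllm", "serve", model, "--tensor-parallel-size", str(tensor_parallel_size)]
    if use_env_override:
        # Alternative named in the post: force Marlin through the environment.
        env["VLLM_MOE_FORCE_MARLIN"] = "1"
    else:
        # Flag as quoted in the post.
        cmd += ["--moe-backend", "marlin"]
    return cmd, env


if __name__ == "__main__":
    # Placeholder model path; substitute the real checkpoint location.
    cmd, env = build_launch("path/to/Qwen3.5-397B-A17B-NVFP4")
    print(shlex.join(cmd))
```

To actually start the server, the returned pair could be passed to `subprocess.run(cmd, env=env)`. Keeping command construction separate from execution makes it easy to inspect the exact flags before launching on a multi-GPU machine.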

