# SM120 (RTX PRO 6000) NVFP4 MoE Performance Report -- Qwen3.5-397B

Summary:
On SM120 (RTX PRO 6000), the only reliable and performant backend for Qwen3.5-397B-A17B-NVFP4 is Marlin W4A16 (FP4 weights dequantized to FP16), which yields ~50.5 tok/s with TP=4 and no MTP. All native NVFP4 MoE backends built on CUTLASS or FlashInfer fail on SM120 because the TMA Warp Specialized grouped GEMM kernels hit initialization errors, producing either garbage output or severe performance degradation. MTP speculative decoding is counterproductive with Marlin: an activation distribution mismatch causes a 22% throughput drop. The data sources contain no evidence that native NVFP4 MoE (as opposed to Marlin) works correctly on SM120, or that any community fork sustains >50 tok/s decode with correct output using native FP4 compute. The root cause is a CUTLASS kernel bug tracked in CUTLASS #3096.
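To put the MTP penalty in absolute terms, here is a small arithmetic sketch. It assumes the 22% drop is measured relative to the ~50.5 tok/s Marlin baseline (the summary does not state the reference point explicitly), so the result is an estimate, not a measured figure.

```python
# Figures quoted in the summary above.
BASELINE_TOK_S = 50.5      # Marlin W4A16, TP=4, no MTP
MTP_RELATIVE_DROP = 0.22   # assumption: drop is relative to the baseline above


def throughput_after_drop(baseline: float, drop: float) -> float:
    """Return the throughput left after a relative drop (0.22 means 22%)."""
    return baseline * (1.0 - drop)


mtp_tok_s = throughput_after_drop(BASELINE_TOK_S, MTP_RELATIVE_DROP)
print(f"Estimated decode rate with MTP enabled: {mtp_tok_s:.1f} tok/s")
```

Under that assumption, enabling MTP would land at roughly 39.4 tok/s.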

Configuration:
Use Marlin via --moe-backend marlin or export VLLM_MOE_FORCE_MARLIN=1, and do not enable MTP. On WSL2, the following environment variables are required: NCCL_P2P_DISABLE=1, NCCL_CUMEM_ENABLE=0, VLLM_WORKER_MULTIPROC_METHOD=spawn. According to the current data, no alternative MoE kernel backend (cuDNN, Triton, etc.) is available for NVFP4 on SM120. The is_device_capability_family(120) patch is necessary for backend selection, but it does not fix the underlying kernel issue.
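The settings above can be collected into a launch helper. This is a minimal sketch, not a verified recipe: the flag and environment variable names are taken verbatim from this report, `build_launch` and the model path are hypothetical names introduced for illustration, and it assumes the server is started with `vllm serve` and `--tensor-parallel-size`. No speculative decoding options are passed, so MTP stays off.

```python
import os
import shlex

# Environment variables the report lists as required on WSL2.
WSL2_ENV = {
    "NCCL_P2P_DISABLE": "1",
    "NCCL_CUMEM_ENABLE": "0",
    "VLLM_WORKER_MULTIPROC_METHOD": "spawn",
}


def build_launch(model, tp_size=4, on_wsl2=True, force_via_env=False, base_env=None):
    """Build (env, argv) for a Marlin-backed launch. Hypothetical helper.

    force_via_env=False selects Marlin with the CLI flag quoted in the report;
    force_via_env=True uses the VLLM_MOE_FORCE_MARLIN variable instead.
    """
    env = dict(os.environ if base_env is None else base_env)
    if on_wsl2:
        env.update(WSL2_ENV)

    argv = ["vllm", "serve", model, "--tensor-parallel-size", str(tp_size)]
    if force_via_env:
        env["VLLM_MOE_FORCE_MARLIN"] = "1"
    else:
        argv += ["--moe-backend", "marlin"]
    return env, argv


if __name__ == "__main__":
    # Placeholder path: substitute the actual checkpoint location.
    env, argv = build_launch("path/to/Qwen3.5-397B-A17B-NVFP4")
    print(shlex.join(argv))
    # To start the server: subprocess.run(argv, env=env, check=True)
```

Keeping the environment and argument list as plain data makes it easy to print and inspect the exact command before launching, which helps when comparing runs across native Linux and WSL2.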

