# SM120 (RTX PRO 6000) NVFP4 MoE Performance Report – Qwen3.5-397B
## TL;DR
Best sustained single-user decode on `nvidia/Qwen3.5-397B-A17B-NVFP4` with 4x RTX PRO 6000 (SM120): **50.5 tok/s** using Marlin W4A16 fallback, TP=4, no MTP, no speculative decoding. The native CUTLASS NVFP4 MoE path is broken on SM120 due to TMA WS grouped GEMM initialization failures. MTP actively hurts performance on the Marlin path (-22%). I have submitted PRs to both FlashInfer and vLLM.
---
## What Works and What Doesn’t on SM120
### Works
- **Marlin W4A16 MoE backend** (TP=4): 50.5 tok/s sustained decode, correct output
- **FLASHINFER attention backend**: Stable with FP8 KV cache
- **CUDA graphs** (`enforce-eager=False`): Measurable improvement over eager mode
- **torch.compile**: Compatible, minor improvement
- **Prefix caching + chunked prefill**: Both work correctly
### Broken
- **FlashInfer CUTLASS NVFP4 MoE path**: All 80 TMA Warp Specialized grouped GEMM tactics fail at initialization. Falls back to slow non-TMA tactics producing garbage output (6-7 tok/s) or degraded throughput (40-41 tok/s when partially working via Docker images with SM120a/120f patches)
- **vLLM native CUTLASS MoE backend** (`VLLM_CUTLASS`): ~5 tok/s with garbage output
- **SGLang 0.5.8**: Produces NaN outputs on SM120
- **TensorRT-LLM v1.1.0**: `qwen3_5_moe` architecture not supported
- **Expert Parallel (EP=2)**: Catastrophic on PCIe – 1.4-2.6 tok/s
- **FlashInfer sampler**: 8.6x regression (5.9 tok/s vs 50.5 tok/s with default sampler)
### Counterproductive
- **MTP (Multi-Token Prediction)**: -22% throughput on Marlin path. Acceptance rates 61-85% vs 89% community baseline. Root cause: MTP draft heads were trained on native FP4 activations, but Marlin’s W4A16 dequantization produces numerically different values, causing mispredictions.
---
## Best Configuration
```bash
# Force the Marlin MoE backend before launching (critical: without this, SM120
# selects the CUTLASS path, which produces garbage output).
export VLLM_MOE_FORCE_MARLIN=1
# Alternative: --moe-backend marlin (but this crashed on some nightly builds).

vllm serve nvidia/Qwen3.5-397B-A17B-NVFP4 \
  --tensor-parallel-size 4 \
  --max-model-len 262144 \
  --gpu-memory-utilization 0.95 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --kv-cache-dtype fp8_e4m3 \
  --calculate-kv-scales \
  --compilation-config '{"level": 3}' \
  --host 0.0.0.0 --port 9100
```
Environment variables required for WSL2 multi-GPU:
```bash
export NCCL_P2P_DISABLE=1
export NCCL_CUMEM_ENABLE=0
export VLLM_WORKER_MULTIPROC_METHOD=spawn
```
---
## Full Benchmark Table
| # | Configuration | MoE Backend | TP | MTP | tok/s | Notes |
|---|---------------|-------------|-----|-----|-------|-------|
| 1 | Marlin TP=2+PP=2 | Marlin W4A16 | 2+PP2 | No | 49 | Correct output |
| 2 | orthozany Docker + MTP=2 | FlashInfer CUTLASS (120a) | 4 | Yes | 40 | 80 TMA WS tactics skipped |
| 3 | orthozany Docker + MTP=2 (warmed) | FlashInfer CUTLASS (120a) | 4 | Yes | 41 | 80 TMA WS tactics skipped |
| 4 | Festr Docker (120a) | FlashInfer CUTLASS (120a) | 4 | Yes | 26 | 80 TMA WS tactics skipped |
| 5 | Festr Docker + 120f | FlashInfer CUTLASS (120f) | 4 | Yes | 41 | 80 TMA WS tactics skipped |
| 6 | Festr Docker + GDC + 120f | FlashInfer CUTLASS (120f+GDC) | 4 | Yes | 41 | 80 TMA WS tactics skipped |
| 7 | VLLM_CUTLASS split backend | vLLM native CUTLASS | 4 | Yes | ~5 | Garbage output |
| 8 | Marlin via --moe-backend | Marlin W4A16 | 4 | Yes | CRASH | Unquantized MoE error |
| 9 | Marlin via FORCE_FP8_MARLIN | Marlin W4A16 | 4 | Yes | 44 | Low MTP acceptance |
| 10 | Pip nightly + Marlin + MTP=2 | Marlin W4A16 | 4 | Yes | 39-40 | MTP hurts (-22%) |
| **11** | **Pip nightly + Marlin + NO MTP** | **Marlin W4A16** | **4** | **No** | **50.5** | **Best config** |
| 12 | SGLang 0.5.8 | FlashInfer | 4 | – | NaN | NaN outputs |
| 13 | TP=4 default (no Marlin force) | FlashInfer CUTLASS | 4 | No | 6-7 | Garbage output |
| 14 | TP=2 + EP=2 | Marlin W4A16 | 2+EP2 | No | 1.4-2.6 | EP catastrophic on PCIe |
| 15 | TensorRT-LLM v1.1.0 | – | – | – | N/A | Arch not supported |
| 16 | FlashInfer Sampler | Marlin W4A16 | 4 | No | 5.9 | 8.6x regression |
---
## Marlin vs CUTLASS on SM120
This is the core issue. On SM120:
- **Marlin W4A16**: Works correctly. Dequantizes FP4 weights to FP16, runs standard FP16 GEMM. 50.5 tok/s. No native FP4 compute – it is a fallback path that wastes half the theoretical throughput.
- **FlashInfer CUTLASS NVFP4**: Should use native FP4 tensor cores via TMA Warp Specialized grouped GEMM. On SM120, all 80 fast tactics fail at initialization with `Error Internal` on the generated CUTLASS kernel files. The autotuner falls through to non-TMA tactics which either produce garbage or run at 40 tok/s (still slower than Marlin, because the slow CUTLASS fallback path has higher overhead than Marlin’s simpler approach).
- **vLLM native CUTLASS**: Even worse – ~5 tok/s with garbage output. The `VLLM_CUTLASS` backend appears to have different kernel selection logic that picks even slower tactics.
The irony: Marlin, the “dumb” fallback that doesn’t even use FP4 tensor cores, outperforms every CUTLASS configuration because the CUTLASS fast path is broken.
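For intuition, here is a minimal PyTorch sketch of what a W4A16-style path does arithmetically. This is not Marlin's kernel or its packing layout (Marlin fuses decode, scale, and GEMM on the GPU, and NVFP4 stores e4m3 block scales rather than the fp16 scales assumed here); it only illustrates why the compute ends up as an ordinary fp16 GEMM with no FP4 tensor cores involved:
```python
import torch

# e2m1 (FP4) code points used by NVFP4; the sign bit is the MSB of each 4-bit code.
E2M1_LUT = torch.tensor(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0], dtype=torch.float16)

def w4a16_matmul(x_fp16, w_packed, scales, group_size=16):
    """Conceptual W4A16 path: decode FP4 codes to fp16, apply per-group scales,
    then run a plain fp16 GEMM. Marlin fuses all of this into one kernel; this
    sketch only shows the arithmetic, not the layout it actually uses.
    w_packed: (out_features, in_features // 2) uint8, two FP4 codes per byte.
    scales:   (out_features, in_features // group_size) fp16 per-group scales.
    """
    lo = (w_packed & 0x0F).long()          # low nibble of each byte
    hi = (w_packed >> 4).long()            # high nibble of each byte
    codes = torch.stack((lo, hi), dim=-1).reshape(w_packed.shape[0], -1)
    lut = E2M1_LUT.to(w_packed.device)
    w_fp16 = lut[codes] * scales.repeat_interleave(group_size, dim=1)
    # The actual compute is an ordinary fp16 GEMM: FP4 tensor cores are never used.
    return x_fp16 @ w_fp16.t()
```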
---
## The MTP Problem
MTP acceptance rates on Marlin are significantly lower than expected:
| Config | Without MTP | With MTP=2 | Acceptance Rate | Delta |
|--------|-------------|------------|-----------------|-------|
| Marlin TP=4 | 50.5 tok/s | 39.6 tok/s | 61-85% | -22% |
The MTP draft heads in Qwen3.5-397B were trained on native FP4 activations. Marlin’s W4A16 dequantization path produces subtly different activation values, causing the draft heads to mispredict at a rate high enough that the speculative execution overhead exceeds the benefit.
This means MTP is only useful if the native CUTLASS NVFP4 path works – which it doesn’t on SM120. There is no workaround within vLLM today.
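As a back-of-the-envelope check, a toy speculative-decoding model (assuming MTP=2 means two draft tokens per step, and that the reported acceptance rate is a per-draft-token probability with independent acceptances) shows how expensive an MTP step must be to explain the measured slowdown:
```python
def implied_step_cost(alpha, k, tok_s_mtp, tok_s_plain):
    """Toy speculative-decoding model: with k draft tokens and independent
    per-token acceptance probability alpha, each verification step emits on
    average 1 + alpha + ... + alpha**k output tokens. Given the measured
    throughputs, back out the implied cost of one MTP step relative to one
    plain decode step."""
    expected_tokens = sum(alpha ** i for i in range(k + 1))
    return expected_tokens * tok_s_plain / tok_s_mtp

for alpha in (0.61, 0.85):
    cost = implied_step_cost(alpha, k=2, tok_s_mtp=39.6, tok_s_plain=50.5)
    print(f"acceptance {alpha:.0%}: implied MTP step cost ~{cost:.1f}x a plain decode step")
```
Under these assumptions the 39.6 vs 50.5 tok/s numbers imply each MTP step costs roughly 2.5-3.3x a plain decode step; whatever the exact split between draft-head overhead and mispredictions, the net is negative on the Marlin path.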
---
## Patches Required (12 Total)
### FlashInfer (7 patches) – submitted as [PR #2725](https://github.com/flashinfer-ai/flashinfer/pull/2725)
1. `trtllm_fused_moe_kernel_launcher.cu` – relax SM version check from `ICHECK_EQ(major, 10)` to `ICHECK_GE(major, 10)`
2. `compilation_context.py` – SM120 `compute_120f` suffix
3. `jit/fused_moe.py` – Add major version 12 to supported versions
4. `jit/fused_moe.py` – GDC compile flags
5. `jit/gemm/core.py` – GDC flags for dense GEMM
6. `cutlass/python/CuTeDSL/` – `sm_120a` in `admissible_archs` (5 files, 18 locations)
7. `cutlass/python/CuTeDSL/base_dsl/runtime/cuda.py` – `(12, 0)` device mapping
### vLLM (5 patches) – submitted as [PR #36453](https://github.com/vllm-project/vllm/pull/36453)
8-12. Add `is_device_capability_family(120)` checks to MoE backend selection in 5 files, so SM120 is recognized as NVFP4-capable and routes to the correct backend.
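The shape of these vLLM changes is just widening the capability check. A schematic of the pattern (only `is_device_capability_family` comes from the PR; the surrounding names here are illustrative, not the actual vLLM code):
```python
# Schematic only: before, just the SM100 family (B200 etc.) was treated as
# NVFP4-capable; after, the SM120 family (RTX PRO 6000 / RTX 5090) is accepted
# too, so it routes to the FlashInfer NVFP4 MoE backend instead of erroring out.
def supports_flashinfer_nvfp4_moe(platform) -> bool:
    return (platform.is_device_capability_family(100)
            or platform.is_device_capability_family(120))
```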
**Important note for reviewers**: These patches are necessary but not sufficient. They get the CUTLASS path to compile and attempt to run on SM120, but the underlying CUTLASS TMA WS grouped GEMM kernels still fail at runtime. The real fix needs to come from NVIDIA in the CUTLASS library itself ([issue #3096](https://github.com/NVIDIA/cutlass/issues/3096)).
---
## Inflated Claims
A community member (aabbccddwasd) has claimed 130-150 tok/s on 4x RTX PRO 6000 via custom SGLang/vLLM forks with MTP. I have reviewed both forks and found **zero kernel-level changes** – they use the same CUTLASS fallback with TMA WS failing. The forks contain only Python-level modifications (quantization config, attention registry, MTP state management).
The 130+ tok/s numbers likely include speculative token counting (counting proposed-then-rejected MTP tokens) or are burst measurements rather than sustained decode. My 50.5 tok/s is measured as actual output tokens delivered to the client over a 1000-token generation.
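For reference, sustained decode here means completed output tokens divided by wall-clock time for the whole request, measured roughly like the sketch below (against vLLM's OpenAI-compatible endpoint; the prompt and exact accounting are simplified, and prefill time is included, which slightly understates pure decode speed):
```python
import time
from openai import OpenAI

# Count only tokens actually returned to the client, over wall-clock time.
client = OpenAI(base_url="http://localhost:9100/v1", api_key="none")

start = time.time()
resp = client.chat.completions.create(
    model="nvidia/Qwen3.5-397B-A17B-NVFP4",
    messages=[{"role": "user", "content": "Write a long essay about GPUs."}],
    max_tokens=1000,
)
elapsed = time.time() - start
print(f"{resp.usage.completion_tokens / elapsed:.1f} tok/s sustained")
```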
---
## Upstream Issues
- [CUTLASS #3096](https://github.com/NVIDIA/cutlass/issues/3096) – SM120 TMA WS grouped GEMM failure (no NVIDIA response)
- [CUTLASS #2800](https://github.com/NVIDIA/cutlass/issues/2800) – BlockScaledMmaOp restricts FP4 to sm_100a
- [DeepGEMM #236](https://github.com/deepseek-ai/DeepGEMM/issues/236) – SM120 not supported
- [TensorRT-LLM #11368](https://github.com/NVIDIA/TensorRT-LLM/issues/11368) – FP4 CUTLASS GEMM shared-memory overflow on SM121
- [vLLM #35566](https://github.com/vllm-project/vllm/issues/35566) – CUDA illegal memory access in MoE layer on SM120
---
## Questions for the vLLM Community
1. Has anyone gotten native NVFP4 MoE (not Marlin fallback) working correctly on SM120? If so, what CUTLASS version, CUDA version, and FlashInfer build?
2. Is there a way to retrain or fine-tune MTP draft heads to match Marlin W4A16 activation distributions, so MTP can be useful on the Marlin path?
3. Are there alternative MoE kernel backends being developed (cuDNN grouped GEMM, Triton MoE with NVFP4, custom tile configs) that might bypass the CUTLASS issue?
4. For the vLLM maintainers reviewing [PR #36453](https://github.com/vllm-project/vllm/pull/36453): the `is_device_capability_family(120)` checks are modeled on the existing `is_device_capability_family(100)` pattern. Should SM120 use a separate code path given the known CUTLASS issues, or should it share the SM100 path with a runtime warning?