I spent a full day optimizing vLLM on 2x RTX PRO 4000 Blackwell (SM120, 24GB GDDR7 each) and want to share what I found because some of this was really painful to discover.
Setup: driver 580.126.20, CUDA 13.0, Ubuntu 24.04 with kernel 6.17. Tested with kaitchup/Qwen3-4B-NVFP4 and Qwen/Qwen3-8B-AWQ across vLLM 0.15.1, 0.17.1, and 0.18.1.

Result: generation went from 36 tok/s to 234 tok/s on the 4B model. Most of the gains came from things that should not have been problems in the first place.
## GPU clocks don’t boost on headless servers
This one cost me hours. After a reboot, both GPUs were sitting at 180 MHz GPU / 405 MHz memory. vLLM was running, accepting requests, and returning results - just incredibly slowly. No errors, no warnings, nothing in the logs.
Turns out the RTX PRO 4000 on a headless server does not auto-boost clocks when CUDA work is submitted. You have to explicitly lock them:
```
nvidia-smi -pm 1
nvidia-smi -i 0 --lock-gpu-clocks=3090,3090
nvidia-smi -i 1 --lock-gpu-clocks=3090,3090
```
That single change took throughput from 36 tok/s to 234 tok/s. 6.5x difference. And it doesn’t survive reboots - you need a systemd service to do this at boot.
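A oneshot unit is enough. A minimal sketch (the unit name is illustrative; adjust GPU indices, clock values, and the nvidia-smi path for your box):

```
# /etc/systemd/system/lock-gpu-clocks.service -- name is illustrative
[Unit]
Description=Enable persistence mode and lock GPU clocks for inference

[Service]
Type=oneshot
RemainAfterExit=yes
# same commands as above
ExecStart=/usr/bin/nvidia-smi -pm 1
ExecStart=/usr/bin/nvidia-smi -i 0 --lock-gpu-clocks=3090,3090
ExecStart=/usr/bin/nvidia-smi -i 1 --lock-gpu-clocks=3090,3090

[Install]
WantedBy=multi-user.target
```

Enable it with `systemctl enable --now lock-gpu-clocks.service`.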
I think vLLM should detect this and warn on startup. Something like “GPU 0 clock speed is 180 MHz, expected >1000 MHz for inference. Run nvidia-smi -pm 1 and lock your clocks.” Would save a lot of people a lot of time.
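Until something like that lands, you can at least verify the lock took effect from a shell; `clocks.sm` is a standard `--query-gpu` field:

```
# warn if GPU 0's SM clock looks stuck at idle (threshold is a guess)
sm_mhz=$(nvidia-smi -i 0 --query-gpu=clocks.sm --format=csv,noheader,nounits)
if [ "$sm_mhz" -lt 1000 ]; then
  echo "WARNING: GPU 0 SM clock is ${sm_mhz} MHz; lock your clocks" >&2
fi
```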
## v0.18.1 is 3x slower than v0.17.1 on SM120
I upgraded to 0.18.1 thinking newer is better. It was 3x slower on the same model and hardware. Had to revert to 0.17.1.
The V1 engine rewrite seems to add overhead that SM100 (datacenter Blackwell with TMEM) can absorb but SM120 can’t. SM120 falls back to Ampere-style execution paths where that extra dispatch cost is proportionally huge.
Related to #18153, which mentions that SM120 GEMM support still needs work.
Is this regression tracked anywhere? I couldn’t find a specific issue for it.
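If you're hitting the same regression, pinning the known-good release is straightforward:

```
# stay on the faster release until the SM120 regression is resolved
pip install "vllm==0.17.1"
```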
## Features that hurt on SM120
These are all fine on datacenter GPUs but actively slow down SM120:
- `--kv-cache-dtype fp8_e5m2`: SM120 auto-detects as FP8-capable, but the dequant overhead costs 10-15% with no benefit at low batch sizes. Set `--kv-cache-dtype auto` instead.
- `--enable-chunked-prefill`: 10-20% slower for single-request workloads. The scheduling overhead isn't worth it without concurrent batching.
- `--speculative-config ngram`: 5-8% overhead on 4B models. The speculation/verification cost exceeds the gains.
- `--enforce-eager`: this one is obvious but worth mentioning - 50-70% throughput loss. CUDA graphs work fine on SM120 with driver 580+.
For SM120 I'd recommend `--attention-backend TRITON_ATTN --kv-cache-dtype auto`, and skipping chunked prefill and speculative decoding for small models.
Should --enable-chunked-prefill maybe be off by default when --max-num-seqs is 1?
## NVFP4 beats AWQ on SM120
Even though SM120 can’t do native W4A4, the compressed-tensors Marlin kernel for NVFP4 is about 17% faster than AWQ Marlin. Smaller weights = less bandwidth pressure, and bandwidth is the bottleneck on SM120.
Tested kaitchup/Qwen3-4B-NVFP4 vs Qwen/Qwen3-4B-AWQ. If you’re on SM120 and using AWQ, try switching.
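If you want to A/B it yourself, something like this should work, assuming your build ships the `vllm bench` CLI (flag names may vary slightly between versions; the lengths here are just examples):

```
# throughput A/B between the two quantizations
vllm bench throughput --model kaitchup/Qwen3-4B-NVFP4 \
  --input-len 512 --output-len 256 --num-prompts 8
vllm bench throughput --model Qwen/Qwen3-4B-AWQ \
  --input-len 512 --output-len 256 --num-prompts 8
```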
## Power throttling
The RTX PRO 4000 is a 140W card. Under sustained inference it pegs at 145W and clocks drop from ~2300 MHz to ~1935 MHz. That's purely power throttling, not thermal (63C), and `nvidia-smi -pl 160` is rejected, so the limit can't be raised. This is just how the card is. Worth knowing if you're planning sustained workloads.
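To watch it happen, stock nvidia-smi is enough (the throttle-reason section is labelled "Clocks Throttle Reasons" or "Clocks Event Reasons" depending on driver version):

```
# log power draw, SM clock, and temperature once a second
nvidia-smi --query-gpu=power.draw,clocks.sm,temperature.gpu --format=csv -l 1
# dump throttle reasons; under power throttling "SW Power Cap" goes Active
nvidia-smi -q -d PERFORMANCE
```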
## What I’m running now
```
vllm serve kaitchup/Qwen3-4B-NVFP4 \
  --quantization compressed-tensors \
  --dtype float16 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 8192 \
  --max-num-seqs 8 \
  --enable-prefix-caching \
  --attention-backend TRITON_ATTN \
  --kv-cache-dtype auto \
  --reasoning-parser qwen3
```
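Quick smoke test against the OpenAI-compatible endpoint (default port 8000):

```
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "kaitchup/Qwen3-4B-NVFP4",
       "messages": [{"role": "user", "content": "Say hello."}],
       "max_tokens": 32}'
```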
234 tok/s generation, 368 tok/s prompt on Qwen3-4B-NVFP4. Not bad for a 140W workstation card, but still well below what the bandwidth should theoretically allow.
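For context on the roofline: single-stream decode has to stream the full weights once per token, so tok/s is bounded by roughly memory bandwidth divided by weight bytes. A back-of-envelope sketch, with both numbers as placeholder assumptions (substitute your card's actual spec and measured weight footprint, then compare against your measured tok/s):

```
# tok/s <~ bandwidth / bytes-per-token; 672 GB/s and 2.25 GB are placeholders
awk 'BEGIN { bw_gbs = 672; gb_per_tok = 2.25;
             printf "roofline ~ %.0f tok/s\n", bw_gbs / gb_per_tok }'
```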
Waiting on SM120 GEMM kernels (#18153), FlashInfer SM120 attention (Dao-AILab/flash-attention#2307), native NVFP4 support (#31085, #33416), and PyTorch cu132 wheels. Happy to test patches if anyone’s working on SM120 support.