I spent a full day optimizing vLLM on 2x RTX PRO 4000 Blackwell (SM120, 24GB GDDR7 each) and want to share what I found because some of this was really painful to discover.
Setup: driver 580.126.20, CUDA 13.0, Ubuntu 24.04 with kernel 6.17. Tested with kaitchup/Qwen3-4B-NVFP4 and Qwen/Qwen3-8B-AWQ across vLLM 0.15.1, 0.17.1, and 0.18.1.

Result: generation went from 36 tok/s to 234 tok/s on the 4B model. Most of the gains came from things that should not have been problems in the first place.
## GPU clocks don’t boost on headless servers
This one cost me hours. After a reboot, both GPUs were sitting at 180 MHz GPU / 405 MHz memory. vLLM was running, accepting requests, and returning results - just incredibly slowly. No errors, no warnings, nothing in the logs.
Turns out the RTX PRO 4000 on a headless server does not auto-boost clocks when CUDA work is submitted. You have to explicitly lock them:
```
nvidia-smi -pm 1
nvidia-smi -i 0 --lock-gpu-clocks=3090,3090
nvidia-smi -i 1 --lock-gpu-clocks=3090,3090
```
That single change took throughput from 36 tok/s to 234 tok/s. 6.5x difference. And it doesn’t survive reboots - you need a systemd service to do this at boot.
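A oneshot unit is enough. A minimal sketch (the unit name is illustrative; adjust GPU indices, clock values, and the nvidia-smi path for your box):

```
# /etc/systemd/system/lock-gpu-clocks.service -- name is illustrative
[Unit]
Description=Enable persistence mode and lock GPU clocks for inference

[Service]
Type=oneshot
RemainAfterExit=yes
# same commands as above
ExecStart=/usr/bin/nvidia-smi -pm 1
ExecStart=/usr/bin/nvidia-smi -i 0 --lock-gpu-clocks=3090,3090
ExecStart=/usr/bin/nvidia-smi -i 1 --lock-gpu-clocks=3090,3090

[Install]
WantedBy=multi-user.target
```

Enable it with `systemctl enable --now lock-gpu-clocks.service`.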
I think vLLM should detect this and warn on startup. Something like “GPU 0 clock speed is 180 MHz, expected >1000 MHz for inference. Run nvidia-smi -pm 1 and lock your clocks.” Would save a lot of people a lot of time.
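Until something like that lands, you can at least verify the lock took effect from a shell; `clocks.sm` is a standard `--query-gpu` field:

```
# warn if GPU 0's SM clock looks stuck at idle (threshold is a guess)
sm_mhz=$(nvidia-smi -i 0 --query-gpu=clocks.sm --format=csv,noheader,nounits)
if [ "$sm_mhz" -lt 1000 ]; then
  echo "WARNING: GPU 0 SM clock is ${sm_mhz} MHz; lock your clocks" >&2
fi
```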
## v0.18.1 is 3x slower than v0.17.1 on SM120
I upgraded to 0.18.1 thinking newer is better. It was 3x slower on the same model and hardware. Had to revert to 0.17.1.
The V1 engine rewrite seems to add overhead that SM100 (datacenter Blackwell with TMEM) can absorb but SM120 can’t. SM120 falls back to Ampere-style execution paths where that extra dispatch cost is proportionally huge.
Related to #18153, which mentions that SM120 GEMM support still needs work.
Is this regression tracked anywhere? I couldn’t find a specific issue for it.
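If you're hitting the same regression, pinning the known-good release is straightforward:

```
# stay on the faster release until the SM120 regression is resolved
pip install "vllm==0.17.1"
```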
## Features that hurt on SM120
These are all fine on datacenter GPUs but actively slow down SM120:
- `--kv-cache-dtype fp8_e5m2`: SM120 auto-detects as FP8-capable, but the dequant overhead costs 10-15% with no benefit at low batch sizes. Set `--kv-cache-dtype auto` instead.
- `--enable-chunked-prefill`: 10-20% slower for single-request workloads. The scheduling overhead isn't worth it without concurrent batching.
- `--speculative-config ngram`: 5-8% overhead on 4B models. The speculation/verification cost exceeds the gains.
- `--enforce-eager`: this one is obvious but worth mentioning - 50-70% throughput loss. CUDA graphs work fine on SM120 with driver 580+.
For SM120 I'd recommend `--attention-backend TRITON_ATTN --kv-cache-dtype auto`, and skipping chunked prefill and speculative decoding for small models.
Should --enable-chunked-prefill maybe be off by default when --max-num-seqs is 1?
## NVFP4 beats AWQ on SM120
Even though SM120 can’t do native W4A4, the compressed-tensors Marlin kernel for NVFP4 is about 17% faster than AWQ Marlin. Smaller weights = less bandwidth pressure, and bandwidth is the bottleneck on SM120.
Tested kaitchup/Qwen3-4B-NVFP4 vs Qwen/Qwen3-4B-AWQ. If you’re on SM120 and using AWQ, try switching.
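If you want to A/B it yourself, something like this should work, assuming your build ships the `vllm bench` CLI (flag names may vary slightly between versions; the lengths here are just examples):

```
# throughput A/B between the two quantizations
vllm bench throughput --model kaitchup/Qwen3-4B-NVFP4 \
  --input-len 512 --output-len 256 --num-prompts 8
vllm bench throughput --model Qwen/Qwen3-4B-AWQ \
  --input-len 512 --output-len 256 --num-prompts 8
```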
## Power throttling
The RTX PRO 4000 is a 140W card. Under sustained inference it pegs at 145W and clocks drop from ~2300 MHz to ~1935 MHz. That's purely power throttling, not thermal (63C), and `nvidia-smi -pl 160` is rejected, so the limit can't be raised. This is just how the card is. Worth knowing if you're planning sustained workloads.
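To watch it happen, stock nvidia-smi is enough (the throttle-reason section is labelled "Clocks Throttle Reasons" or "Clocks Event Reasons" depending on driver version):

```
# log power draw, SM clock, and temperature once a second
nvidia-smi --query-gpu=power.draw,clocks.sm,temperature.gpu --format=csv -l 1
# dump throttle reasons; under power throttling "SW Power Cap" goes Active
nvidia-smi -q -d PERFORMANCE
```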
## What I’m running now
```
vllm serve kaitchup/Qwen3-4B-NVFP4 \
  --quantization compressed-tensors \
  --dtype float16 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 8192 \
  --max-num-seqs 8 \
  --enable-prefix-caching \
  --attention-backend TRITON_ATTN \
  --kv-cache-dtype auto \
  --reasoning-parser qwen3
```
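Quick smoke test against the OpenAI-compatible endpoint (default port 8000):

```
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "kaitchup/Qwen3-4B-NVFP4",
       "messages": [{"role": "user", "content": "Say hello."}],
       "max_tokens": 32}'
```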
234 tok/s generation, 368 tok/s prompt on Qwen3-4B-NVFP4. Not bad for a 140W workstation card, but still well below what the bandwidth should theoretically allow.
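For context on the roofline: single-stream decode has to stream the full weights once per token, so tok/s is bounded by roughly memory bandwidth divided by weight bytes. A back-of-envelope sketch, with both numbers as placeholder assumptions (substitute your card's actual spec and measured weight footprint, then compare against your measured tok/s):

```
# tok/s <~ bandwidth / bytes-per-token; 672 GB/s and 2.25 GB are placeholders
awk 'BEGIN { bw_gbs = 672; gb_per_tok = 2.25;
             printf "roofline ~ %.0f tok/s\n", bw_gbs / gb_per_tok }'
```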
Waiting on SM120 GEMM kernels (#18153), FlashInfer SM120 attention (Dao-AILab/flash-attention#2307), native NVFP4 support (#31085, #33416), and PyTorch cu132 wheels. Happy to test patches if anyone’s working on SM120 support.