Why does a larger max_num_batched_tokens lead to less available KV cache memory?

A30 (24G VRAM) + vllm-0.18.0 + Qwen3.5-4B
`vllm serve /dev/shm/default_model --served-model-name default_model --enable-prompt-tokens-details --enable-mfu-metrics --max-num-batched-tokens=80K --max-num-seqs 128 --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_xml --no-enable-prefix-caching --port 8000 --host :: --language-model-only`

| max-num-batched-tokens | Available KV cache memory | Max concurrency for 262,144 tokens per request | nvidia-smi |
|---|---|---|---|
| 80K | 4.53 GiB | 0.56x | 14628MiB / 24576MiB |
| 64K | 6.24 GiB | 0.77x | 16450MiB / 24576MiB |
| 32K | 9.64 GiB | 1.20x | 20190MiB / 24576MiB |
| 16K | 11.32 GiB | 1.40x | 21232MiB / 24576MiB |

Why does a larger max_num_batched_tokens lead to less available KV cache memory? And is there a way to calculate how much the memory decreases when max_num_batched_tokens changes?

A larger max_num_batched_tokens reduces available KV cache memory because vLLM reserves more GPU memory for activations during prefill. At engine startup, vLLM runs a profiling forward pass sized to max_num_batched_tokens to measure peak activation memory, and the KV cache is allocated from whatever remains of the gpu_memory_utilization budget after the model weights and that activation peak. The tradeoff: a higher max_num_batched_tokens allows larger prefill batches (potentially better throughput for prompt-heavy workloads), at the cost of a smaller KV cache and therefore lower maximum concurrency or shorter effective context per request. This is a direct result of vLLM statically partitioning GPU memory between activations and the KV cache at startup.
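For intuition, here is a minimal sketch of that budgeting step (the function name and the plugged-in numbers are illustrative assumptions, not vLLM's exact accounting; vLLM measures the activation peak with a real profiling run rather than taking it as an input):

```python
def kv_cache_budget_gib(
    total_vram_gib: float,            # e.g. 24.0 for an A30
    gpu_memory_utilization: float,    # vLLM's --gpu-memory-utilization (default 0.9)
    weights_and_overhead_gib: float,  # model weights plus runtime overhead
    peak_activation_gib: float,       # what the startup profiling pass measures by
                                      # running a batch of max_num_batched_tokens
) -> float:
    """KV cache gets whatever remains of the usable VRAM after the model
    weights and one max-size prefill batch's activations."""
    usable_gib = total_vram_gib * gpu_memory_utilization
    return usable_gib - weights_and_overhead_gib - peak_activation_gib

# Back-solving from the 16K run above (illustrative, not exact):
# 24 * 0.9 - 8.6 - 1.7 ~= 11.3 GiB, close to the reported 11.32 GiB.
print(kv_cache_budget_gib(24.0, 0.9,
                          weights_and_overhead_gib=8.6,
                          peak_activation_gib=1.7))
```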

There is no exact formula in the documentation for how much memory is reserved for activations as max_num_batched_tokens increases, because the peak is measured empirically at startup and depends on model architecture and kernel implementation details. In practice, though, activation memory grows roughly linearly with max_num_batched_tokens, so the KV cache shrinks linearly as well; your own measurements bear this out, losing about 1.7 GiB of KV cache for every additional 16K batched tokens. To find a good value, test a few settings and compare the "Available KV cache memory" reported at startup, as you have done, or benchmark throughput on a representative workload.
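As a sanity check on that linearity, this small script computes the per-token slope from the four measurements in the table above (pure arithmetic on the reported numbers, assuming "K" means 1024; no vLLM internals involved):

```python
points = [  # (max_num_batched_tokens, available KV cache in GiB), from the table above
    (80 * 1024, 4.53),
    (64 * 1024, 6.24),
    (32 * 1024, 9.64),
    (16 * 1024, 11.32),
]

# Slope of KV cache lost per extra batched token, between adjacent settings.
for (hi_tok, hi_kv), (lo_tok, lo_kv) in zip(points, points[1:]):
    slope_mib = (lo_kv - hi_kv) * 1024 / (hi_tok - lo_tok)
    print(f"{lo_tok // 1024}K -> {hi_tok // 1024}K: "
          f"{slope_mib:.3f} MiB of KV cache per extra token")
```

All three slopes land at roughly 0.105–0.107 MiB per token, i.e. about 1.7 GiB of KV cache per additional 16K batched tokens, which is consistent with the linear activation-memory explanation above.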

Would you like a step-by-step breakdown of the memory calculation or guidance on tuning for your workload?
