Why does a larger max_num_batched_tokens lead to less available KV cache memory?

A30 (24G VRAM) + vllm-0.18.0 + Qwen3.5-4B
`vllm serve /dev/shm/default_model --served-model-name default_model --enable-prompt-tokens-details --enable-mfu-metrics --max-num-batched-tokens=80K --max-num-seqs 128 --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_xml --no-enable-prefix-caching --port 8000 --host :: --language-model-only`

| max-num-batched-tokens | Available KV cache memory | Max concurrency for 262,144 tokens per request | nvidia-smi |
|---|---|---|---|
| 80K | 4.53 GiB | 0.56x | 14628MiB / 24576MiB |
| 64K | 6.24 GiB | 0.77x | 16450MiB / 24576MiB |
| 32K | 9.64 GiB | 1.20x | 20190MiB / 24576MiB |
| 16K | 11.32 GiB | 1.40x | 21232MiB / 24576MiB |

Why does a larger max_num_batched_tokens lead to less available KV cache memory? And is there a way to calculate how much the memory decreases when max_num_batched_tokens changes?

A larger max_num_batched_tokens reduces available KV cache memory because vLLM reserves more GPU memory for activations during prefill. At engine startup, vLLM runs a profiling forward pass sized to max_num_batched_tokens to measure peak activation memory, and the KV cache is allocated from whatever remains of the gpu_memory_utilization budget after the model weights and that activation peak. The tradeoff: a higher max_num_batched_tokens allows larger prefill batches (potentially better throughput for prompt-heavy workloads), at the cost of a smaller KV cache and therefore lower maximum concurrency or shorter effective context per request. This is a direct result of vLLM statically partitioning GPU memory between activations and the KV cache at startup.
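For intuition, here is a minimal sketch of that budgeting step (the function name and the plugged-in numbers are illustrative assumptions, not vLLM's exact accounting; vLLM measures the activation peak with a real profiling run rather than taking it as an input):

```python
def kv_cache_budget_gib(
    total_vram_gib: float,            # e.g. 24.0 for an A30
    gpu_memory_utilization: float,    # vLLM's --gpu-memory-utilization (default 0.9)
    weights_and_overhead_gib: float,  # model weights plus runtime overhead
    peak_activation_gib: float,       # what the startup profiling pass measures by
                                      # running a batch of max_num_batched_tokens
) -> float:
    """KV cache gets whatever remains of the usable VRAM after the model
    weights and one max-size prefill batch's activations."""
    usable_gib = total_vram_gib * gpu_memory_utilization
    return usable_gib - weights_and_overhead_gib - peak_activation_gib

# Back-solving from the 16K run above (illustrative, not exact):
# 24 * 0.9 - 8.6 - 1.7 ~= 11.3 GiB, close to the reported 11.32 GiB.
print(kv_cache_budget_gib(24.0, 0.9,
                          weights_and_overhead_gib=8.6,
                          peak_activation_gib=1.7))
```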

There is no exact formula in the documentation for how much memory is reserved for activations as max_num_batched_tokens increases, because the peak is measured empirically at startup and depends on model architecture and kernel implementation details. In practice, though, activation memory grows roughly linearly with max_num_batched_tokens, so the KV cache shrinks linearly as well; your own measurements bear this out, losing about 1.7 GiB of KV cache for every additional 16K batched tokens. To find a good value, test a few settings and compare the "Available KV cache memory" reported at startup, as you have done, or benchmark throughput on a representative workload.
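As a sanity check on that linearity, this small script computes the per-token slope from the four measurements in the table above (pure arithmetic on the reported numbers, assuming "K" means 1024; no vLLM internals involved):

```python
points = [  # (max_num_batched_tokens, available KV cache in GiB), from the table above
    (80 * 1024, 4.53),
    (64 * 1024, 6.24),
    (32 * 1024, 9.64),
    (16 * 1024, 11.32),
]

# Slope of KV cache lost per extra batched token, between adjacent settings.
for (hi_tok, hi_kv), (lo_tok, lo_kv) in zip(points, points[1:]):
    slope_mib = (lo_kv - hi_kv) * 1024 / (hi_tok - lo_tok)
    print(f"{lo_tok // 1024}K -> {hi_tok // 1024}K: "
          f"{slope_mib:.3f} MiB of KV cache per extra token")
```

All three slopes land at roughly 0.105–0.107 MiB per token, i.e. about 1.7 GiB of KV cache per additional 16K batched tokens, which is consistent with the linear activation-memory explanation above.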

Would you like a step-by-step breakdown of the memory calculation or guidance on tuning for your workload?
