On prefill-only, kv_cache_usage_perc reaches max 0.32?

biranofer · February 3, 2026, 2:04pm

I would appreciate input on the phenomena described in:

github.com/llm-d-incubation/workload-variant-autoscaler

Better understanding of the kv_cache_usage_perc and num_requests_running metrics can be used to refine the saturation-based algorithm

opened 09:59AM - 28 Jan 26 UTC

needs-triage

The KV cache usage metric shows unexpected behavior in some scenarios. E.g., in a prefill-only experiment:

- GPU: H100
- Input tokens per request: 6000
- Output tokens: 2
- Requests per second: 8
- Max-Tokens-in-Batch: 65000

[Image: graph of the prefill-only experiment]

The graph above shows that we are clearly in a saturated state (num_requests_waiting grows constantly), num_requests_running stabilizes at ~20, and kv_cache_usage_perc stabilizes at just ~0.32 (32%), so in this scenario the KV cache metric will not trigger a saturation indication.

An interesting question is how num_requests_running got to 20: with 6000 tokens per request, that sums to 120000 tokens, well above the 65000 Max-Tokens-in-Batch. This is with vLLM 14.0, and differs from the behavior in, e.g., vLLM 11.0.

When running the same experiment with 200 output tokens per request (i.e., combining prefill and decode work), we get this result:

[Image: graph of the combined prefill and decode experiment]

Here num_requests_running got to ~60. This is because a decode request counts as just 1 token in the batch, so many more requests can be scheduled. This time kv_cache_usage_perc reaches almost 1 (100%).

A better understanding of these metrics could be used to refine the saturation-based algorithm, enabling better prediction and proactive scaling.
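The token arithmetic in the issue can be checked with a short sketch. It only restates the numbers quoted above; the variable names are illustrative, and it does not model how the vLLM scheduler actually admits requests.

```python
# Numbers quoted in the issue for the prefill-only experiment.
INPUT_TOKENS_PER_REQUEST = 6000
MAX_TOKENS_IN_BATCH = 65000
OBSERVED_RUNNING_REQUESTS = 20  # approximate plateau of num_requests_running

# Total prompt tokens across all running requests.
total_prompt_tokens = OBSERVED_RUNNING_REQUESTS * INPUT_TOKENS_PER_REQUEST

# How many complete 6000-token prompts fit within one batch token budget.
full_prompts_per_batch = MAX_TOKENS_IN_BATCH // INPUT_TOKENS_PER_REQUEST

# Ratio of running prompt tokens to the per-batch budget.
budget_ratio = total_prompt_tokens / MAX_TOKENS_IN_BATCH

print(f"total prompt tokens:    {total_prompt_tokens}")
print(f"full prompts per batch: {full_prompts_per_batch}")
print(f"ratio to batch budget:  {budget_ratio:.2f}")
```

The observed ~20 running requests is roughly twice the number of full prompts that fit in a single batch budget, which is the discrepancy the issue asks about.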

RunLLM · February 3, 2026, 2:05pm

The phenomena described in the issue "Better understanding of the kv_cache_usage_perc and num_requests_running metrics can be used to refine the saturation-based algorithm" (Issue #662, llm-d-incubation/workload-variant-autoscaler) involve unexpected behavior of the kv_cache_usage_perc and num_requests_running metrics in vLLM under certain load conditions. Specifically, even when the system is saturated (num_requests_waiting grows and num_requests_running stabilizes), kv_cache_usage_perc may remain low (e.g., ~32%), so cache usage alone may not reliably signal server saturation. This suggests that autoscaling algorithms should consider both metrics together, not just KV cache usage, to accurately detect and respond to saturation.
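As a rough illustration of combining the signals, here is a minimal sketch of a saturation check over a window of metric samples. The `MetricSample` type, the `is_saturated` function, and all thresholds are hypothetical and chosen only for illustration; this is not the workload-variant-autoscaler's algorithm. It flags saturation when KV cache usage is high, or when the waiting queue keeps growing while the number of running requests has plateaued.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class MetricSample:
    """One scrape of the three metrics discussed in the issue."""
    kv_cache_usage_perc: float   # fraction in [0, 1]
    num_requests_running: int
    num_requests_waiting: int


def is_saturated(
    samples: Sequence[MetricSample],
    kv_threshold: float = 0.8,       # hypothetical threshold
    running_tolerance: int = 2,      # hypothetical plateau tolerance
) -> bool:
    """Return True if the window of samples looks saturated."""
    if not samples:
        return False

    # Signal 1: KV cache usage alone is high in the latest sample.
    if samples[-1].kv_cache_usage_perc >= kv_threshold:
        return True

    # The queue-based signal needs at least two samples to see a trend.
    if len(samples) < 2:
        return False

    # Signal 2: the waiting queue grows at every step of the window ...
    waiting = [s.num_requests_waiting for s in samples]
    queue_growing = all(b > a for a, b in zip(waiting, waiting[1:]))

    # ... while the number of running requests stays roughly flat.
    running = [s.num_requests_running for s in samples]
    running_plateaued = max(running) - min(running) <= running_tolerance

    return queue_growing and running_plateaued


# Samples shaped like the prefill-only experiment: low KV usage, growing queue.
prefill_only = [
    MetricSample(0.32, 20, 10),
    MetricSample(0.32, 20, 25),
    MetricSample(0.32, 21, 40),
]
# Samples for a lightly loaded server: low KV usage, empty queue.
idle = [
    MetricSample(0.30, 5, 0),
    MetricSample(0.31, 5, 0),
    MetricSample(0.30, 6, 0),
]

print(is_saturated(prefill_only))
print(is_saturated(idle))
```

With a KV-only rule, the prefill-only window would not be flagged; the queue-based signal is what distinguishes it from the idle window here.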

Would you like more detail on how these metrics interact or how to use them for autoscaling?

Sources:

Better understanding of the kv_cache_usage_perc and num_requests_running metrics can be used to refine the saturation-based algorithm · Issue #662 · llm-d-incubation/workload-variant-autoscaler

