When vLLM’s KV cache reaches 100% utilization, the scheduler preempts requests, moving them from the running queue back to the waiting queue, and prompt/generation throughput drops to zero until cache space is freed. This is expected behavior: when there isn’t enough KV cache space, vLLM preempts running requests and blocks new ones, causing a temporary throughput stall; cache usage then drops as requests complete and memory is released. During this period, TFLOPS and memory bandwidth may remain nonzero due to ongoing computation on the remaining requests or cache cleanup, but no new prompts are processed until space is available again. This pattern matches your metrics and graphs, where cache utilization drops from 100% to 0%, all running requests become waiting, and throughput stalls temporarily before resuming as the cache is freed (Optimization and Tuning, Metrics, related issue).
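If it helps to confirm this from the metrics side, here is a minimal sketch that polls the server’s Prometheus `/metrics` endpoint for the relevant gauges. The host/port is an assumption for your deployment, and the metric names (`vllm:gpu_cache_usage_perc`, `vllm:num_requests_running`, `vllm:num_requests_waiting`, `vllm:num_preemptions_total`) may differ slightly across vLLM versions:

```python
import re
import time

import requests

# Assumed endpoint; adjust host/port to your deployment.
METRICS_URL = "http://localhost:8000/metrics"

# Gauges/counters exposed by vLLM's Prometheus endpoint (names may vary by version).
WATCHED = [
    "vllm:gpu_cache_usage_perc",   # KV cache utilization (0.0-1.0)
    "vllm:num_requests_running",   # requests currently running
    "vllm:num_requests_waiting",   # requests waiting (including preempted ones)
    "vllm:num_preemptions_total",  # cumulative preemption count
]


def scrape(url: str) -> dict[str, float]:
    """Fetch the Prometheus text exposition and extract the watched metrics."""
    text = requests.get(url, timeout=5).text
    values = {}
    for name in WATCHED:
        # Matches lines like: vllm:gpu_cache_usage_perc{model_name="..."} 0.97
        m = re.search(rf'^{re.escape(name)}(?:{{[^}}]*}})?\s+([0-9.eE+-]+)',
                      text, re.MULTILINE)
        if m:
            values[name] = float(m.group(1))
    return values


if __name__ == "__main__":
    while True:
        print(scrape(METRICS_URL))
        time.sleep(5)
```

A sustained rise in `vllm:num_preemptions_total` alongside cache utilization pinned near 1.0 is the signature of the stall described above.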
If this happens frequently, consider lowering max-num-seqs or max-num-batched-tokens, increasing gpu-memory-utilization, or increasing tensor parallelism to reduce preemption and keep throughput more stable (Optimization and Tuning); a sketch of these settings follows below.
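For reference, here is a minimal sketch of the corresponding engine arguments using the offline `LLM` API (equivalent flags exist on `vllm serve`). The model name and the specific values are placeholders; tune them for your hardware and workload:

```python
from vllm import LLM

# A minimal sketch, assuming the offline engine; values below are placeholders.
# Equivalent server flags: --gpu-memory-utilization, --max-num-seqs,
# --max-num-batched-tokens, --tensor-parallel-size.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # hypothetical model
    gpu_memory_utilization=0.95,   # give the KV cache more headroom
    max_num_seqs=128,              # cap concurrent sequences to reduce preemption
    max_num_batched_tokens=4096,   # cap tokens scheduled per step
    tensor_parallel_size=2,        # spread the KV cache across more GPUs
)
```

Would you like a step-by-step explanation of how vLLM manages cache and request scheduling in this scenario?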