vLLM 0.18.0: KV cache usage drops from 100% to 0%

```
vllm serve Qwen3.5-9B --served-model-name default_model --enable-prompt-tokens-details --enable-mfu-metrics --max-num-batched-tokens=64K --max-num-seqs 128 --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_xml --no-enable-prefix-caching --port 8000 --host :: --language-model-only
```

```
Available KV cache memory: 45.94 GiB
Maximum concurrency for 262,144 tokens per request: 5.70x
```

```
vllm bench serve   --seed 42 --disable-shuffle   --backend openai   --endpoint /v1/completions   --host $MY_POD_IPV6   --port 8000   --request-rate 18   --max-concurrency 18   --num-prompts 60   --input-len 80000   --output-len 800   --temperature 0.0   --top-p 1.0
```

With 80K-token inputs, roughly 18 concurrent requests should just fit, which should drive KV cache utilization to its maximum.
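
Rough math behind that estimate (a back-of-the-envelope sketch; it assumes KV capacity scales linearly with token count and ignores block-granularity rounding):

```python
# Numbers taken from the startup log above.
max_model_len = 262_144               # tokens per request at full context
reported_concurrency = 5.70           # "Maximum concurrency ... 5.70x"

total_kv_tokens = max_model_len * reported_concurrency   # ~1.49M tokens of KV capacity
tokens_per_request = 80_000 + 800                        # bench input-len + output-len

print(total_kv_tokens / tokens_per_request)              # ~18.5 -> about 18 requests fit
```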

During the load test, I noticed a period where the metrics showed no throughput (while the vLLM log still reported throughput).


During that window, KV cache usage dropped from 100% to 0%, and every request in the running queue moved to waiting.

Yet TFLOPS and memory-write activity were still nonzero.

What is going on?

When vLLM's KV cache reaches 100% utilization, running requests can be preempted back to the waiting queue, and prompt/generation throughput drops to zero until cache space is freed. This is expected behavior: when there is not enough KV cache for the next decoding step, vLLM preempts requests, freeing their blocks to be recomputed later, which temporarily stalls throughput and lowers cache usage as memory is released. During this period, TFLOPS and memory bandwidth may remain nonzero due to ongoing computation on the remaining requests, but no new prompts are processed until space is available again. This pattern matches your metrics and graphs: cache utilization drops from 100% to 0%, all running requests become waiting, and throughput stalls temporarily before resuming once the cache is freed (Optimization and Tuning, Metrics, related issue).
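
For intuition, here is a highly simplified sketch of that preemption behavior. This is illustrative pseudocode, not vLLM's actual scheduler: the real scheduler works on fixed-size KV blocks and only needs a new block every block_size tokens.

```python
# Illustrative sketch (NOT vLLM's real scheduler): when a running request
# cannot get a KV block for its next token, the lowest-priority running
# request is preempted back to waiting and ALL of its blocks are freed
# (recompute-style preemption, so it must re-prefill later).
from collections import deque

def schedule_decode_step(running: deque, waiting: deque, free_blocks: int) -> int:
    scheduled = []
    while running:
        req = running.popleft()
        while free_blocks == 0 and running:
            victim = running.pop()            # preempt the lowest-priority request
            free_blocks += victim["blocks"]   # its whole KV allocation is freed at once
            victim["blocks"] = 0
            waiting.appendleft(victim)
        if free_blocks == 0:                  # not even `req` itself can proceed
            waiting.appendleft(req)           # everything ends up in `waiting`
            break
        free_blocks -= 1                      # allocate one block for the new token
        req["blocks"] += 1
        scheduled.append(req)
    running.extend(scheduled)
    return free_blocks
```

Note that with ~80K-token requests each preemption frees a very large chunk of cache at once, which is why utilization can swing sharply rather than decline gradually.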

If this happens frequently, consider lowering max-num-seqs or max-num-batched-tokens, or increasing available GPU memory or parallelism to reduce preemption and improve throughput stability (Optimization and Tuning). Would you like a step-by-step explanation of how vLLM manages cache and request scheduling in this scenario?

Hint: Mention @RunLLM in the post for followups.

@RunLLM But I did not see any preemption happen.

@RunLLM With the bench concurrency lowered to 15, usage still drops to 0%; peak KV cache usage was 82.1% in that run, and there is still no record of preemption.
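
For reference, this is roughly how I watch the metrics during the stall. Metric names differ across vLLM versions (e.g. the preemption counter and the KV-cache usage gauge), so this sketch just greps for substrings rather than hard-coding exact names:

```python
# Sketch: poll the vLLM /metrics endpoint during the stall and print any
# preemption / KV-cache / queue series. Substring matching is used because
# metric names vary across vLLM versions.
import time
import urllib.request

URL = "http://localhost:8000/metrics"   # adjust host/port (the server above binds [::]:8000)
KEYWORDS = ("preempt", "cache_usage", "num_requests_running", "num_requests_waiting")

while True:
    body = urllib.request.urlopen(URL).read().decode()
    for line in body.splitlines():
        if not line.startswith("#") and any(k in line for k in KEYWORDS):
            print(line)
    print("---")
    time.sleep(5)
```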