When vLLM’s KV cache reaches 100% utilization, the scheduler preempts requests, moving them from the running queue back to the waiting queue, and prompt/generation throughput drops to zero until cache space is freed. This is expected behavior: when there isn’t enough KV cache space, vLLM preempts running requests and blocks new ones, causing a temporary throughput stall; cache usage then drops as requests complete and memory is released. During this period, TFLOPS and memory bandwidth may remain nonzero due to ongoing computation on the remaining requests or cache cleanup, but no new prompts are processed until space is available again. This pattern matches your metrics and graphs, where cache utilization drops from 100% to 0%, all running requests become waiting, and throughput stalls temporarily before resuming as the cache is freed (Optimization and Tuning, Metrics, related issue).
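If it helps to confirm this from the metrics side, here is a minimal sketch that polls the server’s Prometheus `/metrics` endpoint for the relevant gauges. The host/port is an assumption for your deployment, and the metric names (`vllm:gpu_cache_usage_perc`, `vllm:num_requests_running`, `vllm:num_requests_waiting`, `vllm:num_preemptions_total`) may differ slightly across vLLM versions:

```python
import re
import time

import requests

# Assumed endpoint; adjust host/port to your deployment.
METRICS_URL = "http://localhost:8000/metrics"

# Gauges/counters exposed by vLLM's Prometheus endpoint (names may vary by version).
WATCHED = [
    "vllm:gpu_cache_usage_perc",   # KV cache utilization (0.0-1.0)
    "vllm:num_requests_running",   # requests currently running
    "vllm:num_requests_waiting",   # requests waiting (including preempted ones)
    "vllm:num_preemptions_total",  # cumulative preemption count
]


def scrape(url: str) -> dict[str, float]:
    """Fetch the Prometheus text exposition and extract the watched metrics."""
    text = requests.get(url, timeout=5).text
    values = {}
    for name in WATCHED:
        # Matches lines like: vllm:gpu_cache_usage_perc{model_name="..."} 0.97
        m = re.search(rf'^{re.escape(name)}(?:{{[^}}]*}})?\s+([0-9.eE+-]+)',
                      text, re.MULTILINE)
        if m:
            values[name] = float(m.group(1))
    return values


if __name__ == "__main__":
    while True:
        print(scrape(METRICS_URL))
        time.sleep(5)
```

A sustained rise in `vllm:num_preemptions_total` alongside cache utilization pinned near 1.0 is the signature of the stall described above.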
If this happens frequently, consider lowering max-num-seqs or max-num-batched-tokens, increasing gpu-memory-utilization, or increasing tensor parallelism to reduce preemption and keep throughput more stable (Optimization and Tuning); a sketch of these settings follows below.
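For reference, here is a minimal sketch of the corresponding engine arguments using the offline `LLM` API (equivalent flags exist on `vllm serve`). The model name and the specific values are placeholders; tune them for your hardware and workload:

```python
from vllm import LLM

# A minimal sketch, assuming the offline engine; values below are placeholders.
# Equivalent server flags: --gpu-memory-utilization, --max-num-seqs,
# --max-num-batched-tokens, --tensor-parallel-size.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # hypothetical model
    gpu_memory_utilization=0.95,   # give the KV cache more headroom
    max_num_seqs=128,              # cap concurrent sequences to reduce preemption
    max_num_batched_tokens=4096,   # cap tokens scheduled per step
    tensor_parallel_size=2,        # spread the KV cache across more GPUs
)
```

Would you like a step-by-step explanation of how vLLM manages cache and request scheduling in this scenario?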