When running the benchmark with vLLM 0.10.1 V1, GPU usage was very high

Why does the GPU not overheat when running the benchmark with vLLM 0.10.1 on the V0 engine, but overheat on V1, after which requests get stuck? I ran 3 instances with tp=2 on H20 GPUs.

What are the steps to troubleshoot the high GPU occupancy rate?

Engine 000: Avg prompt throughput: 500.0 tokens/s, Avg generation throughput: 183.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 99.8%, Prefix cache hit rate: 86.1%

High GPU KV cache usage (near 100%) in vLLM V1 can cause the engine to hang and requests to get stuck, especially under heavy load or with long sequences. This is a known issue discussed in several reports where the server becomes unresponsive once the KV cache is full, whereas V0 may handle the same situation more gracefully through different preemption/swap strategies. The high GPU occupancy (and the resulting heat) is likely due to V1's more aggressive memory allocation and scheduling, which can exhaust resources and block the scheduler if not tuned. Troubleshooting steps include monitoring KV cache usage, reducing max_model_len, max_num_seqs, or max_num_batched_tokens, and tuning gpu_memory_utilization so cache usage never reaches 100%. Also make sure you are on the latest vLLM release, as some related bugs have been fixed in recent versions. For H20 GPUs with tensor parallelism, careful tuning is especially important because of memory constraints and how the workload is distributed across ranks (source1, source2, source3, source4).
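As a rough illustration of how these knobs fit together, here is a minimal sketch using the offline `LLM` entry point. The model name and the numeric values are placeholders to adapt to your deployment, not verified recommendations for H20:

```python
# Sketch: illustrative engine-argument tuning for a 2-way tensor-parallel instance.
# Model name and numbers are placeholders, not recommendations.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder model
    tensor_parallel_size=2,            # tp2, as in the reported setup
    gpu_memory_utilization=0.85,       # below the 0.9 default to leave headroom
    max_model_len=8192,                # cap context length to shrink the KV cache footprint
    max_num_seqs=64,                   # fewer concurrent sequences per scheduling step
    max_num_batched_tokens=4096,       # smaller prefill batches reduce memory spikes
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```

The same parameters can be passed to `vllm serve` as command-line flags if you are running the OpenAI-compatible server instead of the offline API.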

To troubleshoot:

1. Monitor GPU KV cache usage and temperature during the benchmark (a minimal monitoring sketch follows this list).
2. If KV cache usage approaches 100%, reduce the batch size (max_num_seqs), the context length (max_model_len), or the number of batched tokens (max_num_batched_tokens).
3. Lower gpu_memory_utilization (default 0.9) to prevent overallocation.
4. If the problem persists, consider increasing tensor or pipeline parallelism to spread the memory load, keeping the extra synchronization overhead in mind.
5. For multi-instance setups, make sure the instances together do not oversubscribe GPU memory.
6. If overheating continues, check for driver or hardware issues and consider limiting the number of concurrent requests.
7. Always check the vLLM logs for warnings about preemption or memory exhaustion (source5, source6).
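If you want to watch temperature and memory alongside vLLM's own KV-cache log line, a small NVML poller is one option. This is a sketch assuming the nvidia-ml-py package (imported as `pynvml`) is installed; it is not part of vLLM, and the polling interval is arbitrary:

```python
# Sketch: poll GPU temperature, SM utilization, and memory while the benchmark runs.
# Assumes the nvidia-ml-py package (pynvml); stop with Ctrl+C.
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

try:
    while True:
        for i, h in enumerate(handles):
            temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
            util = pynvml.nvmlDeviceGetUtilizationRates(h)
            mem = pynvml.nvmlDeviceGetMemoryInfo(h)
            print(f"GPU{i}: {temp}C, sm={util.gpu}%, "
                  f"mem={mem.used / 2**30:.1f}/{mem.total / 2**30:.1f} GiB")
        time.sleep(5)  # polling interval in seconds; adjust as needed
except KeyboardInterrupt:
    pynvml.nvmlShutdown()
```

Cross-reference the timestamps of temperature spikes with the engine log lines (like the one quoted above) to see whether overheating coincides with the KV cache filling up.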

Would you like a step-by-step guide for tuning these parameters or interpreting the logs?

Sources:

Hint: Mention @RunLLM in the post for followups.