When running the benchmark with vLLM 0.10.1 V1, GPU usage was very high

Why does the GPU not overheat when running the benchmark with vLLM 0.10.1 on the V0 engine, but overheat on V1, after which requests get stuck? I ran 3 instances with tp=2 on H20 GPUs.

What are the steps to troubleshoot the high GPU occupancy rate?

Engine 000: Avg prompt throughput: 500.0 tokens/s, Avg generation throughput: 183.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 99.8%, Prefix cache hit rate: 86.1%

High GPU KV cache usage (near 100%) in vLLM V1 can cause the engine to hang and requests to get stuck, especially under heavy load or with long sequences. This is a known issue discussed in several reports where the server becomes unresponsive once the KV cache is full, whereas V0 may handle the same situation more gracefully through different preemption/swap strategies. The high GPU occupancy (and the resulting heat) is likely due to V1's more aggressive memory allocation and scheduling, which can exhaust resources and block the scheduler if not tuned. Troubleshooting steps include monitoring KV cache usage, reducing max_model_len, max_num_seqs, or max_num_batched_tokens, and tuning gpu_memory_utilization so cache usage never reaches 100%. Also make sure you are on the latest vLLM release, as some related bugs have been fixed in recent versions. For H20 GPUs with tensor parallelism, careful tuning is especially important because of memory constraints and how the workload is distributed across ranks (source1, source2, source3, source4).
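As a rough illustration of how these knobs fit together, here is a minimal sketch using the offline `LLM` entry point. The model name and the numeric values are placeholders to adapt to your deployment, not verified recommendations for H20:

```python
# Sketch: illustrative engine-argument tuning for a 2-way tensor-parallel instance.
# Model name and numbers are placeholders, not recommendations.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder model
    tensor_parallel_size=2,            # tp2, as in the reported setup
    gpu_memory_utilization=0.85,       # below the 0.9 default to leave headroom
    max_model_len=8192,                # cap context length to shrink the KV cache footprint
    max_num_seqs=64,                   # fewer concurrent sequences per scheduling step
    max_num_batched_tokens=4096,       # smaller prefill batches reduce memory spikes
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```

The same parameters can be passed to `vllm serve` as command-line flags if you are running the OpenAI-compatible server instead of the offline API.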

To troubleshoot:

1. Monitor GPU KV cache usage and temperature during the benchmark (a minimal monitoring sketch follows this list).
2. If KV cache usage approaches 100%, reduce the batch size (max_num_seqs), the context length (max_model_len), or the number of batched tokens (max_num_batched_tokens).
3. Lower gpu_memory_utilization (default 0.9) to prevent overallocation.
4. If the problem persists, consider increasing tensor or pipeline parallelism to spread the memory load, keeping the extra synchronization overhead in mind.
5. For multi-instance setups, make sure the instances together do not oversubscribe GPU memory.
6. If overheating continues, check for driver or hardware issues and consider limiting the number of concurrent requests.
7. Always check the vLLM logs for warnings about preemption or memory exhaustion (source5, source6).
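If you want to watch temperature and memory alongside vLLM's own KV-cache log line, a small NVML poller is one option. This is a sketch assuming the nvidia-ml-py package (imported as `pynvml`) is installed; it is not part of vLLM, and the polling interval is arbitrary:

```python
# Sketch: poll GPU temperature, SM utilization, and memory while the benchmark runs.
# Assumes the nvidia-ml-py package (pynvml); stop with Ctrl+C.
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

try:
    while True:
        for i, h in enumerate(handles):
            temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
            util = pynvml.nvmlDeviceGetUtilizationRates(h)
            mem = pynvml.nvmlDeviceGetMemoryInfo(h)
            print(f"GPU{i}: {temp}C, sm={util.gpu}%, "
                  f"mem={mem.used / 2**30:.1f}/{mem.total / 2**30:.1f} GiB")
        time.sleep(5)  # polling interval in seconds; adjust as needed
except KeyboardInterrupt:
    pynvml.nvmlShutdown()
```

Cross-reference the timestamps of temperature spikes with the engine log lines (like the one quoted above) to see whether overheating coincides with the KV cache filling up.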

Would you like a step-by-step guide for tuning these parameters or interpreting the logs?

Sources:

Hint: Mention @RunLLM in the post for followups.