Low Average GPU Utilization (40–70%) on H100 with vLLM — How to Push Toward 90%+?

Hi everyone,

I’m running vLLM for large-scale inference on H100 GPUs, and I’m seeing lower-than-expected average GPU utilization.

vLLM setup command

docker run -d \
  --name vllm-dp8 \
  --gpus all \
  -p 8000:8000 \
  --ipc=host \
  -v /projects/data/downloads/nauman/lang_filter/nemov2:/workspace \
  vllm/vllm-openai:latest \
  --model EssentialAI/eai-distill-0.5b \
  --dtype float16 \
  --data-parallel-size 8 \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 4096 \
  --max-num-batched-tokens 131072 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --disable-log-requests \
  --disable-log-stats

Setup

  • GPU: 8× NVIDIA H100 (one replica per GPU via --data-parallel-size 8)

  • Framework: vLLM (latest)

  • Serving via: OpenAI-compatible API

  • GPU Memory Utilization: ~90%

  • GPU Compute Utilization:

    • Peaks: ~70–90%

    • Average: ~40–70%

Repository (client + workload generator):
https://github.com/Noman654/Essential_ai_quality_classifier.git

Goal

I’m trying to achieve sustained ~90%+ GPU utilization for inference-heavy workloads.

Current Behavior

  • Memory is mostly full, so KV cache is not the limiting factor.

  • Utilization fluctuates heavily.

  • GPU often waits between batches.

  • Increasing traffic only improves utilization slightly.

What I’ve Tried

  • Increasing max_num_seqs

  • Increasing max_num_batched_tokens

  • Adjusting concurrency on client side

  • Running multiple clients

Still, average utilization stays below ~70%.
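A quick sanity check before more server tuning: with --data-parallel-size 8, the client must keep far more requests in flight than a single-GPU setup would. A minimal back-of-envelope sketch using Little's law (the per-replica in-flight target and the 2 s mean latency below are illustrative assumptions, not measurements from this setup):

```python
# Rough capacity math (Little's law): in-flight requests ≈ arrival rate × latency.
# To keep the scheduler's running batch full on every replica, the client must
# hold at least (replicas × target in-flight sequences per replica) requests open.

def min_client_concurrency(replicas: int, target_seqs_per_replica: int) -> int:
    """Lower bound on concurrently open client requests."""
    return replicas * target_seqs_per_replica

def required_qps(in_flight: int, mean_latency_s: float) -> float:
    """Arrival rate needed to sustain `in_flight` open requests at a given mean latency."""
    return in_flight / mean_latency_s

needed = min_client_concurrency(replicas=8, target_seqs_per_replica=512)
print(needed)                     # 4096 in-flight requests
print(required_qps(needed, 2.0))  # 2048.0 requests/s at 2 s mean latency
```

If the client's actual concurrency is well below this bound, the GPUs will idle between batches regardless of server-side settings.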

Sustained high GPU utilization in vLLM depends on server-side batching, client-side request concurrency, and the efficiency of request scheduling. Even with high memory usage, GPU compute utilization can be limited by CPU bottlenecks, suboptimal batching, or request arrival patterns.

Increasing max_num_seqs and max_num_batched_tokens helps, but if the client does not send enough concurrent requests to saturate the batch, the GPU will still idle between batches. vLLM's async scheduling and chunked prefill also affect how well requests are packed into each batch, and therefore utilization.

For data-parallel setups, ensure each replica is fully loaded with requests and that client concurrency matches or exceeds the aggregate batch size across all replicas. According to the vLLM optimization guide, tuning both server and client concurrency is critical, and monitoring preemption or queueing can help identify bottlenecks.
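A common client-side fix for the idle-between-batches pattern is to keep a fixed number of requests in flight with a semaphore, rather than sending fixed-size sequential batches and waiting for each batch to finish. A minimal asyncio sketch — `fake_send` is a stand-in for a real OpenAI-compatible API call against http://localhost:8000, and all names here are illustrative:

```python
import asyncio

async def run_with_concurrency(prompts, send, max_in_flight: int):
    """Dispatch all prompts, keeping up to `max_in_flight` requests
    outstanding at once so the server's running batch never drains."""
    sem = asyncio.Semaphore(max_in_flight)

    async def one(prompt):
        async with sem:          # a slot frees as soon as any request finishes
            return await send(prompt)

    # gather preserves input order in its results
    return await asyncio.gather(*(one(p) for p in prompts))

# Fake sender that echoes after a short delay; replace with a real
# aiohttp/openai call in practice.
async def fake_send(prompt):
    await asyncio.sleep(0.01)
    return prompt.upper()

results = asyncio.run(run_with_concurrency(["a", "b", "c"], fake_send, max_in_flight=2))
print(results)  # ['A', 'B', 'C']
```

With this pattern, a completed request is immediately replaced by a new one, so the server always has a full queue to pack into the next batch.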

Additionally, CPU-side preprocessing (especially for multi-modal or large-input requests) can become a bottleneck, as has been reported in vLLM issues, and may limit how quickly requests are added to the batch. If you observe that the "adding requests" phase is slow, consider scaling out API servers (--api-server-count), optimizing client request patterns, or profiling CPU usage. For further improvement, review the parallelism and scaling documentation to ensure optimal use of tensor/pipeline/data parallelism and network configuration.
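To tell whether the GPUs are compute-bound or starved, vLLM's Prometheus /metrics endpoint is more informative than nvidia-smi: if requests-waiting is near zero while utilization dips, the client is not feeding the server fast enough. A sketch of pulling a gauge out of the Prometheus text format (gauge names like `vllm:num_requests_running` are what recent vLLM builds export, but verify against your version's /metrics output):

```python
import re
import urllib.request

def parse_gauge(metrics_text: str, name: str) -> float:
    """Extract the first sample value for `name` from Prometheus text exposition."""
    # Matches lines like: vllm:num_requests_running 37.0  (optional {labels})
    m = re.search(rf"^{re.escape(name)}(?:\{{[^}}]*\}})?\s+([0-9.eE+-]+)",
                  metrics_text, re.M)
    if m is None:
        raise KeyError(name)
    return float(m.group(1))

sample = "vllm:num_requests_running 37.0\nvllm:num_requests_waiting 0.0\n"
print(parse_gauge(sample, "vllm:num_requests_running"))  # 37.0

# Against a live server (uncomment when the container is up):
# text = urllib.request.urlopen("http://localhost:8000/metrics").read().decode()
# print(parse_gauge(text, "vllm:num_requests_waiting"))
```

Polling these two gauges once a second during a load test makes the starvation pattern obvious: sustained waiting > 0 points at the server, waiting ≈ 0 with low running counts points at the client.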

