Low Average GPU Utilization (40–70%) on H100 with vLLM — How to Push Toward 90%+?

Hi everyone,

I’m running vLLM for large-scale inference on H100 GPUs, and I’m seeing lower-than-expected average GPU utilization.

vLLM setup command

docker run -d \
  --name vllm-dp8 \
  --gpus all \
  -p 8000:8000 \
  --ipc=host \
  -v /projects/data/downloads/nauman/lang_filter/nemov2:/workspace \
  vllm/vllm-openai:latest \
  --model EssentialAI/eai-distill-0.5b \
  --dtype float16 \
  --data-parallel-size 8 \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 4096 \
  --max-num-batched-tokens 131072 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --disable-log-requests \
  --disable-log-stats

Setup

  • GPU: 8× NVIDIA H100 (one replica per GPU via --data-parallel-size 8)

  • Framework: vLLM (latest)

  • Serving via: OpenAI-compatible API

  • GPU Memory Utilization: ~90%

  • GPU Compute Utilization:

    • Peaks: ~70–90%

    • Average: ~40–70%

Repository (client + workload generator):
https://github.com/Noman654/Essential_ai_quality_classifier.git

Goal

I’m trying to achieve sustained ~90%+ GPU utilization for inference-heavy workloads.

Current Behavior

  • Memory is mostly full, so KV cache is not the limiting factor.

  • Utilization fluctuates heavily.

  • GPU often waits between batches.

  • Increasing traffic only improves utilization slightly.

What I’ve Tried

  • Increasing max_num_seqs

  • Increasing max_num_batched_tokens

  • Adjusting concurrency on client side

  • Running multiple clients

Still, average utilization stays below ~70%.
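A quick sanity check before more server tuning: with --data-parallel-size 8, the client must keep far more requests in flight than a single-GPU setup would. A minimal back-of-envelope sketch using Little's law (the per-replica in-flight target and the 2 s mean latency below are illustrative assumptions, not measurements from this setup):

```python
# Rough capacity math (Little's law): in-flight requests ≈ arrival rate × latency.
# To keep the scheduler's running batch full on every replica, the client must
# hold at least (replicas × target in-flight sequences per replica) requests open.

def min_client_concurrency(replicas: int, target_seqs_per_replica: int) -> int:
    """Lower bound on concurrently open client requests."""
    return replicas * target_seqs_per_replica

def required_qps(in_flight: int, mean_latency_s: float) -> float:
    """Arrival rate needed to sustain `in_flight` open requests at a given mean latency."""
    return in_flight / mean_latency_s

needed = min_client_concurrency(replicas=8, target_seqs_per_replica=512)
print(needed)                     # 4096 in-flight requests
print(required_qps(needed, 2.0))  # 2048.0 requests/s at 2 s mean latency
```

If the client's actual concurrency is well below this bound, the GPUs will idle between batches regardless of server-side settings.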

Sustained high GPU utilization in vLLM depends on server-side batching, client-side request concurrency, and the efficiency of request scheduling. Even with high memory usage, GPU compute utilization can be limited by CPU bottlenecks, suboptimal batching, or request arrival patterns.

Increasing max_num_seqs and max_num_batched_tokens helps, but if the client does not send enough concurrent requests to saturate the batch, the GPU will still idle between batches. vLLM's async scheduling and chunked prefill also affect how well requests are packed into each batch, and therefore utilization.

For data-parallel setups, ensure each replica is fully loaded with requests and that client concurrency matches or exceeds the aggregate batch size across all replicas. According to the vLLM optimization guide, tuning both server and client concurrency is critical, and monitoring preemption or queueing can help identify bottlenecks.
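A common client-side fix for the idle-between-batches pattern is to keep a fixed number of requests in flight with a semaphore, rather than sending fixed-size sequential batches and waiting for each batch to finish. A minimal asyncio sketch — `fake_send` is a stand-in for a real OpenAI-compatible API call against http://localhost:8000, and all names here are illustrative:

```python
import asyncio

async def run_with_concurrency(prompts, send, max_in_flight: int):
    """Dispatch all prompts, keeping up to `max_in_flight` requests
    outstanding at once so the server's running batch never drains."""
    sem = asyncio.Semaphore(max_in_flight)

    async def one(prompt):
        async with sem:          # a slot frees as soon as any request finishes
            return await send(prompt)

    # gather preserves input order in its results
    return await asyncio.gather(*(one(p) for p in prompts))

# Fake sender that echoes after a short delay; replace with a real
# aiohttp/openai call in practice.
async def fake_send(prompt):
    await asyncio.sleep(0.01)
    return prompt.upper()

results = asyncio.run(run_with_concurrency(["a", "b", "c"], fake_send, max_in_flight=2))
print(results)  # ['A', 'B', 'C']
```

With this pattern, a completed request is immediately replaced by a new one, so the server always has a full queue to pack into the next batch.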

Additionally, CPU-side preprocessing (especially for multi-modal or large-input requests) can become a bottleneck, as has been reported in vLLM issues, and may limit how quickly requests are added to the batch. If you observe that the "adding requests" phase is slow, consider scaling out API servers (--api-server-count), optimizing client request patterns, or profiling CPU usage. For further improvement, review the parallelism and scaling documentation to ensure optimal use of tensor/pipeline/data parallelism and network configuration.
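To tell whether the GPUs are compute-bound or starved, vLLM's Prometheus /metrics endpoint is more informative than nvidia-smi: if requests-waiting is near zero while utilization dips, the client is not feeding the server fast enough. A sketch of pulling a gauge out of the Prometheus text format (gauge names like `vllm:num_requests_running` are what recent vLLM builds export, but verify against your version's /metrics output):

```python
import re
import urllib.request

def parse_gauge(metrics_text: str, name: str) -> float:
    """Extract the first sample value for `name` from Prometheus text exposition."""
    # Matches lines like: vllm:num_requests_running 37.0  (optional {labels})
    m = re.search(rf"^{re.escape(name)}(?:\{{[^}}]*\}})?\s+([0-9.eE+-]+)",
                  metrics_text, re.M)
    if m is None:
        raise KeyError(name)
    return float(m.group(1))

sample = "vllm:num_requests_running 37.0\nvllm:num_requests_waiting 0.0\n"
print(parse_gauge(sample, "vllm:num_requests_running"))  # 37.0

# Against a live server (uncomment when the container is up):
# text = urllib.request.urlopen("http://localhost:8000/metrics").read().decode()
# print(parse_gauge(text, "vllm:num_requests_waiting"))
```

Polling these two gauges once a second during a load test makes the starvation pattern obvious: sustained waiting > 0 points at the server, waiting ≈ 0 with low running counts points at the client.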

