Hi everyone,
I’m running vLLM for large-scale inference on H100 GPUs, and I’m seeing lower-than-expected average GPU utilization.
vLLM setup command:

```bash
docker run -d \
  --name vllm-dp8 \
  --gpus all \
  -p 8000:8000 \
  --ipc=host \
  -v /projects/data/downloads/nauman/lang_filter/nemov2:/workspace \
  vllm/vllm-openai:latest \
  --model EssentialAI/eai-distill-0.5b \
  --dtype float16 \
  --data-parallel-size 8 \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 4096 \
  --max-num-batched-tokens 131072 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --disable-log-requests \
  --disable-log-stats
```
Setup
- GPU: NVIDIA H100 (×8, data-parallel, per the command above)
- Framework: vLLM (latest)
- Serving via: OpenAI-compatible API
- GPU Memory Utilization: ~90%
- GPU Compute Utilization:
  - Peaks: ~70–90%
  - Average: ~40–70% (see the sampling sketch after this list)
- Repository (client + workload generator): https://github.com/Noman654/Essential_ai_quality_classifier.git
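The peak/average figures above come from periodic per-GPU utilization sampling. Here is a minimal sketch of that kind of measurement using `pynvml`; it is illustrative only, not the exact tooling behind the numbers:

```python
# Minimal utilization sampler: reads instantaneous GPU compute utilization for
# every visible device once per second, averages across devices, then reports
# the mean and peak over the sampling window.
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]

samples = []
for _ in range(60):  # one sample per second over a minute
    per_gpu = [pynvml.nvmlDeviceGetUtilizationRates(h).gpu for h in handles]
    samples.append(sum(per_gpu) / len(per_gpu))
    time.sleep(1)

print(f"average: {sum(samples) / len(samples):.1f}%  peak: {max(samples):.1f}%")
pynvml.nvmlShutdown()
```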
Goal
I’m trying to achieve sustained ~90%+ GPU utilization for inference-heavy workloads.
Current Behavior
- Memory is mostly full, so KV cache capacity is not the limiting factor.
- Utilization fluctuates heavily.
- The GPUs often sit idle between batches (see the /metrics check after this list).
- Increasing traffic only improves utilization slightly.
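One way to confirm the "waiting between batches" behavior is to watch the scheduler queue through the Prometheus /metrics endpoint of the OpenAI-compatible server. A rough sketch follows; it assumes the endpoint is reporting scheduler stats, and `--disable-log-stats` in the command above may suppress them, in which case that flag would need to be dropped for this check:

```python
# Poll the server's /metrics endpoint and print lines mentioning running/waiting
# requests; a waiting queue that is usually empty suggests the GPUs are starved
# for work rather than bottlenecked on compute.
import time
import requests

while True:
    body = requests.get("http://localhost:8000/metrics", timeout=5).text
    for line in body.splitlines():
        if line.startswith("#"):
            continue
        if "running" in line or "waiting" in line:
            print(line)
    print("---")
    time.sleep(5)
```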
What I’ve Tried
- Increasing `max_num_seqs`
- Increasing `max_num_batched_tokens`
- Adjusting concurrency on the client side (a sketch of this pattern is at the end of the post)
- Running multiple clients in parallel
Still, average utilization stays below ~70%.
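For reference, the client-side concurrency pattern I have been tuning looks roughly like this. It is a simplified sketch, not the actual workload generator from the repository linked above; the model name matches what the server loads, everything else is illustrative:

```python
# Simplified client sketch: keep up to CONCURRENCY completion requests in flight
# against the OpenAI-compatible endpoint started by the docker command above.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
CONCURRENCY = 512  # number of in-flight requests; tuned up and down during testing

async def score(sem: asyncio.Semaphore, doc: str) -> str:
    async with sem:
        resp = await client.completions.create(
            model="EssentialAI/eai-distill-0.5b",
            prompt=doc,
            max_tokens=8,
        )
        return resp.choices[0].text

async def main(docs: list[str]) -> None:
    sem = asyncio.Semaphore(CONCURRENCY)
    results = await asyncio.gather(*(score(sem, d) for d in docs))
    print(f"scored {len(results)} documents")

if __name__ == "__main__":
    asyncio.run(main(["example document to classify"] * 2000))
```

Even with the concurrency limit raised well past `max_num_seqs` per replica, and several of these client processes running in parallel, the average utilization reported above barely moves.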