CPU utilization is extremely high during inference and becomes the primary performance bottleneck

Environment:

  • vLLM: 0.5.1

  • PyTorch: 2.3.0+cu121

  • Transformers: 4.41.2

  • GPU: NVIDIA L4

  • CPU: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz (Max clock: 3400 MHz)

Deployment Command:

python -m vllm.entrypoints.openai.api_server \
    --served-model-name internvl2_5 \
    --model internvl2_5_1B \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.9 \
    --port 9084 \
    --trust-remote-code \
    --max-model-len 2432 \
    --max-num-seqs 2 \
    --max-num-batched-tokens 4864

Problem Description:

I am experiencing a severe CPU bottleneck when deploying the InternVL-2.5-1B model using vLLM on an NVIDIA L4 GPU.

The CPU utilization is extremely high during inference. To diagnose this, I conducted an experiment:

  1. Unrestricted CPU: The inference time is approximately 400ms.

  2. CPU Limited to 4 Cores: The inference time increases to 1200ms, three times slower (one way to apply such a core limit is sketched after this list).
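(One common way to apply this kind of core restriction, for anyone reproducing the experiment, is to pin the server process to a fixed set of cores with taskset; the core IDs below are illustrative.)

# Pin the vLLM server to 4 cores (0-3) at launch; all other flags unchanged.
taskset -c 0-3 python -m vllm.entrypoints.openai.api_server \
    --served-model-name internvl2_5 \
    --model internvl2_5_1B \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.9 \
    --port 9084 \
    --trust-remote-code \
    --max-model-len 2432 \
    --max-num-seqs 2 \
    --max-num-batched-tokens 4864

# Or restrict an already running server by PID:
# taskset -cp 0-3 <server_pid>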

This clearly indicates that the CPU has become the primary performance bottleneck, severely limiting the overall throughput of the system.

How can this critical CPU contention issue be resolved?

High CPU utilization during vLLM inference is a known bottleneck for multi-modal models like InternVL-2.5-1B: CPU-bound preprocessing (e.g., image decoding, vision feature extraction) must run before the GPU can perform inference, so the CPU can become saturated while the GPU remains underutilized, limiting overall throughput. vLLM multi-modal performance benchmarks and user reports in the community confirm this pattern, particularly for multi-frame or image-heavy inputs, and restricting the number of CPU cores makes it worse, which is consistent with the slowdown you observed.
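A quick way to confirm this pattern on a running deployment is to watch GPU and CPU utilization side by side while sending requests; a minimal sketch, assuming the server launched with the command above:

# Terminal 1: GPU utilization sampled every second (the "sm" column stays low when CPU-bound).
nvidia-smi dmon -s u -d 1

# Terminal 2: per-thread CPU usage of the vLLM server process.
top -H -p $(pgrep -f "vllm.entrypoints.openai.api_server" | head -n 1)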

To mitigate this, you can try:

  1. Increase the number of CPU cores available for preprocessing.

  2. Scale out the API server (multiple API server processes) to parallelize input processing, though this does not always help and can reduce throughput if the processes contend for the same resources.

  3. Profile your workload to identify the exact bottleneck (e.g., image decoding, feature extraction) and optimize those steps where possible; a sampling-profiler sketch is shown below.

There is currently no built-in vLLM feature to offload or accelerate these CPU-bound steps, and the issue is under active discussion in the community, so maximizing CPU resources and optimizing preprocessing are the main workarounds for now. See the vLLM multi-modal performance benchmarks and the "Low GPU Utilization with Image Payloads" discussion for detailed measurements and user experiments.
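For step (3), one low-overhead approach is to attach a sampling profiler to the running server and see which Python functions dominate CPU time. A minimal sketch using py-spy (a third-party tool, not part of vLLM); <server_pid> is a placeholder:

# Install the sampling profiler.
pip install py-spy

# Live view of the hottest functions in the running server process.
py-spy top --pid <server_pid>

# Or record 60 seconds of samples (including subprocesses) to a flame graph.
py-spy record --pid <server_pid> --subprocesses --duration 60 -o vllm_cpu_profile.svg

If most samples land in image decoding or preprocessing code rather than in the model forward pass, that confirms the CPU-side preprocessing bottleneck described above.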

Would you like a step-by-step breakdown of possible optimizations or profiling strategies?
