CPU utilization is extremely high during inference and becomes the primary performance bottleneck

Environment:

  • vLLM: 0.5.1

  • PyTorch: 2.3.0+cu121

  • Transformers: 4.41.2

  • GPU: NVIDIA L4

  • CPU: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz (Max clock: 3400 MHz)

Deployment Command:

python -m vllm.entrypoints.openai.api_server \
    --served-model-name internvl2_5 \
    --model internvl2_5_1B \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.9 \
    --port 9084 \
    --trust-remote-code \
    --max-model-len 2432 \
    --max-num-seqs 2 \
    --max-num-batched-tokens 4864

Problem Description:

I am experiencing a severe CPU bottleneck when deploying the InternVL-2.5-1B model using vLLM on an NVIDIA L4 GPU.

The CPU utilization is extremely high during inference. To diagnose this, I conducted an experiment:

  1. Unrestricted CPU: The inference time is approximately 400ms.

  2. CPU Limited to 4 Cores: The inference time increases to 1200ms, three times slower (one way to apply such a core limit is sketched after this list).
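(One common way to apply this kind of core restriction, for anyone reproducing the experiment, is to pin the server process to a fixed set of cores with taskset; the core IDs below are illustrative.)

# Pin the vLLM server to 4 cores (0-3) at launch; all other flags unchanged.
taskset -c 0-3 python -m vllm.entrypoints.openai.api_server \
    --served-model-name internvl2_5 \
    --model internvl2_5_1B \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.9 \
    --port 9084 \
    --trust-remote-code \
    --max-model-len 2432 \
    --max-num-seqs 2 \
    --max-num-batched-tokens 4864

# Or restrict an already running server by PID:
# taskset -cp 0-3 <server_pid>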

This clearly indicates that the CPU has become the primary performance bottleneck, severely limiting the overall throughput of the system.

How can this critical CPU contention issue be resolved?

High CPU utilization during vLLM inference is a known bottleneck for multi-modal models like InternVL-2.5-1B: CPU-bound preprocessing (e.g., image decoding, vision feature extraction) must run before the GPU can perform inference, so the CPU can become saturated while the GPU remains underutilized, limiting overall throughput. vLLM multi-modal performance benchmarks and user reports in the community confirm this pattern, particularly for multi-frame or image-heavy inputs, and restricting the number of CPU cores makes it worse, which is consistent with the slowdown you observed.
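A quick way to confirm this pattern on a running deployment is to watch GPU and CPU utilization side by side while sending requests; a minimal sketch, assuming the server launched with the command above:

# Terminal 1: GPU utilization sampled every second (the "sm" column stays low when CPU-bound).
nvidia-smi dmon -s u -d 1

# Terminal 2: per-thread CPU usage of the vLLM server process.
top -H -p $(pgrep -f "vllm.entrypoints.openai.api_server" | head -n 1)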

To mitigate this, you can try:

  1. Increase the number of CPU cores available for preprocessing.

  2. Scale out the API server (multiple API server processes) to parallelize input processing, though this does not always help and can reduce throughput if the processes contend for the same resources.

  3. Profile your workload to identify the exact bottleneck (e.g., image decoding, feature extraction) and optimize those steps where possible; a sampling-profiler sketch is shown below.

There is currently no built-in vLLM feature to offload or accelerate these CPU-bound steps, and the issue is under active discussion in the community, so maximizing CPU resources and optimizing preprocessing are the main workarounds for now. See the vLLM multi-modal performance benchmarks and the "Low GPU Utilization with Image Payloads" discussion for detailed measurements and user experiments.
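For step (3), one low-overhead approach is to attach a sampling profiler to the running server and see which Python functions dominate CPU time. A minimal sketch using py-spy (a third-party tool, not part of vLLM); <server_pid> is a placeholder:

# Install the sampling profiler.
pip install py-spy

# Live view of the hottest functions in the running server process.
py-spy top --pid <server_pid>

# Or record 60 seconds of samples (including subprocesses) to a flame graph.
py-spy record --pid <server_pid> --subprocesses --duration 60 -o vllm_cpu_profile.svg

If most samples land in image decoding or preprocessing code rather than in the model forward pass, that confirms the CPU-side preprocessing bottleneck described above.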

Would you like a step-by-step breakdown of possible optimizations or profiling strategies?
