High CPU utilization during vLLM inference with multi-modal models such as InternVL-2.5-1B is a known bottleneck. The usual cause is CPU-bound preprocessing (e.g., image decoding, vision feature extraction) that must complete before the GPU can run inference, so the CPU saturates while the GPU sits underutilized. This is most pronounced for video and multi-frame or image-heavy inputs, as confirmed by the vLLM multi-modal performance benchmarks and community reports. Restricting the number of CPU cores makes it worse, which matches the much slower inference times you observed.
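If you want a rough sense of how expensive the CPU-side image work is on its own, timing a decode-and-resize loop is often enough to show it. This is only a minimal sketch under assumed parameters (a synthetic 1920x1080 JPEG, a 448x448 target size, 256 frames); none of these values come from the InternVL-2.5 config, and the real vLLM preprocessing path does more than this.

```python
# Minimal sketch: quantify raw CPU preprocessing cost (decode + resize)
# independently of the GPU. All sizes and counts below are illustrative
# assumptions, not values taken from the InternVL-2.5 configuration.
import io
import time

from PIL import Image

# Build a synthetic JPEG in memory so the sketch is self-contained.
buf = io.BytesIO()
Image.new("RGB", (1920, 1080), color=(128, 64, 32)).save(buf, format="JPEG")
jpeg_bytes = buf.getvalue()

n_frames = 256  # e.g., a multi-frame / video-style request
start = time.perf_counter()
for _ in range(n_frames):
    img = Image.open(io.BytesIO(jpeg_bytes))
    img = img.convert("RGB").resize((448, 448))  # assumed ViT-style input size
elapsed = time.perf_counter() - start
print(f"decode+resize: {elapsed:.2f}s total, {elapsed / n_frames * 1e3:.1f} ms/frame")
```

If this per-frame cost multiplied by your request rate already exceeds what your CPU cores can sustain, the GPU will inevitably sit idle waiting for inputs.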
To mitigate this, you can try:

1. Increase the number of CPU cores available for preprocessing; this directly raises preprocessing throughput.
2. Scale out the API server (run multiple API server processes) to parallelize input processing. This does not always help and can even reduce throughput when the processes contend for the same CPU resources.
3. Profile your workload to identify the exact bottleneck (e.g., image decoding vs. feature extraction) and optimize that step where possible; a minimal profiling sketch follows below.

There is currently no built-in vLLM feature to offload or accelerate these CPU-bound preprocessing steps, and the issue is under active discussion in the community. For now, maximizing CPU resources and optimizing preprocessing are the main workarounds. See the vLLM multi-modal performance benchmarks and "Low GPU Utilization with Image Payloads" for detailed discussion and user experiments.
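For step 3, one way to see where the CPU time goes is to wrap a single offline generation call in cProfile and inspect the cumulative hot spots. The sketch below assumes the vLLM offline `LLM` API with the dict-based multi-modal input format and an InternVL-style chat template; verify the prompt format against the InternVL-2.5-1B model card, and adjust the model name and image path to your setup.

```python
# Minimal profiling sketch, assuming vLLM's offline multi-modal input format
# ({"prompt": ..., "multi_modal_data": {"image": ...}}). The prompt template
# below is an assumption based on InternVL-style chat formatting.
import cProfile
import pstats

from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="OpenGVLab/InternVL2_5-1B", trust_remote_code=True)
image = Image.open("frame.jpg").convert("RGB")  # replace with your own input
prompt = (
    "<|im_start|>user\n<image>\nDescribe the image.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

profiler = cProfile.Profile()
profiler.enable()
llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(max_tokens=64),
)
profiler.disable()

# Print the top cumulative entries; heavy time in PIL / image processor code
# rather than the model forward pass points at a CPU-side bottleneck.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(25)
```

If the profile shows most of the time in image decoding or the HuggingFace processor rather than the engine itself, that confirms the CPU-bound preprocessing described above and tells you which step is worth optimizing first.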
Would you like a step-by-step breakdown of possible optimizations or profiling strategies?