vLLM: High CPU usage when doing nothing

Hi,

Why does the latest vLLM, running in Docker on 24.04, constantly use 100% of two full CPU cores while doing nothing? As soon as the vLLM server has started, the model is loaded, and the log shows INFO: Application startup complete., the CPU cores are fully utilized. It's ROCm 6.4 and 2x 7900 XTX.
I start the Docker container with this:

docker run -it \
  --dns=8.8.8.8 \
  --network=host \
  --group-add=video \
  --ipc=host \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --privileged \
  --device /dev/kfd \
  --device /dev/dri \
  -e ROCM_VISIBLE_DEVICES=0,1,2,3 \
  -e HIP_VISIBLE_DEVICES=0,1,2,3 \
  -e VLLM_USE_TRITON_FLASH_ATTN=0 \
  -e PYTORCH_TUNABLEOP_ENABLED=1 \
  -e HSA_OVERRIDE_GFX_VERSION=11.0.0 \
  -e PYTORCH_ROCM_ARCH="gfx1100" \
  -e GPU_MAX_HW_QUEUES=1 \
  -v /home/ubuntu/vllm_models:/workspace/models \
  rocm/vllm:latest bash

And run the model with this:
vllm serve /workspace/models/gemma-3n-E4B-it/ \
  --tensor-parallel-size 2 \
  --trust-remote-code \
  --kv-cache-dtype auto \
  --max-num-seqs 4 \
  --tokenizer_mode auto \
  --port 8000 \
  --host 0.0.0.0

This is a known issue: by default, vLLM can keep CPU worker threads busy in a polling loop even when idle, leading to 100% CPU usage per worker after startup and model load, even with no requests. This behavior is not specific to ROCm or your hardware, and has been reported by other users as well. The solution is to set the environment variable VLLM_SLEEP_WHEN_IDLE=1 when starting the container, which enables sleeping when idle and reduces CPU usage to near zero when the server is not handling requests. See this issue and solution for details.
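For example, you can export the variable inside the container right before starting the server (or add -e VLLM_SLEEP_WHEN_IDLE=1 to the docker run command above). A minimal sketch reusing your serve command, trimmed to the essential flags:

# Tell idle workers to sleep instead of busy-polling
export VLLM_SLEEP_WHEN_IDLE=1

vllm serve /workspace/models/gemma-3n-E4B-it/ \
  --tensor-parallel-size 2 \
  --port 8000 \
  --host 0.0.0.0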

Would you like more detail on how to set this variable or why this behavior occurs?


Why, with vLLM and ROCm on the 7900 XTX, do I get the following when starting to serve a model:
Capturing CUDA graphs: from 0 to 100%.
Is that normal with an AMD card?

Yes, this is normal. vLLM uses CUDA graph (or ROCm graph) capture to optimize inference performance by tracing and recording execution graphs after model load, which is shown as “Capturing CUDA graphs: from 0 to 100%”. This process occurs on both NVIDIA and AMD (ROCm) GPUs and is expected during server startup before serving requests, as described in the vLLM documentation and confirmed in user logs for ROCm/AMD cards.
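If you ever want to skip the capture step (faster startup, at the cost of some inference speed), vLLM can run in eager mode. A minimal sketch based on your serve command, using the standard --enforce-eager flag:

vllm serve /workspace/models/gemma-3n-E4B-it/ \
  --tensor-parallel-size 2 \
  --enforce-eager \
  --port 8000 \
  --host 0.0.0.0

With --enforce-eager the "Capturing CUDA graphs" progress bar no longer appears at startup.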

Would you like more detail on what CUDA/ROCm graph capture does or how it affects performance?


Why can I run gemma-3n-E4B-it with my setup, but, for example, google/gemma-3-27b-it or the 12B version does not work? I have 2x 7900 XTX.

You can run gemma-3n-E4B-it because it is a much smaller model than google/gemma-3-27b-it or the 12B version, which require significantly more GPU memory. The 27B and 12B models exceed the combined VRAM of your 2x 7900 XTX cards, so vLLM cannot load them, resulting in out-of-memory (OOM) errors or failed initialization. This is a common limitation when the model size is larger than the available GPU memory, even with tensor parallelism enabled (see here, here).
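A rough back-of-the-envelope shows this for the 27B model, assuming bf16/fp16 weights at 2 bytes per parameter (KV cache, activations, and graph-capture buffers come on top of this):

# 27B params x 2 bytes per param vs 2x 24 GB cards
echo "$((27 * 2)) GB of weights alone vs $((2 * 24)) GB total VRAM"   # 54 GB > 48 GB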

Would you like more detail on memory requirements or possible workarounds?
