vLLM: High CPU usage when doing nothing

Hi,

Why does the latest vLLM, running in Docker on 24.04, constantly use 100% of two full CPU cores while doing nothing? As soon as the vLLM server has started, the model is loaded, and the log shows INFO: Application startup complete., the CPU cores are fully utilized. It's ROCm 6.4 and 2x 7900 XTX.
I start the Docker container with this:

docker run -it \
  --dns=8.8.8.8 \
  --network=host \
  --group-add=video \
  --ipc=host \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --privileged \
  --device /dev/kfd \
  --device /dev/dri \
  -e ROCM_VISIBLE_DEVICES=0,1,2,3 \
  -e HIP_VISIBLE_DEVICES=0,1,2,3 \
  -e VLLM_USE_TRITON_FLASH_ATTN=0 \
  -e PYTORCH_TUNABLEOP_ENABLED=1 \
  -e HSA_OVERRIDE_GFX_VERSION=11.0.0 \
  -e PYTORCH_ROCM_ARCH="gfx1100" \
  -e GPU_MAX_HW_QUEUES=1 \
  -v /home/ubuntu/vllm_models:/workspace/models \
  rocm/vllm:latest bash

And run the model with this:
vllm serve /workspace/models/gemma-3n-E4B-it/ \
  --tensor-parallel-size 2 \
  --trust-remote-code \
  --kv-cache-dtype auto \
  --max-num-seqs 4 \
  --tokenizer_mode auto \
  --port 8000 \
  --host 0.0.0.0

This is a known issue: by default, vLLM can keep CPU worker threads busy in a polling loop even when idle, leading to 100% CPU usage per worker after startup and model load, even with no requests. This behavior is not specific to ROCm or your hardware, and has been reported by other users as well. The solution is to set the environment variable VLLM_SLEEP_WHEN_IDLE=1 when starting the container, which enables sleeping when idle and reduces CPU usage to near zero when the server is not handling requests. See this issue and solution for details.
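For example, you can export the variable inside the container right before starting the server (or add -e VLLM_SLEEP_WHEN_IDLE=1 to the docker run command above). A minimal sketch reusing your serve command, trimmed to the essential flags:

# Tell idle workers to sleep instead of busy-polling
export VLLM_SLEEP_WHEN_IDLE=1

vllm serve /workspace/models/gemma-3n-E4B-it/ \
  --tensor-parallel-size 2 \
  --port 8000 \
  --host 0.0.0.0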

Would you like more detail on how to set this variable or why this behavior occurs?


Why, with vLLM and ROCm on the 7900 XTX, do I get the following when starting to serve a model:
Capturing CUDA graphs: from 0 to 100%.
Is that normal with an AMD card?

Yes, this is normal. vLLM uses CUDA graph (or ROCm graph) capture to optimize inference performance by tracing and recording execution graphs after model load, which is shown as “Capturing CUDA graphs: from 0 to 100%”. This process occurs on both NVIDIA and AMD (ROCm) GPUs and is expected during server startup before serving requests, as described in the vLLM documentation and confirmed in user logs for ROCm/AMD cards.
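If you ever want to skip the capture step (faster startup, at the cost of some inference speed), vLLM can run in eager mode. A minimal sketch based on your serve command, using the standard --enforce-eager flag:

vllm serve /workspace/models/gemma-3n-E4B-it/ \
  --tensor-parallel-size 2 \
  --enforce-eager \
  --port 8000 \
  --host 0.0.0.0

With --enforce-eager the "Capturing CUDA graphs" progress bar no longer appears at startup.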

Would you like more detail on what CUDA/ROCm graph capture does or how it affects performance?


Why can I run gemma-3n-E4B-it with my setup, but, for example, google/gemma-3-27b-it or the 12B version does not work? I have 2x 7900 XTX.

You can run gemma-3n-E4B-it because it is a much smaller model than google/gemma-3-27b-it or the 12B version, which require significantly more GPU memory. The 27B and 12B models exceed the combined VRAM of your 2x 7900 XTX cards, so vLLM cannot load them, resulting in out-of-memory (OOM) errors or failed initialization. This is a common limitation when the model size is larger than the available GPU memory, even with tensor parallelism enabled (see here, here).
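A rough back-of-the-envelope shows this for the 27B model, assuming bf16/fp16 weights at 2 bytes per parameter (KV cache, activations, and graph-capture buffers come on top of this):

# 27B params x 2 bytes per param vs 2x 24 GB cards
echo "$((27 * 2)) GB of weights alone vs $((2 * 24)) GB total VRAM"   # 54 GB > 48 GB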

Would you like more detail on memory requirements or possible workarounds?
