Why does the latest vLLM, running in Docker (24.04), constantly use 100% of two full CPU cores while doing nothing? As soon as the vLLM server is started, the model is loaded, and the log shows INFO: Application startup complete., the two CPU cores are fully utilized. It's ROCm 6.4 and 2x 7900 XTX.
I start the Docker container with this:
And run the model with this:
vllm serve /workspace/models/gemma-3n-E4B-it/ \
  --tensor-parallel-size 2 \
  --trust-remote-code \
  --kv-cache-dtype auto \
  --max-num-seqs 4 \
  --tokenizer_mode auto \
  --port 8000 \
  --host 0.0.0.0
This is a known issue: by default, vLLM keeps CPU worker threads busy in a polling loop even when idle, which results in 100% CPU usage per worker after startup and model load, even with no requests in flight. The behavior is not specific to ROCm or your hardware and has been reported by other users as well. The fix is to set the environment variable VLLM_SLEEP_WHEN_IDLE=1 when starting the container; it makes the workers sleep when idle and drops CPU usage to near zero while the server is not handling requests. See the corresponding vLLM GitHub issue for details.
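For illustration, one way to pass the variable through docker run (your exact docker command isn't shown above, so the image name, device flags, and paths below are placeholders for a typical ROCm setup):

# Placeholder ROCm container invocation; only the -e VLLM_SLEEP_WHEN_IDLE=1 part is the actual fix
docker run -it \
  --device=/dev/kfd --device=/dev/dri \
  -v /path/to/models:/workspace/models \
  -p 8000:8000 \
  -e VLLM_SLEEP_WHEN_IDLE=1 \
  rocm/vllm:latest

Alternatively, run export VLLM_SLEEP_WHEN_IDLE=1 inside the container before launching vllm serve.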
Would you like more detail on how to set this variable or why this behavior occurs?
Yes, this is normal. After the model loads, vLLM captures CUDA graphs (or their HIP equivalent on ROCm) to optimize inference performance by recording execution graphs for a set of batch shapes, which appears in the log as "Capturing CUDA graphs: from 0 to 100%". This happens on both NVIDIA and AMD (ROCm) GPUs and is expected during server startup before requests are served, as described in the vLLM documentation and confirmed in user logs for ROCm/AMD cards.
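If you prefer to skip the capture phase (faster startup, usually somewhat lower steady-state throughput), vLLM's --enforce-eager flag disables graph capture. A minimal sketch, reusing your serve command:

# Skip CUDA/HIP graph capture; useful while experimenting, at some cost in throughput
vllm serve /workspace/models/gemma-3n-E4B-it/ \
  --tensor-parallel-size 2 \
  --enforce-eager \
  --port 8000 \
  --host 0.0.0.0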
Would you like more detail on what CUDA/ROCm graph capture does or how it affects performance?
You can run gemma-3n-E4B-it because it is a much smaller model than google/gemma-3-27b-it or the 12B variant, which need significantly more GPU memory. In 16-bit precision the 27B model's weights alone take roughly 54 GB, already more than the ~48 GB of combined VRAM on your 2x 7900 XTX cards; the 12B model's weights (~24 GB) fit, but KV cache, activations, and vLLM's memory reservation can still push it over the limit. In both cases vLLM cannot complete loading, resulting in out-of-memory (OOM) errors or failed initialization. This is a common limitation when a model's memory footprint exceeds the available GPU memory, even with tensor parallelism enabled.
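As a rough back-of-the-envelope check (assuming bf16 weights at 2 bytes per parameter and ignoring KV cache and runtime overhead, which only make the picture worse):

# Approximate weight memory at bf16 (2 bytes/param) vs ~48 GB combined VRAM on 2x 7900 XTX
echo "gemma-3-27b-it: ~$((27 * 2)) GB of weights"   # ~54 GB, exceeds 48 GB on its own
echo "gemma-3-12b-it: ~$((12 * 2)) GB of weights"   # ~24 GB, leaves limited headroom for KV cache and overhead
# gemma-3n-E4B-it is far smaller and fits with room to spare

If you still want to try the larger models, common workarounds are a quantized checkpoint or a smaller --max-model-len to shrink the KV cache.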
Would you like more detail on memory requirements or possible workarounds?