I’m using VLLM_CPU_NUM_OF_RESERVED_CPU to reserve CPU cores so I can understand the impact of thread count on serving performance. However, when I set OMP_NUM_THREADS to either 2 or 72 on a GH200, I see no difference in online serving performance.
In those two cases I also set VLLM_CPU_NUM_OF_RESERVED_CPU to 70 and 0, respectively. My configuration and results are below. What could be the reason for this?
The lack of a performance difference when changing OMP_NUM_THREADS from 2 to 72 is likely because vLLM’s CPU backend is designed to bind one OpenMP thread per physical core, and the number of threads actually used for inference is controlled by VLLM_CPU_OMP_THREADS_BIND, not OMP_NUM_THREADS. When VLLM_CPU_OMP_THREADS_BIND is set to “auto”, vLLM binds threads to available cores itself, and OMP_NUM_THREADS is typically ignored by the inference workers; it mainly affects other libraries or non-vLLM contexts. As a result, changing OMP_NUM_THREADS alone does not affect vLLM’s serving performance when VLLM_CPU_OMP_THREADS_BIND is “auto” and thread binding is handled internally by vLLM. The vLLM CPU installation and performance tuning guides document this, recommending VLLM_CPU_OMP_THREADS_BIND for controlling thread binding rather than relying on OMP_NUM_THREADS for vLLM’s own threading behavior (docs, docs).
Additionally, VLLM_CPU_NUM_OF_RESERVED_CPU only reserves cores for the serving framework and does not change the number of threads used for inference. The similar performance in both cases suggests the inference thread count stayed the same, since vLLM’s auto thread-binding logic was in effect. For explicit control, set VLLM_CPU_OMP_THREADS_BIND to a specific core list (e.g., “0-71”) rather than “auto”, and avoid relying on OMP_NUM_THREADS for vLLM’s internal parallelism (docs).
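For example, a launch with explicit binding might look like the following sketch. The model name, KV-cache size, and core ranges here are illustrative placeholders, not values taken from this thread:

```shell
# Pin vLLM's CPU-backend inference workers to cores 0-63, leaving the
# remaining cores free for the API server / scheduler.
export VLLM_CPU_OMP_THREADS_BIND=0-63
# CPU KV cache size in GiB (CPU backend only; value is illustrative).
export VLLM_CPU_KVCACHE_SPACE=40
vllm serve meta-llama/Llama-3.1-8B-Instruct --dtype bfloat16
```

Varying the bound core range (e.g., `0-1` vs. `0-63`) should then change inference throughput, provided the workload is actually running on the CPU backend.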
Would you like a step-by-step breakdown or example commands for explicit thread binding?
@RunLLM
I see no difference even with VLLM_CPU_OMP_THREADS_BIND=0-1 or VLLM_CPU_OMP_THREADS_BIND=0-71. Why is this the case? Either the binding is ineffective, or this specific configuration does not depend on the CPU. I don’t think it’s the latter, given what I read in the release notes:
> CPU overhead for tasks like running the API server, scheduling work, preparing inputs, de-tokenizing outputs, and streaming responses to users becomes increasingly pronounced
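One way to distinguish “binding is ineffective” from “this workload is not CPU-bound” is to inspect the affinity mask and thread count of the server process while it runs. A minimal sketch using only the standard library (Linux-only; the helper name is mine, not from vLLM):

```python
import os

def affinity_summary(pid: int = 0) -> dict:
    """Report how many CPUs a process may run on and how many OS threads
    it currently has. pid=0 means the calling process."""
    allowed = os.sched_getaffinity(pid)  # set of CPU ids the scheduler may use
    target = os.getpid() if pid == 0 else pid
    # Each entry under /proc/<pid>/task is one thread (Linux-specific).
    n_threads = len(os.listdir(f"/proc/{target}/task"))
    return {"allowed_cpus": len(allowed), "threads": n_threads}

print(affinity_summary())
```

Running this against the vLLM server’s PID under each setting would show whether the binding took effect: if `allowed_cpus` stays at the full core count regardless of VLLM_CPU_OMP_THREADS_BIND, the CPU-backend variables are likely being ignored (for instance, if the server is actually running on the GH200’s GPU backend, where they have no effect).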