I’m using VLLM_CPU_NUM_OF_RESERVED_CPU to reserve CPU cores so I can understand the impact of thread count on serving performance. However, when I set OMP_NUM_THREADS to either 2 or 72 on a GH200, I see no difference in online serving performance.
In those two cases I also set VLLM_CPU_NUM_OF_RESERVED_CPU to 70 and 0, respectively. My configuration and results are below. What could be the reason for this?
The lack of a performance difference when changing OMP_NUM_THREADS from 2 to 72 is likely because vLLM’s CPU backend is designed to bind one OpenMP thread per physical core, and the number of threads actually used for inference is controlled by VLLM_CPU_OMP_THREADS_BIND, not OMP_NUM_THREADS. When VLLM_CPU_OMP_THREADS_BIND is set to “auto”, vLLM binds threads to available cores itself, and OMP_NUM_THREADS is typically ignored by the inference workers; it mainly affects other libraries or non-vLLM contexts. As a result, changing OMP_NUM_THREADS alone does not affect vLLM’s serving performance when VLLM_CPU_OMP_THREADS_BIND is “auto” and thread binding is handled internally by vLLM. The vLLM CPU installation and performance tuning guides document this, recommending VLLM_CPU_OMP_THREADS_BIND for controlling thread binding rather than relying on OMP_NUM_THREADS for vLLM’s own threading behavior (docs, docs).
Additionally, VLLM_CPU_NUM_OF_RESERVED_CPU only reserves cores for the serving framework and does not change the number of threads used for inference. The similar performance in both cases suggests the inference thread count stayed the same, since vLLM’s auto thread-binding logic was in effect. For explicit control, set VLLM_CPU_OMP_THREADS_BIND to a specific core list (e.g., “0-71”) rather than “auto”, and avoid relying on OMP_NUM_THREADS for vLLM’s internal parallelism (docs).
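For example, a launch with explicit binding might look like the following sketch. The model name, KV-cache size, and core ranges here are illustrative placeholders, not values taken from this thread:

```shell
# Pin vLLM's CPU-backend inference workers to cores 0-63, leaving the
# remaining cores free for the API server / scheduler.
export VLLM_CPU_OMP_THREADS_BIND=0-63
# CPU KV cache size in GiB (CPU backend only; value is illustrative).
export VLLM_CPU_KVCACHE_SPACE=40
vllm serve meta-llama/Llama-3.1-8B-Instruct --dtype bfloat16
```

Varying the bound core range (e.g., `0-1` vs. `0-63`) should then change inference throughput, provided the workload is actually running on the CPU backend.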
Would you like a step-by-step breakdown or example commands for explicit thread binding?
@RunLLM
I see no difference even with VLLM_CPU_OMP_THREADS_BIND=0-1 or VLLM_CPU_OMP_THREADS_BIND=0-71. Why is this the case? Either the binding is ineffective, or this specific configuration does not depend on the CPU. I don’t think it’s the latter, given what I read in the release notes:
> CPU overhead for tasks like running the API server, scheduling work, preparing inputs, de-tokenizing outputs, and streaming responses to users becomes increasingly pronounced
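One way to distinguish “binding is ineffective” from “this workload is not CPU-bound” is to inspect the affinity mask and thread count of the server process while it runs. A minimal sketch using only the standard library (Linux-only; the helper name is mine, not from vLLM):

```python
import os

def affinity_summary(pid: int = 0) -> dict:
    """Report how many CPUs a process may run on and how many OS threads
    it currently has. pid=0 means the calling process."""
    allowed = os.sched_getaffinity(pid)  # set of CPU ids the scheduler may use
    target = os.getpid() if pid == 0 else pid
    # Each entry under /proc/<pid>/task is one thread (Linux-specific).
    n_threads = len(os.listdir(f"/proc/{target}/task"))
    return {"allowed_cpus": len(allowed), "threads": n_threads}

print(affinity_summary())
```

Running this against the vLLM server’s PID under each setting would show whether the binding took effect: if `allowed_cpus` stays at the full core count regardless of VLLM_CPU_OMP_THREADS_BIND, the CPU-backend variables are likely being ignored (for instance, if the server is actually running on the GH200’s GPU backend, where they have no effect).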