Is there a way to control the number of CPU threads allocated for pre-processing tasks when using a GH200, where the inference itself runs on the GPU but the pre-processing steps outlined in the release notes run on the CPU?
As GPUs are getting faster and significantly reducing model execution times, the CPU overhead for tasks like running the API server, scheduling work, preparing inputs, de-tokenizing outputs, and streaming responses to users becomes increasingly pronounced. This issue is particularly noticeable with smaller models like Llama-8B running on NVIDIA H100 GPUs, where execution time on the GPU is as low as ~5ms.
VLLM_CPU_OMP_THREADS_BIND seems targeted towards running inference on a CPU backend and not the pre-processing tasks.
No, there is currently no explicit vLLM parameter to control the number of CPU threads used for pre-processing tasks (such as tokenization, detokenization, the API server, and scheduling) when running inference on a GPU. The VLLM_CPU_OMP_THREADS_BIND environment variable only controls OpenMP threads in the CPU backend; it has no effect on pre-processing in GPU inference mode. The limitation is acknowledged in the vLLM community: CPU bottlenecks in pre-processing are a known issue as GPU speeds increase, especially with small models on fast GPUs like H100s, and for now thread usage in these stages is not directly configurable via vLLM settings or environment variables. See the discussion in vLLM GitHub Issue #14538 and the vLLM documentation.
Would you like more detail on possible workarounds or architectural changes being discussed in the vLLM community?
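As an aside, one OS-level workaround (not a vLLM feature; the core counts and model name below are purely illustrative) is to constrain the whole server process to a fixed set of CPU cores and cap the common intra-op thread pools before constructing the engine:

```python
# Sketch of an OS-level workaround (not a vLLM option): pin the process to a
# fixed set of CPU cores and cap common thread pools before building the engine.
# Core counts and model name are illustrative; adjust for your GH200 system.
import os

os.environ.setdefault("OMP_NUM_THREADS", "8")        # cap OpenMP threads (set before importing torch)
os.sched_setaffinity(0, set(range(8)))               # Linux-only: pin this process to cores 0-7

import torch
torch.set_num_threads(8)                             # cap PyTorch CPU intra-op threads

from vllm import LLM
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # GPU inference proceeds as usual
```

The equivalent for the OpenAI-compatible server is launching it under an affinity wrapper, e.g. `taskset -c 0-7 vllm serve ...` or `numactl`.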
To measure CPU overheads in vLLM, you can use Python’s built-in cProfile module, and vLLM provides helpers like vllm.utils.cprofile (decorator) and vllm.utils.cprofile_context (context manager) to profile specific code sections. These tools allow you to capture and analyze CPU time spent in pre-processing, scheduling, and other Python-side tasks. For analysis, tools like snakeviz can visualize the profiling results. This approach is recommended in the vLLM profiling documentation.
Would you like a code example or more detail on interpreting the results?
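For example, here is a minimal sketch using the standard-library cProfile around an offline generate() call (the model name and prompt are just illustrations; per the vLLM docs, vllm.utils.cprofile_context can wrap the same region instead):

```python
import cProfile
import pstats

from vllm import LLM, SamplingParams  # GPU inference as usual

# Illustrative model choice; any GPU-served model works the same way.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
sampling_params = SamplingParams(max_tokens=64)

profiler = cProfile.Profile()
profiler.enable()
llm.generate(["Explain KV caching in one sentence."] * 32, sampling_params)
profiler.disable()

# Dump stats for snakeviz, and print the top CPU consumers to stdout.
profiler.dump_stats("vllm_cpu_overhead.prof")
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```

The resulting file can then be opened with `snakeviz vllm_cpu_overhead.prof` to see where Python-side time goes (tokenization, scheduling, detokenization, response streaming, etc.).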
You can measure the time spent in each stage (API server, scheduling, input preparation, detokenization, streaming) by instrumenting the vLLM code with Python profiling tools such as cProfile, using the provided vllm.utils.cprofile decorator or context manager. This will give you function-level CPU time breakdowns for the Python code, including API server handling, scheduling, and pre/post-processing steps. See vLLM profiling documentation and profiling code examples.
For more granular, stage-specific timing (e.g., queue, prefill, decode, detokenization), you may need to add custom timestamp logging at key points in the code, as vLLM’s built-in metrics focus on model-side intervals (queue, prefill, decode, inference) but do not natively break down all Python-side stages. See metrics design for details on what is tracked by default.
Would you like a code snippet or guidance on where to add custom timers?
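For reference, a minimal sketch of such custom timing (a hand-rolled helper, not a vLLM API; the placement comments are illustrative):

```python
# Minimal sketch of hand-rolled stage timers (not part of vLLM): a context manager
# that accumulates wall-clock time per named stage, to be placed around the code
# paths you want to attribute (e.g., tokenization, scheduling, detokenization).
import time
from collections import defaultdict
from contextlib import contextmanager

stage_totals = defaultdict(float)

@contextmanager
def stage_timer(name: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_totals[name] += time.perf_counter() - start

# Example placement in your own client/serving code (illustrative only):
# with stage_timer("tokenize"):
#     token_ids = tokenizer.encode(prompt)
# with stage_timer("generate"):
#     outputs = llm.generate(prompts, sampling_params)

def report() -> None:
    # Print stages sorted by total time, largest first.
    for name, seconds in sorted(stage_totals.items(), key=lambda kv: -kv[1]):
        print(f"{name:>12s}: {seconds * 1000:.1f} ms")
```

The same pattern can be dropped into the vLLM source at the points you identify (e.g., around input preparation or detokenization) to get per-stage breakdowns that the built-in metrics do not expose.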