Is there a way to control the number of CPU threads allocated for pre-processing tasks when using a GH200, where the inference itself runs on the GPU but the pre-processing steps outlined in the release notes run on the CPU?
As GPUs are getting faster and significantly reducing model execution times, the CPU overhead for tasks like running the API server, scheduling work, preparing inputs, de-tokenizing outputs, and streaming responses to users becomes increasingly pronounced. This issue is particularly noticeable with smaller models like Llama-8B running on NVIDIA H100 GPUs, where execution time on the GPU is as low as ~5ms.
VLLM_CPU_OMP_THREADS_BIND seems targeted towards running inference on a CPU backend and not the pre-processing tasks.
No, there is currently no explicit vLLM parameter to control the number of CPU threads used for pre-processing tasks (such as tokenization, detokenization, the API server, and scheduling) when running inference on a GPU. The VLLM_CPU_OMP_THREADS_BIND environment variable only controls OpenMP threads in the CPU backend; it has no effect on pre-processing in GPU inference mode. The limitation is acknowledged in the vLLM community: CPU bottlenecks in pre-processing are a known issue as GPU speeds increase, especially with small models on fast GPUs like H100s, and for now thread usage in these stages is not directly configurable via vLLM settings or environment variables. See the discussion in vLLM GitHub Issue #14538 and the vLLM documentation.
Would you like more detail on possible workarounds or architectural changes being discussed in the vLLM community?
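As an aside, one OS-level workaround (not a vLLM feature; the core counts and model name below are purely illustrative) is to constrain the whole server process to a fixed set of CPU cores and cap the common intra-op thread pools before constructing the engine:

```python
# Sketch of an OS-level workaround (not a vLLM option): pin the process to a
# fixed set of CPU cores and cap common thread pools before building the engine.
# Core counts and model name are illustrative; adjust for your GH200 system.
import os

os.environ.setdefault("OMP_NUM_THREADS", "8")        # cap OpenMP threads (set before importing torch)
os.sched_setaffinity(0, set(range(8)))               # Linux-only: pin this process to cores 0-7

import torch
torch.set_num_threads(8)                             # cap PyTorch CPU intra-op threads

from vllm import LLM
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # GPU inference proceeds as usual
```

The equivalent for the OpenAI-compatible server is launching it under an affinity wrapper, e.g. `taskset -c 0-7 vllm serve ...` or `numactl`.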
To measure CPU overheads in vLLM, you can use Python’s built-in cProfile module, and vLLM provides helpers like vllm.utils.cprofile (decorator) and vllm.utils.cprofile_context (context manager) to profile specific code sections. These tools allow you to capture and analyze CPU time spent in pre-processing, scheduling, and other Python-side tasks. For analysis, tools like snakeviz can visualize the profiling results. This approach is recommended in the vLLM profiling documentation.
Would you like a code example or more detail on interpreting the results?
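For example, here is a minimal sketch using the standard-library cProfile around an offline generate() call (the model name and prompt are just illustrations; per the vLLM docs, vllm.utils.cprofile_context can wrap the same region instead):

```python
import cProfile
import pstats

from vllm import LLM, SamplingParams  # GPU inference as usual

# Illustrative model choice; any GPU-served model works the same way.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
sampling_params = SamplingParams(max_tokens=64)

profiler = cProfile.Profile()
profiler.enable()
llm.generate(["Explain KV caching in one sentence."] * 32, sampling_params)
profiler.disable()

# Dump stats for snakeviz, and print the top CPU consumers to stdout.
profiler.dump_stats("vllm_cpu_overhead.prof")
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```

The resulting file can then be opened with `snakeviz vllm_cpu_overhead.prof` to see where Python-side time goes (tokenization, scheduling, detokenization, response streaming, etc.).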
You can measure the time spent in each stage (API server, scheduling, input preparation, detokenization, streaming) by instrumenting the vLLM code with Python profiling tools such as cProfile, using the provided vllm.utils.cprofile decorator or context manager. This will give you function-level CPU time breakdowns for the Python code, including API server handling, scheduling, and pre/post-processing steps. See vLLM profiling documentation and profiling code examples.
For more granular, stage-specific timing (e.g., queue, prefill, decode, detokenization), you may need to add custom timestamp logging at key points in the code, as vLLM’s built-in metrics focus on model-side intervals (queue, prefill, decode, inference) but do not natively break down all Python-side stages. See metrics design for details on what is tracked by default.
Would you like a code snippet or guidance on where to add custom timers?
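For reference, a minimal sketch of such custom timing (a hand-rolled helper, not a vLLM API; the placement comments are illustrative):

```python
# Minimal sketch of hand-rolled stage timers (not part of vLLM): a context manager
# that accumulates wall-clock time per named stage, to be placed around the code
# paths you want to attribute (e.g., tokenization, scheduling, detokenization).
import time
from collections import defaultdict
from contextlib import contextmanager

stage_totals = defaultdict(float)

@contextmanager
def stage_timer(name: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_totals[name] += time.perf_counter() - start

# Example placement in your own client/serving code (illustrative only):
# with stage_timer("tokenize"):
#     token_ids = tokenizer.encode(prompt)
# with stage_timer("generate"):
#     outputs = llm.generate(prompts, sampling_params)

def report() -> None:
    # Print stages sorted by total time, largest first.
    for name, seconds in sorted(stage_totals.items(), key=lambda kv: -kv[1]):
        print(f"{name:>12s}: {seconds * 1000:.1f} ms")
```

The same pattern can be dropped into the vLLM source at the points you identify (e.g., around input preparation or detokenization) to get per-stage breakdowns that the built-in metrics do not expose.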