When there are multiple API servers (defined by --api-server-count) and multiple Engine cores (--data-parallel-size-local), how many CPU threads are used for launching GPU kernels? What is the role of the Launcher and the DP Coordinator?
Each engine core (i.e., each data parallel rank) runs in its own process, and those processes are what actually launch GPU kernels. The exact number of CPU threads used for kernel launches per engine core is not spelled out in the documentation, but each engine core process typically uses one main thread to enqueue kernels, with additional threads for internal bookkeeping. The API server processes (set by --api-server-count) handle input processing and request distribution; they do not launch GPU kernels themselves. So the number of kernel-launching processes is roughly --data-parallel-size-local, multiplied by the tensor parallel size when TP > 1, since each TP rank has its own worker process (source, source, source).
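As a rough illustration (not taken from the vLLM docs, and using hypothetical numbers), the kernel-launching process count for a single node can be estimated from the parallelism flags:

```python
# Back-of-the-envelope process count for one node, assuming the model
# described above: one engine core per local DP rank, and one worker
# process (hence one main kernel-launching thread) per TP rank.
data_parallel_size_local = 2   # --data-parallel-size-local (hypothetical value)
tensor_parallel_size = 2       # --tensor-parallel-size (hypothetical value)
api_server_count = 4           # --api-server-count (frontend only, no kernels)

kernel_launching_processes = data_parallel_size_local * tensor_parallel_size
print(f"processes launching GPU kernels: {kernel_launching_processes}")   # -> 4
print(f"API server processes (no kernel launches): {api_server_count}")   # -> 4
```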
The Launcher is responsible for starting the engine core and API server processes, managing their lifecycles, and setting up inter-process communication. The DP Coordinator is a separate process used when data parallelism (DP > 1) is enabled; it collects stats from each engine core, coordinates request “waves” (synchronizing when all ranks are idle or running), and publishes load information to the API servers for intelligent request routing and load balancing (source, source, source).
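To make the coordinator's role concrete, here is a minimal, hypothetical sketch of the pattern it implements (collect per-engine stats, republish an aggregated load view for the frontends). It is not vLLM's actual coordinator.py (linked in the sources below), and the ports and message fields are made up:

```python
# Hypothetical DP-coordinator-style loop using pyzmq: engine cores push
# their queue stats, and the coordinator publishes an aggregated view
# that API servers can use to route new requests to the least-loaded rank.
import zmq

ctx = zmq.Context()
stats_in = ctx.socket(zmq.PULL)   # engine cores push {"rank": i, "waiting": n}
stats_in.bind("tcp://*:5570")
load_out = ctx.socket(zmq.PUB)    # API servers subscribe to load updates
load_out.bind("tcp://*:5571")

load_by_rank: dict[int, int] = {}
while True:
    msg = stats_in.recv_json()                # blocking receive from any rank
    load_by_rank[msg["rank"]] = msg["waiting"]
    load_out.send_json(load_by_rank)          # broadcast the current load picture
```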
Would you like a more detailed breakdown of the process and thread model or the internal communication flow?
Sources:
- https://docs.vllm.ai/en/latest/serving/data_parallel_deployment.html
- https://docs.vllm.ai/en/latest/configuration/optimization.html#__span-1-4
- https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/cli/serve.py
- https://docs.vllm.ai/en/latest/api/vllm/v1/engine/coordinator.html
- https://github.com/vllm-project/vllm/blob/main/vllm/v1/engine/coordinator.py
Hint: Mention @RunLLM in the post for followups.
@RunLLM what are the usecases for multiple Engine Cores?
Multiple Engine Cores (i.e., multiple data parallel ranks) are used to increase throughput by replicating the model across several GPUs or nodes, allowing independent batches of requests to be processed in parallel. This is especially beneficial for serving many concurrent users, maximizing GPU utilization, and scaling up inference in both dense and Mixture-of-Experts (MoE) models (Data Parallel Deployment, Parallelism and Scaling).
Use cases include high-throughput online serving, multi-user environments, and scenarios where isolation between request batches is needed. For MoE models, multiple engine cores enable efficient expert parallelism and synchronization across expert layers (Data Parallel Deployment).
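If it helps, here is a small hypothetical client that fires many independent requests at a vLLM OpenAI-compatible endpoint (assumed to be running at http://localhost:8000/v1 with an arbitrary example model name); this concurrent, multi-user traffic is exactly what multiple engine cores are meant to absorb:

```python
# Hypothetical load generator: many concurrent, independent requests are
# the workload that data parallel replicas (multiple engine cores) spread out.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def one_request(i: int) -> str:
    resp = await client.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # whatever model is being served
        prompt=f"Request {i}: summarize data parallelism in one sentence.",
        max_tokens=64,
    )
    return resp.choices[0].text

async def main() -> None:
    results = await asyncio.gather(*(one_request(i) for i in range(64)))
    print(f"completed {len(results)} independent requests")

asyncio.run(main())
```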
Would you like more detail or example deployment scenarios?
Sources:
- https://docs.vllm.ai/en/latest/serving/data_parallel_deployment.html
- https://docs.vllm.ai/en/latest/serving/parallelism_scaling.html
Hint: Mention @RunLLM in the post for followups.
@RunLLM If my request/s is already saturated then this is unlikely to bring any benefit right?
Correct: if a single engine core is already saturated (i.e., fully utilizing its GPU at maximum throughput), adding more engine cores will not make any individual request faster, because each request is still served by one engine core. Data parallelism improves aggregate throughput when there are more concurrent, independent requests than one replica can absorb and additional GPUs are available to run the extra replicas; it does not speed up a single request (Data Parallel Deployment).
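One practical way to check saturation: scrape the server's Prometheus /metrics endpoint and watch the running vs. waiting request gauges. A minimal sketch, assuming the server is at localhost:8000 and the metric names match your vLLM version:

```python
# Rough saturation check against a vLLM OpenAI-compatible server.
# If the waiting-queue gauge stays near zero, one engine core is keeping up
# and extra DP replicas would mostly sit idle; a persistently growing
# waiting queue is the signal that scaling out can raise throughput.
import requests

text = requests.get("http://localhost:8000/metrics", timeout=5).text
for line in text.splitlines():
    if line.startswith(("vllm:num_requests_running", "vllm:num_requests_waiting")):
        print(line)
```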
Would you like more detail on how to measure saturation or when scaling out is beneficial?
Sources:
- https://docs.vllm.ai/en/latest/serving/data_parallel_deployment.html
Hint: Mention @RunLLM in the post for followups.