What is the role of the additional process running on GPU 0 in DP+EP?

The script below is used to run a model with EP and DP on a single H100 node. When running this setup, nvidia-smi shows an additional process on GPU 0 allocating about 520 MiB of memory. Is this expected behavior?

vllm serve Qwen/Qwen3-30B-A3B \
  --dtype half \
  --gpu-memory-utilization 0.9 \
  --max-model-len ${MAX_MODEL_LEN} \
  --port $PORT \
  --enable-expert-parallel \
  --tensor-parallel-size 4 \
  --data-parallel-size 2 \
  --all2all-backend pplx \
  --enforce-eager

Yes, this is expected behavior. When running vLLM with expert parallelism (EP) and data parallelism (DP), an additional process, typically the DP coordinator or a front-end process, may appear on GPU 0 holding a small amount of memory (e.g., ~520 MiB).

This process coordinates synchronization and communication between the parallel ranks and does not participate in model inference, so its memory footprint is far smaller than that of the worker processes. This architecture is documented for data/expert-parallel deployments in vLLM: a separate coordinator process manages the DP ranks and synchronizes forward passes across all of them, which matters especially for MoE models with EP enabled. The worker processes consume most of the GPU memory, while the coordinator/front-end process uses a small, fixed amount for communication and orchestration. See the Data Parallel Deployment and Expert Parallel Deployment documentation for more details.

Would you like a step-by-step breakdown of how these processes are launched and what each does?

Hint: Mention @RunLLM in the post for followups.

@RunLLM Can you tell me which part of the vLLM code triggers additional CUDA processes when using DP coordinator?

The additional CUDA process comes from the DP coordinator, implemented in the DPCoordinator class, which is instantiated whenever data parallelism is enabled (--data-parallel-size > 1). The relevant code is in vllm/v1/engine/coordinator.py: the DPCoordinator constructor starts a new process via multiprocessing.Process with DPCoordinatorProc.run_coordinator as the target. That child process may initialize a CUDA context, typically on GPU 0, if it imports torch or otherwise interacts with a CUDA device (source code).
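To see why merely launching a helper process can show up in nvidia-smi, here is a hedged, stdlib-only illustration (hypothetical, not vLLM's mechanism): a child process inherits the parent's environment, so any GPU-aware library it imports at startup would bind a CUDA context on the first visible device; masking CUDA_VISIBLE_DEVICES in the child's environment is the usual way to keep a pure-orchestration process off the GPU.

```python
# Hypothetical illustration; vLLM's coordinator is not launched this way.
import os
import subprocess
import sys

child_code = """
import os
# A library imported here (e.g. torch) would see this device list and could
# create a CUDA context on the first visible device (GPU 0 by default).
print(os.environ.get("CUDA_VISIBLE_DEVICES", "<all devices visible>"))
"""

# Default launch: the child inherits the parent's environment unchanged,
# so GPU 0 is visible to whatever the child imports.
default = subprocess.run([sys.executable, "-c", child_code],
                         capture_output=True, text=True).stdout.strip()

# Hiding the GPUs from a pure-orchestration child avoids the extra context.
env = dict(os.environ, CUDA_VISIBLE_DEVICES="")
hidden = subprocess.run([sys.executable, "-c", child_code],
                        capture_output=True, text=True, env=env).stdout.strip()
print(repr(default), "|", repr(hidden))
```

Whether the coordinator actually allocates a context therefore depends on what its process imports at startup, not on any inference work it does.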

Would you like a detailed code walkthrough of this process creation and how CUDA context is initialized?
