What is the role of the additional process running on GPU 0 in DP+EP?

The script below is used to run a model with EP and DP on a single H100 node. When running this setup, nvidia-smi shows an additional process on GPU 0 allocating about 520 MiB of memory. Is this expected behavior?

vllm serve Qwen/Qwen3-30B-A3B \
  --dtype half \
  --gpu-memory-utilization 0.9 \
  --max-model-len ${MAX_MODEL_LEN} \
  --port $PORT \
  --enable-expert-parallel \
  --tensor-parallel-size 4 \
  --data-parallel-size 2 \
  --all2all-backend pplx \
  --enforce-eager

Yes, this is expected behavior. When running vLLM with expert parallelism (EP) and data parallelism (DP), an additional process, typically the DP coordinator or a front-end process, may appear on GPU 0 holding a small amount of memory (e.g., ~520 MiB).

This process coordinates synchronization and communication between the parallel ranks and does not participate in model inference, so its memory footprint is far smaller than that of the worker processes. This architecture is documented for data/expert-parallel deployments in vLLM: a separate coordinator process manages the DP ranks and synchronizes forward passes across all of them, which matters especially for MoE models with EP enabled. The worker processes consume most of the GPU memory, while the coordinator/front-end process uses a small, fixed amount for communication and orchestration. See the Data Parallel Deployment and Expert Parallel Deployment documentation for more details.

Would you like a step-by-step breakdown of how these processes are launched and what each does?

Hint: Mention @RunLLM in the post for followups.

@RunLLM Can you tell me which part of the vLLM code triggers additional CUDA processes when using DP coordinator?

The additional CUDA process comes from the DP coordinator, implemented in the DPCoordinator class, which is instantiated whenever data parallelism is enabled (--data-parallel-size > 1). The relevant code is in vllm/v1/engine/coordinator.py: the DPCoordinator constructor starts a new process via multiprocessing.Process with DPCoordinatorProc.run_coordinator as the target. That child process may initialize a CUDA context, typically on GPU 0, if it imports torch or otherwise interacts with a CUDA device (source code).
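To see why merely launching a helper process can show up in nvidia-smi, here is a hedged, stdlib-only illustration (hypothetical, not vLLM's mechanism): a child process inherits the parent's environment, so any GPU-aware library it imports at startup would bind a CUDA context on the first visible device; masking CUDA_VISIBLE_DEVICES in the child's environment is the usual way to keep a pure-orchestration process off the GPU.

```python
# Hypothetical illustration; vLLM's coordinator is not launched this way.
import os
import subprocess
import sys

child_code = """
import os
# A library imported here (e.g. torch) would see this device list and could
# create a CUDA context on the first visible device (GPU 0 by default).
print(os.environ.get("CUDA_VISIBLE_DEVICES", "<all devices visible>"))
"""

# Default launch: the child inherits the parent's environment unchanged,
# so GPU 0 is visible to whatever the child imports.
default = subprocess.run([sys.executable, "-c", child_code],
                         capture_output=True, text=True).stdout.strip()

# Hiding the GPUs from a pure-orchestration child avoids the extra context.
env = dict(os.environ, CUDA_VISIBLE_DEVICES="")
hidden = subprocess.run([sys.executable, "-c", child_code],
                        capture_output=True, text=True, env=env).stdout.strip()
print(repr(default), "|", repr(hidden))
```

Whether the coordinator actually allocates a context therefore depends on what its process imports at startup, not on any inference work it does.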

Would you like a detailed code walkthrough of this process creation and how CUDA context is initialized?
