Why does data parallel use both GPUs?

Hi!

I am running GLM-4.5-Air-FP8 on two H200 GPUs with --data-parallel-size=2. I am using the vllm-openai Docker image. I can confirm that the model is loaded onto both GPUs.

However, when I now send a request to the model, both GPUs start working at 100% utilization. Why is that? Wouldn’t DP just balance multiple requests across the GPUs? So if there is only a single request, that should go to only one of the GPUs, shouldn’t it?

Also, with DP I am no longer getting the tokens/sec logs. Is there a way to have those forwarded to the main process?

Thank you.

With --data-parallel-size=2, vLLM launches two engine processes, one per GPU, and each handles requests independently. For dense models, a single request is processed by only one DP rank (one GPU), so a lone request should not drive both GPUs to 100% utilization unless it is large enough to saturate both, or there is synchronization overhead (which applies especially to MoE models, but GLM-4.5-Air is dense). If both GPUs are fully utilized by a single request, that may point to a configuration or implementation detail; by design, DP balances whole requests across GPUs rather than splitting one request across them (Data Parallel Deployment).
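To make the balancing behavior concrete, here is a toy sketch (not vLLM's actual scheduler, and the least-loaded policy is an illustrative assumption): data parallelism assigns each request wholly to one rank, so a single request touches only one GPU while multiple requests spread across both.

```python
# Toy sketch of DP request balancing -- NOT vLLM's real scheduler.
# Each request is placed wholly on one rank; requests are never split.
from collections import defaultdict

def assign_requests(requests, num_ranks):
    """Assign each (request_id, num_tokens) pair to the least-loaded DP rank."""
    load = [0] * num_ranks          # outstanding tokens per rank
    placement = defaultdict(list)   # rank -> request ids placed there
    for req_id, num_tokens in requests:
        rank = min(range(num_ranks), key=lambda r: load[r])
        load[rank] += num_tokens
        placement[rank].append(req_id)
    return dict(placement)

# A single request lands on exactly one rank; the other stays idle.
print(assign_requests([("req-0", 512)], num_ranks=2))
# Multiple requests are balanced across both ranks.
print(assign_requests([("a", 512), ("b", 256), ("c", 128)], num_ranks=2))
```

For a dense model this is the whole story, which is why both GPUs lighting up on one request is surprising at first glance.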

Regarding logs: in DP mode, the tokens/sec stats are produced by each engine process, not by the main process. vLLM does not currently forward per-rank logs to the main process automatically; you would need to collect the logs from each engine separately (Data Parallel Deployment). Would you like more detail on troubleshooting DP utilization or log aggregation?
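If you do collect per-rank logs yourself, a small script can sum the per-engine readings into one number. This is a hedged sketch: the regex matches lines shaped like "Avg generation throughput: 42.5 tokens/s", which is approximately what vLLM's stats logger prints, but the exact wording varies between versions, so adjust the pattern to your logs.

```python
# Hedged sketch: aggregate tokens/sec across DP ranks by scraping each
# engine's log text. The log-line format is an assumption -- verify it
# against your vLLM version before relying on this.
import re

THROUGHPUT_RE = re.compile(r"Avg generation throughput:\s*([\d.]+)\s*tokens/s")

def total_throughput(rank_logs):
    """Sum the most recent generation-throughput reading from each rank's log."""
    total = 0.0
    for log_text in rank_logs:
        matches = THROUGHPUT_RE.findall(log_text)
        if matches:
            total += float(matches[-1])   # latest reading per rank
    return total

rank0 = "INFO ... Avg generation throughput: 41.2 tokens/s\n"
rank1 = ("INFO ... Avg generation throughput: 0.0 tokens/s\n"
         "INFO ... Avg generation throughput: 38.7 tokens/s\n")
print(total_throughput([rank0, rank1]))
```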

Hint: Mention @RunLLM in the post for followups.

@RunLLM GLM-4.5-Air-FP8 is MoE. And it fits on one GPU, I’ve tried. So the question stands: when using DP, why are both GPUs utilized? The command used to start the Docker container was:

```shell
docker run --restart unless-stopped \
  -e "TIKTOKEN_RS_CACHE_DIR=/root/.cache/tiktoken" \
  --gpus \"device=0,1\" \
  --ipc=host -it \
  -v /home/user/.cache/huggingface:/root/.cache/huggingface \
  -v /home/user/.cache/tiktoken:/root/.cache/tiktoken \
  vllm/vllm-openai:latest \
  zai-org/GLM-4.5-Air-FP8 \
  --enable-prefix-caching \
  --max-model-len 128000 \
  --gpu-memory-utilization 0.90 \
  --data-parallel-size 2 \
  --tool-call-parser glm45 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 16
```

For MoE models like GLM-4.5-Air-FP8, vLLM’s data-parallel (DP) mode requires all DP ranks (GPUs) to synchronize on every forward pass, even when only one rank has an active request. So when you send a single request, the idle rank runs “dummy” forward passes to stay in lockstep, and both GPUs show utilization (Data Parallel Deployment). This synchronization is needed because the MoE expert layers perform collective communication across ranks, and it is not avoidable in vLLM’s current DP MoE implementation.
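The lockstep requirement can be illustrated with a toy simulation (plain Python, not vLLM code): every rank must take the same number of steps, so a rank with an empty queue substitutes a dummy batch where the real system would still execute the collective ops of the MoE layers.

```python
# Toy illustration of DP lockstep for MoE -- NOT vLLM code.
# All ranks step together; a rank with no pending work runs a "dummy"
# batch, which is why the second GPU is busy even for a single request.

def run_dp_steps(queues):
    """Simulate synchronized forward passes across DP ranks.

    queues: one list of pending batches per rank. Ranks step in lockstep
    until every queue is drained; an idle rank executes a dummy batch.
    """
    steps = [[] for _ in queues]
    while any(queues):
        for rank, q in enumerate(queues):
            batch = q.pop(0) if q else "dummy"   # idle rank still steps
            steps[rank].append(batch)
        # <- in the real system, a collective op synchronizes ranks here
    return steps

# Rank 0 has one real request; rank 1 has none but still takes a step.
print(run_dp_steps([["req-0"], []]))
```

Every rank ends up with the same step count, matching the observation that both GPUs show utilization for one request.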

Would you like more detail on how MoE synchronization works in vLLM?
