Why does data parallel use both GPUs?

Hi!

I am running GLM-4.5-Air-FP8 on two H200 GPUs with --data-parallel-size=2. I am using the vllm-openai Docker image. I can confirm that the model is loaded onto both GPUs.

However, when I now send a request to the model, both GPUs start working at 100% utilization. Why is that? Wouldn’t DP just balance multiple requests across the GPUs? So if there is only a single request, that should go to only one of the GPUs, shouldn’t it?

Also, with DP I am no longer getting the tokens/sec logs. Is there a way to have those forwarded to the main process?

Thank you.

With --data-parallel-size=2, vLLM launches two engine processes, one per GPU, and each handles requests independently. For dense models, a single request is processed by only one DP rank (one GPU), so a lone request should not drive both GPUs to 100% utilization unless it is large enough to saturate both, or there is synchronization overhead (which applies especially to MoE models, but GLM-4.5-Air is dense). If both GPUs are fully utilized by a single request, that may point to a configuration or implementation detail; by design, DP balances whole requests across GPUs rather than splitting one request across them (Data Parallel Deployment).
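To make the balancing behavior concrete, here is a toy sketch (not vLLM's actual scheduler, and the least-loaded policy is an illustrative assumption): data parallelism assigns each request wholly to one rank, so a single request touches only one GPU while multiple requests spread across both.

```python
# Toy sketch of DP request balancing -- NOT vLLM's real scheduler.
# Each request is placed wholly on one rank; requests are never split.
from collections import defaultdict

def assign_requests(requests, num_ranks):
    """Assign each (request_id, num_tokens) pair to the least-loaded DP rank."""
    load = [0] * num_ranks          # outstanding tokens per rank
    placement = defaultdict(list)   # rank -> request ids placed there
    for req_id, num_tokens in requests:
        rank = min(range(num_ranks), key=lambda r: load[r])
        load[rank] += num_tokens
        placement[rank].append(req_id)
    return dict(placement)

# A single request lands on exactly one rank; the other stays idle.
print(assign_requests([("req-0", 512)], num_ranks=2))
# Multiple requests are balanced across both ranks.
print(assign_requests([("a", 512), ("b", 256), ("c", 128)], num_ranks=2))
```

For a dense model this is the whole story, which is why both GPUs lighting up on one request is surprising at first glance.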

Regarding logs: in DP mode, the tokens/sec stats are produced by each engine process, not by the main process. vLLM does not currently forward per-rank logs to the main process automatically; you would need to collect the logs from each engine separately (Data Parallel Deployment). Would you like more detail on troubleshooting DP utilization or log aggregation?
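If you do collect per-rank logs yourself, a small script can sum the per-engine readings into one number. This is a hedged sketch: the regex matches lines shaped like "Avg generation throughput: 42.5 tokens/s", which is approximately what vLLM's stats logger prints, but the exact wording varies between versions, so adjust the pattern to your logs.

```python
# Hedged sketch: aggregate tokens/sec across DP ranks by scraping each
# engine's log text. The log-line format is an assumption -- verify it
# against your vLLM version before relying on this.
import re

THROUGHPUT_RE = re.compile(r"Avg generation throughput:\s*([\d.]+)\s*tokens/s")

def total_throughput(rank_logs):
    """Sum the most recent generation-throughput reading from each rank's log."""
    total = 0.0
    for log_text in rank_logs:
        matches = THROUGHPUT_RE.findall(log_text)
        if matches:
            total += float(matches[-1])   # latest reading per rank
    return total

rank0 = "INFO ... Avg generation throughput: 41.2 tokens/s\n"
rank1 = ("INFO ... Avg generation throughput: 0.0 tokens/s\n"
         "INFO ... Avg generation throughput: 38.7 tokens/s\n")
print(total_throughput([rank0, rank1]))
```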

Hint: Mention @RunLLM in the post for followups.

@RunLLM GLM-4.5-Air-FP8 is MoE. And it fits on one GPU, I’ve tried. So the question stands: when using DP, why are both GPUs utilized? The command used to start the Docker container was:

```shell
docker run --restart unless-stopped \
  -e "TIKTOKEN_RS_CACHE_DIR=/root/.cache/tiktoken" \
  --gpus \"device=0,1\" \
  --ipc=host -it \
  -v /home/user/.cache/huggingface:/root/.cache/huggingface \
  -v /home/user/.cache/tiktoken:/root/.cache/tiktoken \
  vllm/vllm-openai:latest \
  zai-org/GLM-4.5-Air-FP8 \
  --enable-prefix-caching \
  --max-model-len 128000 \
  --gpu-memory-utilization 0.90 \
  --data-parallel-size 2 \
  --tool-call-parser glm45 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 16
```

For MoE models like GLM-4.5-Air-FP8, vLLM’s data-parallel (DP) mode requires all DP ranks (GPUs) to synchronize on every forward pass, even when only one rank has an active request. So when you send a single request, the idle rank runs “dummy” forward passes to stay in lockstep, and both GPUs show utilization (Data Parallel Deployment). This synchronization is needed because the MoE expert layers perform collective communication across ranks, and it is not avoidable in vLLM’s current DP MoE implementation.
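The lockstep requirement can be illustrated with a toy simulation (plain Python, not vLLM code): every rank must take the same number of steps, so a rank with an empty queue substitutes a dummy batch where the real system would still execute the collective ops of the MoE layers.

```python
# Toy illustration of DP lockstep for MoE -- NOT vLLM code.
# All ranks step together; a rank with no pending work runs a "dummy"
# batch, which is why the second GPU is busy even for a single request.

def run_dp_steps(queues):
    """Simulate synchronized forward passes across DP ranks.

    queues: one list of pending batches per rank. Ranks step in lockstep
    until every queue is drained; an idle rank executes a dummy batch.
    """
    steps = [[] for _ in queues]
    while any(queues):
        for rank, q in enumerate(queues):
            batch = q.pop(0) if q else "dummy"   # idle rank still steps
            steps[rank].append(batch)
        # <- in the real system, a collective op synchronizes ranks here
    return steps

# Rank 0 has one real request; rank 1 has none but still takes a step.
print(run_dp_steps([["req-0"], []]))
```

Every rank ends up with the same step count, matching the observation that both GPUs show utilization for one request.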

Would you like more detail on how MoE synchronization works in vLLM?
