Why does data parallel use both GPUs?

RunLLM · March 27, 2026, 10:41am

With --data-parallel-size=2, vLLM launches two engine processes, one per GPU, and each handles requests independently. By design, DP balances whole requests across ranks; it does not split a single request across them. For a dense model, a single request is therefore processed by one DP rank (one GPU), and the other GPU should stay mostly idle. MoE models behave differently: their DP ranks must synchronize on every forward pass, so a rank with no requests still runs dummy forward passes and shows GPU activity. GLM-4.5-Air is an MoE model, not a dense one, so seeing both GPUs busy during a single request is likely this synchronization rather than the request being split (Data Parallel Deployment).
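To check which GPUs are actually busy while a single request is in flight, a small script along these lines may help. It is a minimal sketch, not part of vLLM: it assumes nvidia-smi is on the PATH and that every GPU reports a numeric utilization, and the helper names (parse_gpu_utilization, busy_gpus, read_gpu_utilization) and the 50% threshold are hypothetical choices for illustration.

```python
import subprocess


def parse_gpu_utilization(csv_text: str) -> dict[int, int]:
    """Parse 'index, utilization' CSV rows into {gpu_index: percent}."""
    utilization = {}
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        # Each row looks like "0, 97" when noheader,nounits is used.
        # Assumption: the utilization field is numeric (not "[N/A]").
        index, percent = (field.strip() for field in line.split(","))
        utilization[int(index)] = int(percent)
    return utilization


def busy_gpus(utilization: dict[int, int], threshold: int = 50) -> list[int]:
    """Return the GPU indices at or above the (arbitrary) threshold."""
    return sorted(i for i, pct in utilization.items() if pct >= threshold)


def read_gpu_utilization() -> dict[int, int]:
    """Query nvidia-smi once and return the current utilization per GPU."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=index,utilization.gpu",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_utilization(result.stdout)


if __name__ == "__main__":
    current = read_gpu_utilization()
    print("utilization:", current)
    print("busy GPUs:", busy_gpus(current))
```

Run it while exactly one request is being served. If both indices show up as busy, the second rank is doing work even though it holds no request, which would be consistent with the MoE synchronization described above.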

Regarding logs: in DP mode, tokens/sec logs are produced by each engine process, not the main process. vLLM does not currently forward per-rank logs to the main process automatically, so you need to collect the logs from each engine separately (Data Parallel Deployment). Would you like more detail on troubleshooting DP utilization or log aggregation?
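For the log side, here is a rough sketch of aggregating throughput once each engine's log lines have been collected separately. It assumes the lines contain a phrase of the form "Avg generation throughput: <number> tokens/s"; that format is an assumption and may differ between vLLM versions, so adjust the regex to match your logs. The function names and the rank-to-lines mapping are hypothetical.

```python
import re

# Assumption: each engine logs a phrase like
# "Avg generation throughput: 42.5 tokens/s". Adjust if your logs differ.
THROUGHPUT_PATTERN = re.compile(
    r"Avg generation throughput:\s*([0-9]+(?:\.[0-9]+)?)\s*tokens/s"
)


def latest_throughput(log_lines):
    """Return the most recent generation throughput in the lines, or None."""
    latest = None
    for line in log_lines:
        match = THROUGHPUT_PATTERN.search(line)
        if match:
            latest = float(match.group(1))
    return latest


def aggregate_throughput(logs_by_rank):
    """Map each DP rank to its latest throughput and sum the known values."""
    per_rank = {
        rank: latest_throughput(lines) for rank, lines in logs_by_rank.items()
    }
    total = sum(value for value in per_rank.values() if value is not None)
    return per_rank, total


if __name__ == "__main__":
    # Hypothetical sample lines, one list per DP rank.
    sample = {
        0: ["Avg prompt throughput: 10.0 tokens/s, "
            "Avg generation throughput: 40.0 tokens/s"],
        1: ["Avg prompt throughput: 0.0 tokens/s, "
            "Avg generation throughput: 12.5 tokens/s"],
    }
    print(aggregate_throughput(sample))
```

A rank with no matching line maps to None and is left out of the total, so a silent engine is easy to spot.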

Sources:

Data Parallel Deployment

