Why does data parallel use both GPUs?

With --data-parallel-size=2, vLLM launches two engine processes, one per GPU, and each handles requests independently. By design, DP balances whole requests across ranks; it does not split a single request across them. For a dense model, a single request therefore runs on one DP rank and should keep only that GPU busy. MoE models behave differently: the DP ranks are not fully independent, because forward passes are synchronized across ranks, and a rank with no scheduled requests still runs dummy forward passes to stay in step with the others. GLM-4.5-Air is an MoE model, not a dense one, so seeing both GPUs busy during a single request is consistent with this synchronization and does not mean the request is being split across GPUs (Data Parallel Deployment).
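To make the request-level behavior concrete, here is a toy model in Python. It is only a sketch and not vLLM's scheduler: the function names `dispatch` and `busy_ranks` are made up for illustration, round-robin stands in for whatever balancing policy the server actually uses, and the MoE branch simply assumes that every rank steps whenever any rank has work.

```python
from itertools import cycle


def dispatch(request_ids, dp_size):
    """Assign each whole request to exactly one DP rank.

    Round-robin is used purely for illustration; a request is never split.
    """
    ranks = cycle(range(dp_size))
    return {req: next(ranks) for req in request_ids}


def busy_ranks(assignment, dp_size, is_moe):
    """Return the set of ranks that run forward passes in this toy model."""
    active = set(assignment.values())
    if is_moe and active:
        # Assumption: MoE ranks step in lockstep, so ranks without
        # requests run dummy forward passes and also look busy.
        return set(range(dp_size))
    return active


single = dispatch(["req-0"], dp_size=2)
print(busy_ranks(single, dp_size=2, is_moe=False))
print(busy_ranks(single, dp_size=2, is_moe=True))
```

In this model a single request keeps one rank busy for a dense model and both ranks busy for an MoE model, which is the pattern to compare against what nvidia-smi reports.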

Regarding logs: in DP mode, the tokens/sec logs are emitted by each engine process rather than by the main process. vLLM does not currently forward per-rank logs to the main process, so you need to collect them from each engine separately (Data Parallel Deployment). Would you like more detail on troubleshooting DP utilization or log aggregation?
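If you do collect the logs per engine, a small script can combine them. The sketch below assumes you have already captured each rank's output as a list of lines, and that throughput appears as a field shaped like "Avg generation throughput: 12.3 tokens/s". That pattern, the sample lines, and the helper names `latest_throughput` and `total_throughput` are assumptions for illustration, so adjust the regex to whatever your vLLM version actually prints.

```python
import re

# Assumed field format; change this pattern to match your real log lines.
GEN_TPUT = re.compile(r"Avg generation throughput: ([0-9]+(?:\.[0-9]+)?) tokens/s")


def latest_throughput(lines):
    """Return the last generation-throughput value in one engine's log, or None."""
    value = None
    for line in lines:
        match = GEN_TPUT.search(line)
        if match:
            value = float(match.group(1))
    return value


def total_throughput(logs_by_rank):
    """Return per-rank latest values and their sum, skipping ranks with no match."""
    per_rank = {rank: latest_throughput(lines) for rank, lines in logs_by_rank.items()}
    total = sum(v for v in per_rank.values() if v is not None)
    return per_rank, total


# Hypothetical captured output, one list of lines per DP rank.
sample = {
    0: ["INFO Avg generation throughput: 120.5 tokens/s, Running: 3 reqs"],
    1: ["INFO startup complete", "INFO Avg generation throughput: 98.0 tokens/s"],
}
per_rank, total = total_throughput(sample)
print(per_rank, total)
```

Taking only the latest value per rank keeps the sum comparable across ranks that log at slightly different times; averaging over a window would be a reasonable alternative.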
