Some questions about the Data Parallel examples

I am very interested in data parallelism for offline inference, and I have seen the official example at examples/offline_inference/data_parallel.py in the vllm-project/vllm GitHub repository. I have a few questions:

  1. In this example, since the data is sharded outside the engine, why do we need to set values like VLLM_DP_RANK for the different processes so that communication can be initialized inside the engines? Are there other places where DP-scope communication needs to happen inside the engine?
  2. In a single-machine multi-GPU environment, how does the efficiency of this data parallel setup, which starts multiple engines for inference, compare with that of a single engine performing inference across multiple GPUs?
  3. Can I replace LLMEngine with AsyncLLMEngine based on this data parallel program?

In the official data parallel offline inference example, environment variables such as VLLM_DP_RANK and VLLM_DP_SIZE are set so that each process (engine) knows its data parallel rank and can establish the communication channels needed across DP ranks, for example for expert parallelism (MoE sharding), CUDA graph synchronization, dummy-batch synchronization, and collective operations such as all-reduce. Even though the data itself is sharded outside the engine, these variables are still required so that the DP-related logic inside vLLM is initialized and coordinated correctly for any model or feature that needs cross-rank communication.
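For reference, here is a minimal sketch of that pattern: each worker process sets its DP rank and size before constructing its own LLM and only processes its own shard of the prompts. The model name, master IP/port values, and the one-GPU-per-rank mapping below are placeholder assumptions; the example in the repo is more complete.

```python
import os
from multiprocessing import Process

from vllm import LLM, SamplingParams


def dp_worker(dp_rank: int, dp_size: int, prompts: list[str]) -> None:
    # Each engine learns its DP rank/size from these variables; vLLM uses them
    # to set up the cross-rank coordination mentioned above (MoE/EP sharding,
    # dummy-batch sync, collectives).
    os.environ["VLLM_DP_RANK"] = str(dp_rank)
    os.environ["VLLM_DP_SIZE"] = str(dp_size)
    # The official example also sets a master IP/port so the ranks can
    # rendezvous; the values here are placeholders.
    os.environ["VLLM_DP_MASTER_IP"] = "127.0.0.1"
    os.environ["VLLM_DP_MASTER_PORT"] = "13345"
    # One GPU per DP rank (adjust if each rank uses TP > 1).
    os.environ["CUDA_VISIBLE_DEVICES"] = str(dp_rank)

    # The data is sharded *outside* the engine: each rank takes every
    # dp_size-th prompt.
    shard = prompts[dp_rank::dp_size]

    llm = LLM(model="facebook/opt-125m")  # placeholder model
    for out in llm.generate(shard, SamplingParams(max_tokens=32)):
        print(f"[DP rank {dp_rank}] {out.outputs[0].text!r}")


if __name__ == "__main__":
    dp_size = 2
    prompts = ["Hello, my name is", "The capital of France is"] * 4
    procs = [Process(target=dp_worker, args=(r, dp_size, prompts)) for r in range(dp_size)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```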

On a single machine, running multiple engines (one per DP rank) for data parallel inference is generally less efficient than using a single engine with tensor parallelism across all GPUs, because each engine loads a full model replica, which increases memory usage and can reduce cache efficiency. Data parallelism is best for scaling throughput when you have enough GPU memory to replicate the model; tensor parallelism is preferred for large models that don't fit on a single GPU.

As for replacing LLMEngine with AsyncLLMEngine in this offline DP example: that is not directly supported. AsyncLLMEngine is designed for online serving and streaming, not for the batch-style offline inference pattern shown in the example.

Would you like a step-by-step breakdown of how DP communication works inside the engine, or a deeper comparison of efficiency between DP and TP setups?


So, another basic question:

If I deploy an LLMEngine/AsyncLLMEngine using 8 GPUs, compared to using a single GPU to deploy an LLMEngine/AsyncLLMEngine, will there be a difference in efficiency? If I don’t use TP (tensor parallelism), in a multi-GPU setup, will multiple GPUs be used to initialize the KVCache?

If you deploy an LLMEngine or AsyncLLMEngine on a machine with 8 GPUs but leave tensor parallelism disabled (tensor_parallel_size=1), only one GPU will be used for both the model weights and the KVCache; the other GPUs will remain idle. There is no efficiency gain in this case compared to using a single GPU, and the KVCache will not be initialized across multiple GPUs.

To use multiple GPUs for the model weights and KVCache, you must set tensor_parallel_size to the number of GPUs you want to use; otherwise, vLLM will not automatically distribute the workload or the KVCache across GPUs.
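As a minimal illustration (the model name is a placeholder; pick one that actually needs more than one GPU), the only change required to shard the weights and KVCache across 8 GPUs is the tensor_parallel_size argument:

```python
from vllm import LLM, SamplingParams

# With the default tensor_parallel_size=1, the weights and KVCache live on a
# single GPU. Setting tensor_parallel_size=8 shards both across all 8 GPUs.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder model
    tensor_parallel_size=8,
)

outputs = llm.generate(["The quick brown fox"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```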

Would you like more detail on how tensor parallelism affects KVCache allocation and inference efficiency?


If I have multiple GPUs and I want to use AsyncEngine for online serving, but my model can run efficiently on a single GPU, how can I leverage the computational power of multiple GPUs for inference? If I set TP=2 with two GPUs, will the inference be faster compared to using TP=1 on a single GPU?

If your model fits and runs efficiently on a single GPU, tensor parallelism (TP=2) across two GPUs will not necessarily make inference faster; in fact, it can introduce communication overhead that reduces throughput or increases latency, especially for small models or when the GPUs are not connected via high-speed links like NVLink. For small models, the recommended way to leverage multiple GPUs is to run multiple independent vLLM instances (one per GPU) and distribute requests among them, effectively achieving data parallelism, as sketched below.
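Here is a minimal sketch of that setup, assuming an OpenAI-compatible server per GPU; the model name and ports are arbitrary placeholder choices, and the round-robin client is just for illustration.

```python
# Assumes two servers were started beforehand, one per GPU:
#   CUDA_VISIBLE_DEVICES=0 vllm serve facebook/opt-125m --port 8000
#   CUDA_VISIBLE_DEVICES=1 vllm serve facebook/opt-125m --port 8001
import itertools

import requests

BACKENDS = itertools.cycle(["http://localhost:8000", "http://localhost:8001"])


def complete(prompt: str) -> str:
    # Naive round-robin across the replicas; a real deployment would usually
    # put a load balancer or router process in front instead.
    base = next(BACKENDS)
    resp = requests.post(
        f"{base}/v1/completions",
        json={"model": "facebook/opt-125m", "prompt": prompt, "max_tokens": 32},
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]


for p in ["Hello, my name is", "The capital of France is"]:
    print(complete(p))
```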

Would you like more detail on how to set up data parallel serving for small models?


I noticed that PR #13923 ([V1] AsyncLLM data parallel, by njhill, in vllm-project/vllm on GitHub) adds data parallel support for AsyncLLM. Perhaps the data parallel size can now be passed directly as a parameter.

Yes, PR #13923 adds data parallel support to AsyncLLM (the V1 async engine), allowing you to specify the data parallel size directly as a parameter (e.g., --data-parallel-size N on the command line or data_parallel_size=N in the engine arguments). The engine then launches multiple model replicas and balances incoming requests across them for online serving.
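For illustration, here is a hedged sketch of the programmatic route, assuming a vLLM version that includes this PR and that AsyncEngineArgs accepts data_parallel_size; the import path of the V1 AsyncLLM class may differ between versions, and the model name is a placeholder.

```python
import asyncio

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.v1.engine.async_llm import AsyncLLM  # V1 async engine; path may vary by version


async def main() -> None:
    engine_args = AsyncEngineArgs(
        model="facebook/opt-125m",  # placeholder model
        data_parallel_size=2,       # two engine replicas; requests are balanced across them
    )
    engine = AsyncLLM.from_engine_args(engine_args)

    final = None
    async for output in engine.generate(
        "Hello, my name is",
        SamplingParams(max_tokens=32),
        request_id="req-0",
    ):
        final = output  # outputs stream in; keep the last (finished) one
    print(final.outputs[0].text)


asyncio.run(main())
```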

Would you like more detail on how to configure and use this parameter in practice?
