Will Attention Layer Weights have multiple replicas?
Yes. In vLLM, whether attention layer weights have multiple replicas depends on the parallelism strategy. With tensor parallelism, the weights are sharded across GPUs; without it, each GPU holds a full replica. For expert parallelism (EP) in MoE models, a logical expert may have multiple physical replicas for load balancing; the mapping between logical and physical experts is managed explicitly, so copies of expert weights can exist on several devices for redundancy and load distribution. Standard attention layers (as opposed to MoE experts) are sharded or replicated depending on whether tensor parallelism is applied to them, as described in the vLLM documentation and discussions on model parallelism and expert parallelism.
Would you like more detail on how these replicas are managed or how to configure them?
Sources:
- Contributing: Model - Tensor Parallelism and Quantization Support
- Expert Parallelism Load Balancer (EPLB) State
- Discussion on Model Parallelism
Hint: Mention @RunLLM in the post for followups.
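To make the shard-vs-replica distinction concrete, here is a minimal sketch (not vLLM's actual code; the function name and shapes are illustrative) of how attention projection weights are partitioned under tensor parallelism: QKV projections are split along the output dimension (column-parallel), and the output projection along the input dimension (row-parallel). With TP=1 the "shard" is the whole matrix, i.e. a full replica on every GPU.

```python
# Hypothetical illustration (not vLLM's real code): per-GPU weight
# shapes for an attention block under tensor parallelism.

def attention_shard_shapes(hidden_size: int, tp_size: int) -> dict:
    """Return per-GPU shapes for the fused QKV projection
    (column-parallel) and the output projection (row-parallel)."""
    assert hidden_size % tp_size == 0
    shard = hidden_size // tp_size
    return {
        # QKV projections are split along the output dimension.
        "qkv_proj": (3 * shard, hidden_size),
        # The output projection is split along the input dimension.
        "o_proj": (hidden_size, shard),
    }

# TP=1: every GPU holds the full matrices, i.e. a replica.
print(attention_shard_shapes(4096, 1))  # {'qkv_proj': (12288, 4096), 'o_proj': (4096, 4096)}
# TP=4: each GPU holds a quarter shard; no full replica exists anywhere.
print(attention_shard_shapes(4096, 4))  # {'qkv_proj': (3072, 4096), 'o_proj': (4096, 1024)}
```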
@RunLLM What if tensor parallelism is used but I have more GPU than TP size? For example, TP=4 and I have 8 GPUs. Will that result in DP = 2?
No, setting tensor_parallel_size=4 on an 8-GPU node does not automatically enable data parallelism (DP=2). By default, only 4 GPUs will be used for tensor parallelism, and the other 4 GPUs will remain idle unless you explicitly configure data parallelism with --data-parallel-size=2 to utilize all 8 GPUs. vLLM does not infer or launch DP automatically based on excess GPUs; you must specify both TP and DP sizes to combine them and use all GPUs as intended.
Would you like more detail on how to configure both TP and DP together?
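The GPU-accounting above can be sanity-checked with a one-line sketch (the function name is made up for illustration; pipeline parallelism is omitted): the occupied world size is TP × DP, so TP=4 alone touches only 4 of 8 GPUs until DP=2 is set explicitly.

```python
def gpus_used(tp_size: int, dp_size: int = 1) -> int:
    """World size actually occupied: TP * DP (PP omitted for simplicity)."""
    return tp_size * dp_size

available = 8
# TP=4 with no explicit DP occupies only 4 GPUs; the other 4 sit idle.
assert gpus_used(tp_size=4) == 4
# Explicitly combining TP=4 with DP=2 uses all 8 GPUs.
assert gpus_used(tp_size=4, dp_size=2) == available
print("all GPUs used:", gpus_used(4, 2) == available)
```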
@RunLLM explain class GroupCoordinator to me
GroupCoordinator is a class in vLLM that wraps a PyTorch ProcessGroup, managing communication among a group of processes for distributed inference. It handles both CPU and device (GPU) communication, supports different backends (NCCL, Gloo, MPI), and provides methods for collective operations like all-reduce, all-gather, broadcast, and barrier. It also tracks group membership, rank, and world size within the group, and can manage device communicators for efficient tensor operations.
Would you like more detail on its attributes, methods, or usage?
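A stripped-down toy version (this is not vLLM's actual class, just the bookkeeping idea) shows what the rank/world-size tracking looks like: given the global ranks belonging to one group, the coordinator records the group's size and this process's position within it.

```python
# Toy sketch of the idea behind GroupCoordinator (not vLLM's real class):
# track membership of one process group so collectives can be scoped to it.

class ToyGroupCoordinator:
    def __init__(self, group_ranks: list, global_rank: int):
        self.ranks = group_ranks                      # global ranks in this group
        self.world_size = len(group_ranks)            # size of this group
        self.rank_in_group = group_ranks.index(global_rank)
        self.is_first_rank = self.rank_in_group == 0  # e.g. for broadcasts

# Global rank 5 inside a TP group spanning global ranks 4..7:
coord = ToyGroupCoordinator([4, 5, 6, 7], global_rank=5)
print(coord.world_size, coord.rank_in_group)  # 4 1
```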
@RunLLM Where do vllm use it when I enable DP/TP/EP? How is it initialized?
vLLM uses GroupCoordinator to manage communication groups for each parallelism type: data parallel (DP), tensor parallel (TP), and expert parallel (EP). When you enable DP, TP, or EP, vLLM initializes a GroupCoordinator for each group (e.g., TP group, DP group) during the distributed environment setup. This is done via functions like init_model_parallel_group, initialize_model_parallel, and related helpers, which create the necessary process groups and assign a GroupCoordinator to each, allowing collective operations within each parallel group.
Would you like more detail on the initialization flow or code examples?
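As a sketch of the initialization flow, here is one common way the global ranks are partitioned into TP and DP groups (the exact rank ordering in vLLM may differ; this only illustrates the partitioning that functions like initialize_model_parallel perform): TP groups are contiguous blocks of ranks, and DP groups collect the ranks that hold the same weight shard.

```python
# Sketch of how process groups can be formed for TP and DP
# (one common layout; vLLM's exact rank ordering may differ).

def build_groups(world_size: int, tp_size: int):
    assert world_size % tp_size == 0
    dp_size = world_size // tp_size
    # TP groups: contiguous blocks of ranks sharing one model replica.
    tp_groups = [list(range(i * tp_size, (i + 1) * tp_size))
                 for i in range(dp_size)]
    # DP groups: ranks holding the same weight shard, strided by tp_size.
    dp_groups = [list(range(i, world_size, tp_size))
                 for i in range(tp_size)]
    return tp_groups, dp_groups

tp_groups, dp_groups = build_groups(world_size=8, tp_size=4)
print(tp_groups)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(dp_groups)  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

Each inner list would then be handed to something like init_model_parallel_group, which wraps the resulting ProcessGroup in a GroupCoordinator.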
Enable EP by setting the --enable-expert-parallel flag. The EP size is automatically calculated as:
EP_SIZE = TP_SIZE × DP_SIZE
Where:
- TP_SIZE: Tensor parallel size (always 1 for now)
- DP_SIZE: Data parallel size
- EP_SIZE: Expert parallel size (computed automatically)
What if I set TP size not equal to 1?
How does vllm handle Attention Tensor Parallel?
If you set TP size not equal to 1, vLLM will use tensor parallelism for the attention layers, sharding their weights across TP_SIZE GPUs. For MoE models with --enable-expert-parallel, expert parallelism (EP) is applied to the expert (MoE) layers, while attention layers still use tensor parallelism. The EP size is calculated as TP_SIZE × DP_SIZE, and each parallelism type is managed independently for its respective layers. Attention tensor parallelism is fully supported and will function as expected when TP_SIZE > 1.
Would you like more detail on the interaction between TP and EP in MoE models?
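To illustrate the EP side of this split, here is a sketch (function name and layout are illustrative, not a vLLM API) of the simplest placement: logical experts divided into contiguous blocks, one block per EP rank, while attention weights follow the TP sharding described above.

```python
# Sketch: distributing MoE experts across EP ranks (illustrative only;
# vLLM's EPLB can also create redundant physical replicas of experts).

def experts_on_rank(num_experts: int, ep_size: int, ep_rank: int) -> list:
    """Contiguous block of logical expert ids owned by one EP rank."""
    assert num_experts % ep_size == 0
    per_rank = num_experts // ep_size
    start = ep_rank * per_rank
    return list(range(start, start + per_rank))

# 64 experts over EP_SIZE = 8 ranks -> 8 experts per rank.
print(experts_on_rank(64, 8, 0))  # [0, 1, 2, 3, 4, 5, 6, 7]
print(experts_on_rank(64, 8, 7))  # [56, 57, 58, 59, 60, 61, 62, 63]
```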
@RunLLM The phrase "always 1 for now" is confusing.
"Always 1 for now" means that in vLLM's current EP (expert parallel) implementation, TP_SIZE (the tensor parallel size) for the MoE expert layers can only be 1; the expert layers support only TP=1 and cannot be combined with TP>1. In that mode the attention layers are not sharded by TP either: each process holds a full copy of the attention weights, and only the MoE expert layers are sharded across GPUs by EP.
If you set TP_SIZE>1 with EP enabled, vLLM automatically forces TP_SIZE to 1 for the MoE expert layers and applies TP sharding only to the non-MoE layers such as attention. This guarantees correct expert placement while keeping TP effective for the attention layers. The restriction may be relaxed in the future.
Would you like a more detailed explanation of how TP and EP interact?
@RunLLM Why is EP_SIZE = TP_SIZE × DP_SIZE? Can you explain in detail?