Will Attention Layer Weights have multiple replicas?
Yes. In vLLM, whether attention layer weights have multiple replicas depends on the parallelism strategy. With tensor parallelism, the weights are sharded across GPUs; without it, each GPU holds a full replica. For expert parallelism (EP) in MoE models, a logical expert may have multiple physical replicas for load balancing; the mapping between logical and physical experts is managed explicitly, so copies of expert weights can exist on several devices for redundancy and load distribution. Standard attention layers (as opposed to MoE experts) are sharded or replicated depending on whether tensor parallelism is applied to them, as described in the vLLM documentation and discussions on model parallelism and expert parallelism.
Would you like more detail on how these replicas are managed or how to configure them?
Sources:
- Contributing: Model - Tensor Parallelism and Quantization Support
- Expert Parallelism Load Balancer (EPLB) State
- Discussion on Model Parallelism
Hint: Mention @RunLLM in the post for followups.
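To make the shard-vs-replica distinction concrete, here is a minimal sketch (not vLLM's actual code; the function name and shapes are illustrative) of how attention projection weights are partitioned under tensor parallelism: QKV projections are split along the output dimension (column-parallel), and the output projection along the input dimension (row-parallel). With TP=1 the "shard" is the whole matrix, i.e. a full replica on every GPU.

```python
# Hypothetical illustration (not vLLM's real code): per-GPU weight
# shapes for an attention block under tensor parallelism.

def attention_shard_shapes(hidden_size: int, tp_size: int) -> dict:
    """Return per-GPU shapes for the fused QKV projection
    (column-parallel) and the output projection (row-parallel)."""
    assert hidden_size % tp_size == 0
    shard = hidden_size // tp_size
    return {
        # QKV projections are split along the output dimension.
        "qkv_proj": (3 * shard, hidden_size),
        # The output projection is split along the input dimension.
        "o_proj": (hidden_size, shard),
    }

# TP=1: every GPU holds the full matrices, i.e. a replica.
print(attention_shard_shapes(4096, 1))  # {'qkv_proj': (12288, 4096), 'o_proj': (4096, 4096)}
# TP=4: each GPU holds a quarter shard; no full replica exists anywhere.
print(attention_shard_shapes(4096, 4))  # {'qkv_proj': (3072, 4096), 'o_proj': (4096, 1024)}
```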
@RunLLM What if tensor parallelism is used but I have more GPU than TP size? For example, TP=4 and I have 8 GPUs. Will that result in DP = 2?
No, setting tensor_parallel_size=4 on an 8-GPU node does not automatically enable data parallelism (DP=2). By default, only 4 GPUs will be used for tensor parallelism, and the other 4 GPUs will remain idle unless you explicitly configure data parallelism with --data-parallel-size=2 to utilize all 8 GPUs. vLLM does not infer or launch DP automatically based on excess GPUs; you must specify both TP and DP sizes to combine them and use all GPUs as intended.
Would you like more detail on how to configure both TP and DP together?
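The GPU-accounting above can be sanity-checked with a one-line sketch (the function name is made up for illustration; pipeline parallelism is omitted): the occupied world size is TP × DP, so TP=4 alone touches only 4 of 8 GPUs until DP=2 is set explicitly.

```python
def gpus_used(tp_size: int, dp_size: int = 1) -> int:
    """World size actually occupied: TP * DP (PP omitted for simplicity)."""
    return tp_size * dp_size

available = 8
# TP=4 with no explicit DP occupies only 4 GPUs; the other 4 sit idle.
assert gpus_used(tp_size=4) == 4
# Explicitly combining TP=4 with DP=2 uses all 8 GPUs.
assert gpus_used(tp_size=4, dp_size=2) == available
print("all GPUs used:", gpus_used(4, 2) == available)
```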
@RunLLM explain class GroupCoordinator to me
GroupCoordinator is a class in vLLM that wraps a PyTorch ProcessGroup, managing communication among a group of processes for distributed inference. It handles both CPU and device (GPU) communication, supports different backends (NCCL, Gloo, MPI), and provides methods for collective operations like all-reduce, all-gather, broadcast, and barrier. It also tracks group membership, rank, and world size within the group, and can manage device communicators for efficient tensor operations.
Would you like more detail on its attributes, methods, or usage?
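A stripped-down toy version (this is not vLLM's actual class, just the bookkeeping idea) shows what the rank/world-size tracking looks like: given the global ranks belonging to one group, the coordinator records the group's size and this process's position within it.

```python
# Toy sketch of the idea behind GroupCoordinator (not vLLM's real class):
# track membership of one process group so collectives can be scoped to it.

class ToyGroupCoordinator:
    def __init__(self, group_ranks: list, global_rank: int):
        self.ranks = group_ranks                      # global ranks in this group
        self.world_size = len(group_ranks)            # size of this group
        self.rank_in_group = group_ranks.index(global_rank)
        self.is_first_rank = self.rank_in_group == 0  # e.g. for broadcasts

# Global rank 5 inside a TP group spanning global ranks 4..7:
coord = ToyGroupCoordinator([4, 5, 6, 7], global_rank=5)
print(coord.world_size, coord.rank_in_group)  # 4 1
```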
@RunLLM Where do vllm use it when I enable DP/TP/EP? How is it initialized?
vLLM uses GroupCoordinator to manage communication groups for each parallelism type: data parallel (DP), tensor parallel (TP), and expert parallel (EP). When you enable DP, TP, or EP, vLLM initializes a GroupCoordinator for each group (e.g., TP group, DP group) during the distributed environment setup. This is done via functions like init_model_parallel_group, initialize_model_parallel, and related helpers, which create the necessary process groups and assign a GroupCoordinator to each, allowing collective operations within each parallel group.
Would you like more detail on the initialization flow or code examples?
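As a sketch of the initialization flow, here is one common way the global ranks are partitioned into TP and DP groups (the exact rank ordering in vLLM may differ; this only illustrates the partitioning that functions like initialize_model_parallel perform): TP groups are contiguous blocks of ranks, and DP groups collect the ranks that hold the same weight shard.

```python
# Sketch of how process groups can be formed for TP and DP
# (one common layout; vLLM's exact rank ordering may differ).

def build_groups(world_size: int, tp_size: int):
    assert world_size % tp_size == 0
    dp_size = world_size // tp_size
    # TP groups: contiguous blocks of ranks sharing one model replica.
    tp_groups = [list(range(i * tp_size, (i + 1) * tp_size))
                 for i in range(dp_size)]
    # DP groups: ranks holding the same weight shard, strided by tp_size.
    dp_groups = [list(range(i, world_size, tp_size))
                 for i in range(tp_size)]
    return tp_groups, dp_groups

tp_groups, dp_groups = build_groups(world_size=8, tp_size=4)
print(tp_groups)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(dp_groups)  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

Each inner list would then be handed to something like init_model_parallel_group, which wraps the resulting ProcessGroup in a GroupCoordinator.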
Enable EP by setting the --enable-expert-parallel flag. The EP size is automatically calculated as:
EP_SIZE = TP_SIZE × DP_SIZE
Where:
- TP_SIZE: Tensor parallel size (always 1 for now)
- DP_SIZE: Data parallel size
- EP_SIZE: Expert parallel size (computed automatically)
What if I set TP size not equal to 1?
How does vllm handle Attention Tensor Parallel?
If you set TP size not equal to 1, vLLM will use tensor parallelism for the attention layers, sharding their weights across TP_SIZE GPUs. For MoE models with --enable-expert-parallel, expert parallelism (EP) is applied to the expert (MoE) layers, while attention layers still use tensor parallelism. The EP size is calculated as TP_SIZE × DP_SIZE, and each parallelism type is managed independently for its respective layers. Attention tensor parallelism is fully supported and will function as expected when TP_SIZE > 1.
Would you like more detail on the interaction between TP and EP in MoE models?
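To illustrate the EP side of this split, here is a sketch (function name and layout are illustrative, not a vLLM API) of the simplest placement: logical experts divided into contiguous blocks, one block per EP rank, while attention weights follow the TP sharding described above.

```python
# Sketch: distributing MoE experts across EP ranks (illustrative only;
# vLLM's EPLB can also create redundant physical replicas of experts).

def experts_on_rank(num_experts: int, ep_size: int, ep_rank: int) -> list:
    """Contiguous block of logical expert ids owned by one EP rank."""
    assert num_experts % ep_size == 0
    per_rank = num_experts // ep_size
    start = ep_rank * per_rank
    return list(range(start, start + per_rank))

# 64 experts over EP_SIZE = 8 ranks -> 8 experts per rank.
print(experts_on_rank(64, 8, 0))  # [0, 1, 2, 3, 4, 5, 6, 7]
print(experts_on_rank(64, 8, 7))  # [56, 57, 58, 59, 60, 61, 62, 63]
```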
@RunLLM The phrase "always 1 for now" is confusing.
"Always 1 for now" means that in vLLM's current EP (expert parallel) implementation, TP_SIZE (the tensor parallel size) for the MoE expert layers can only be 1; the expert layers support only TP=1 and cannot be combined with TP>1. In that mode the attention layers are not sharded by TP either: each process holds a full copy of the attention weights, and only the MoE expert layers are sharded across GPUs by EP.
If you set TP_SIZE>1 with EP enabled, vLLM automatically forces TP_SIZE to 1 for the MoE expert layers and applies TP sharding only to the non-MoE layers such as attention. This guarantees correct expert placement while keeping TP effective for the attention layers. The restriction may be relaxed in the future.
Would you like a more detailed explanation of how TP and EP interact?
@RunLLM Why is EP_SIZE = TP_SIZE × DP_SIZE? Can you explain in detail?