The code snippet you described is for initializing the expert parallel (EP) group in vLLM’s distributed model parallelism. It ensures _EP is not already initialized, computes group_ranks by permuting and reshaping the all_ranks tensor, and then calls init_model_parallel_group to create the EP group with the specified ranks, local rank, backend, and group name "ep" (source, docs).
This approach is necessary to correctly group ranks for expert parallelism, especially when other parallelism dimensions (like context parallelism) are present, as discussed in recent pull requests (PR 26057, PR 26059). Would you like a step-by-step breakdown of how group_ranks is computed and used?
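The permute-and-reshape step can be illustrated with a small numpy sketch. This is a simplification of vLLM's torch-based code; the exact `(dp, pp, tp)` axis layout and group shapes here are assumptions for illustration, not quoted from the source:

```python
import numpy as np

# Hypothetical sizes: world_size = pp * dp * tp
tp, pp, dp = 2, 4, 1
world_size = tp * pp * dp

# Lay ranks out as (outer, dp, pp, tp), mirroring an all_ranks tensor
all_ranks = np.arange(world_size).reshape(-1, dp, pp, tp)

# Swap the dp and pp axes, then flatten each (dp * tp) slice into one row,
# so every EP group stays inside a single pipeline stage
ep_groups = all_ranks.transpose(0, 2, 1, 3).reshape(-1, dp * tp).tolist()
print(ep_groups)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```

Each row of `ep_groups` would then be passed to `init_model_parallel_group` as one expert-parallel group.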
Hint: Mention @RunLLM in the post for followups.
@RunLLM why ep_size = tp_size * dp_size ?
In vLLM, when expert parallelism (EP) is enabled, the expert parallel size (ep_size) is always set to the product of tensor parallel size (tp_size) and data parallel size (dp_size):
ep_size = tp_size * dp_size. This is because experts are distributed across all tensor and data parallel ranks, so that each device holds a distinct subset of the experts, maximizing hardware utilization and load balancing for Mixture-of-Experts (MoE) models (docs, API reference, source code).
This behavior is not currently configurable—EP size is always determined by the product of TP and DP, as confirmed in both documentation and user discussions (issue). Would you like a detailed step-by-step explanation of how this mapping works?
Sources:
- Expert Parallel Deployment docs
- FusedMoEParallelConfig.make API reference
- fused_moe/config.py source code
- vllm-project/vllm issue #27898
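The tp/dp-to-ep rank mapping can be sketched as follows. The `ep_rank` formula (`dp_rank * tp_size + tp_rank`) is an illustrative assumption about the ordering, not quoted from the vLLM source:

```python
def ep_rank(tp_rank: int, tp_size: int, dp_rank: int) -> int:
    # Every (dp_rank, tp_rank) pair maps to a unique expert-parallel rank
    return dp_rank * tp_size + tp_rank

tp_size, dp_size = 2, 2
ep_size = tp_size * dp_size  # 4 EP ranks in total
ranks = [ep_rank(t, tp_size, d) for d in range(dp_size) for t in range(tp_size)]
print(ep_size, ranks)  # 4 [0, 1, 2, 3]
```

The point is that the EP group simply enumerates every (dp, tp) combination, which is why its size is always the product of the two.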
@RunLLM If I have 8 GPUs configured as tp=2, pp=4 with enable-expert-parallel set, how are the experts partitioned for a model like DeepSeek with 256 experts?
In vLLM, with 8 GPUs configured as tp=2, pp=4 and expert parallelism (EP) enabled, each pipeline stage contains 2 tensor parallel ranks, for 4 stages in total. The EP group is formed within each pipeline stage, so ep_size = tp_size * dp_size = 2. Note that pipeline parallelism partitions layers, not experts: each stage holds a different subset of the model's layers, and within every MoE layer on a stage, all 256 experts are split across that stage's 2 EP ranks, 128 experts per GPU (linear assignment: rank 0 holds experts 0-127, rank 1 holds experts 128-255) (API docs, source code).
So under this 8-GPU layout, the 2 GPUs in each pipeline stage jointly hold all 256 experts of the MoE layers assigned to that stage; experts are never sharded across pipeline stages. Would you like a more detailed walkthrough of the assignment?
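The linear (contiguous) split can be sketched as follows, assuming ep_size = tp_size * dp_size = 2 for this configuration; `expert_range` is a hypothetical helper, not a vLLM API:

```python
def expert_range(num_experts: int, ep_size: int, ep_rank: int) -> tuple[int, int]:
    # Contiguous (linear) split of experts across the EP group:
    # rank r holds experts [r * per_rank, (r + 1) * per_rank)
    per_rank = num_experts // ep_size
    start = ep_rank * per_rank
    return start, start + per_rank

# tp=2, dp=1 -> ep_size = 2; 256 experts per MoE layer
for r in range(2):
    print(r, expert_range(256, 2, r))
# 0 (0, 128)
# 1 (128, 256)
```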
@RunLLM And what if enable-expert-parallel is not enabled? I see that _EP initialization still goes through ep_size = tp_size * dp_size.
If enable-expert-parallel is not set, expert parallelism (EP) does not take effect: the effective ep_size in the MoE config is 1. During initialization, _EP is still created with group size tp_size * dp_size, but with use_ep=False the experts are never partitioned across that group; instead, every rank keeps all experts, with each expert's weights sharded along the tensor parallel dimension as usual (API docs, source code).
So ep_size = tp_size * dp_size only describes the actual expert grouping when enable_expert_parallel=True; otherwise the computation still happens at initialization but has no effect on how experts are placed. Would you like a detailed walkthrough of the code path?
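This decision can be sketched as follows, loosely modeled on FusedMoEParallelConfig.make-style logic; the function name and return shape are assumptions for illustration:

```python
def moe_parallel_sizes(tp_size: int, dp_size: int, use_ep: bool) -> dict:
    if use_ep:
        # EP enabled: experts are sharded across all tp*dp ranks,
        # so each expert's weights are no longer split along TP
        return {"ep_size": tp_size * dp_size, "moe_tp_size": 1}
    # EP disabled: every rank keeps all experts, each expert's
    # weights sharded along the TP dimension as usual
    return {"ep_size": 1, "moe_tp_size": tp_size}

print(moe_parallel_sizes(2, 2, True))   # {'ep_size': 4, 'moe_tp_size': 1}
print(moe_parallel_sizes(2, 2, False))  # {'ep_size': 1, 'moe_tp_size': 2}
```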
@RunLLM Got it. Why do you always answer so quickly? Are you a bot?
