The code snippet you described is for initializing the expert parallel (EP) group in vLLM’s distributed model parallelism. It ensures _EP is not already initialized, computes group_ranks by permuting and reshaping the all_ranks tensor, and then calls init_model_parallel_group to create the EP group with the specified ranks, local rank, backend, and group name "ep" (source, docs).
This approach is necessary to correctly group ranks for expert parallelism, especially when other parallelism dimensions (like context parallelism) are present, as discussed in recent pull requests (PR 26057, PR 26059). Would you like a step-by-step breakdown of how group_ranks is computed and used?
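The permute-and-reshape step can be illustrated with a small numpy sketch. This is a simplification of vLLM's torch-based code; the exact `(dp, pp, tp)` axis layout and group shapes here are assumptions for illustration, not quoted from the source:

```python
import numpy as np

# Hypothetical sizes: world_size = pp * dp * tp
tp, pp, dp = 2, 4, 1
world_size = tp * pp * dp

# Lay ranks out as (outer, dp, pp, tp), mirroring an all_ranks tensor
all_ranks = np.arange(world_size).reshape(-1, dp, pp, tp)

# Swap the dp and pp axes, then flatten each (dp * tp) slice into one row,
# so every EP group stays inside a single pipeline stage
ep_groups = all_ranks.transpose(0, 2, 1, 3).reshape(-1, dp * tp).tolist()
print(ep_groups)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```

Each row of `ep_groups` would then be passed to `init_model_parallel_group` as one expert-parallel group.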
Hint: Mention @RunLLM in the post for followups.
@RunLLM why ep_size = tp_size * dp_size ?
In vLLM, when expert parallelism (EP) is enabled, the expert parallel size (ep_size) is always set to the product of tensor parallel size (tp_size) and data parallel size (dp_size):
ep_size = tp_size * dp_size. This is because experts are distributed across all tensor and data parallel ranks, so that each device holds a distinct subset of the experts, maximizing hardware utilization and load balancing for Mixture-of-Experts (MoE) models (docs, API reference, source code).
This behavior is not currently configurable—EP size is always determined by the product of TP and DP, as confirmed in both documentation and user discussions (issue). Would you like a detailed step-by-step explanation of how this mapping works?
Sources:
- Expert Parallel Deployment docs
- FusedMoEParallelConfig.make API reference
- fused_moe/config.py source code
- vllm-project/vllm issue #27898
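The tp/dp-to-ep rank mapping can be sketched as follows. The `ep_rank` formula (`dp_rank * tp_size + tp_rank`) is an illustrative assumption about the ordering, not quoted from the vLLM source:

```python
def ep_rank(tp_rank: int, tp_size: int, dp_rank: int) -> int:
    # Every (dp_rank, tp_rank) pair maps to a unique expert-parallel rank
    return dp_rank * tp_size + tp_rank

tp_size, dp_size = 2, 2
ep_size = tp_size * dp_size  # 4 EP ranks in total
ranks = [ep_rank(t, tp_size, d) for d in range(dp_size) for t in range(tp_size)]
print(ep_size, ranks)  # 4 [0, 1, 2, 3]
```

The point is that the EP group simply enumerates every (dp, tp) combination, which is why its size is always the product of the two.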
@RunLLM If I have 8 GPUs configured as tp=2, pp=4 with enable-expert-parallel set, how are the experts partitioned for a model like DeepSeek with 256 experts?
In vLLM, with 8 GPUs configured as tp=2, pp=4 and expert parallelism (EP) enabled, each pipeline stage contains 2 tensor parallel ranks, for 4 stages in total. The EP group is formed within each pipeline stage, so ep_size = tp_size * dp_size = 2. Note that pipeline parallelism partitions layers, not experts: each stage holds a different subset of the model's layers, and within every MoE layer on a stage, all 256 experts are split across that stage's 2 EP ranks, 128 experts per GPU (linear assignment: rank 0 holds experts 0-127, rank 1 holds experts 128-255) (API docs, source code).
So under this 8-GPU layout, the 2 GPUs in each pipeline stage jointly hold all 256 experts of the MoE layers assigned to that stage; experts are never sharded across pipeline stages. Would you like a more detailed walkthrough of the assignment?
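The linear (contiguous) split can be sketched as follows, assuming ep_size = tp_size * dp_size = 2 for this configuration; `expert_range` is a hypothetical helper, not a vLLM API:

```python
def expert_range(num_experts: int, ep_size: int, ep_rank: int) -> tuple[int, int]:
    # Contiguous (linear) split of experts across the EP group:
    # rank r holds experts [r * per_rank, (r + 1) * per_rank)
    per_rank = num_experts // ep_size
    start = ep_rank * per_rank
    return start, start + per_rank

# tp=2, dp=1 -> ep_size = 2; 256 experts per MoE layer
for r in range(2):
    print(r, expert_range(256, 2, r))
# 0 (0, 128)
# 1 (128, 256)
```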
@RunLLM And what if enable-expert-parallel is not enabled? I see that _EP initialization still goes through ep_size = tp_size * dp_size.
If enable-expert-parallel is not set, expert parallelism (EP) does not take effect: the effective ep_size in the MoE config is 1. During initialization, _EP is still created with group size tp_size * dp_size, but with use_ep=False the experts are never partitioned across that group; instead, every rank keeps all experts, with each expert's weights sharded along the tensor parallel dimension as usual (API docs, source code).
So ep_size = tp_size * dp_size only describes the actual expert grouping when enable_expert_parallel=True; otherwise the computation still happens at initialization but has no effect on how experts are placed. Would you like a detailed walkthrough of the code path?
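This decision can be sketched as follows, loosely modeled on FusedMoEParallelConfig.make-style logic; the function name and return shape are assumptions for illustration:

```python
def moe_parallel_sizes(tp_size: int, dp_size: int, use_ep: bool) -> dict:
    if use_ep:
        # EP enabled: experts are sharded across all tp*dp ranks,
        # so each expert's weights are no longer split along TP
        return {"ep_size": tp_size * dp_size, "moe_tp_size": 1}
    # EP disabled: every rank keeps all experts, each expert's
    # weights sharded along the TP dimension as usual
    return {"ep_size": 1, "moe_tp_size": tp_size}

print(moe_parallel_sizes(2, 2, True))   # {'ep_size': 4, 'moe_tp_size': 1}
print(moe_parallel_sizes(2, 2, False))  # {'ep_size': 1, 'moe_tp_size': 2}
```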
@RunLLM Got it. Why do you always answer so quickly? Are you a bot?
