Clarification: EP All-to-All Communication Across TP×DP — Diagram Validation

I’m trying to understand how vLLM implements Expert Parallelism.

My current understanding:

  • EP group spans TP × DP
  • Experts are globally sharded
  • Tokens are routed via all-to-all across all EP ranks

I created this diagram to validate that mental model.
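Concretely, my mental model is something like this toy sketch (pure Python; the names `owner_rank`, `experts_per_rank`, and the round-robin sharding are my own assumptions, not vLLM APIs):

```python
# Toy model of the EP group arithmetic in the bullets above.
# Assumption: EP spans TP x DP and experts are sharded evenly
# across EP ranks (illustrative only, not vLLM's actual code).

TP, DP, NUM_EXPERTS = 4, 2, 8
EP_SIZE = TP * DP  # EP group spans all GPUs: 8 ranks

# Each EP rank owns an equal slice of the experts.
experts_per_rank = NUM_EXPERTS // EP_SIZE  # 1 expert per rank here

def owner_rank(expert_id: int) -> int:
    """EP rank holding a given expert (hypothetical sharding scheme)."""
    return expert_id // experts_per_rank

# Tokens routed by the gate to experts 0..7 land on all 8 ranks,
# regardless of which DP replica produced them -- i.e. the
# all-to-all crosses DP boundaries.
routed = {e: owner_rank(e) for e in range(NUM_EXPERTS)}
print(routed)
```

If this arithmetic is right, every GPU participates in the all-to-all, which is what my diagram tries to show.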

Questions:

  1. Is the all-to-all correctly shown as spanning all GPUs?
  2. Is it accurate that DP boundaries are not respected during MoE routing?
  3. Am I missing any nuance (e.g., grouping, scheduling, batching)?

Would appreciate corrections.

The diagram’s description matches vLLM’s hybrid parallelism. With DP, TP, and EP combined, attention layers use tensor parallelism (TP) while MoE expert layers use expert parallelism (EP), with tokens dispatched all-to-all across GPUs. For TP=4, DP=2, and 8 experts, vLLM forms an EP group of size 8 (TP×DP) and shards the experts across all GPUs, while attention weights are sharded within each DP replica’s TP group (when TP>1). This matches vLLM’s documented behavior for MoE models with EP enabled, including the all-to-all communication for expert dispatch and output gathering, as described in the Expert Parallel Deployment guide and the API docs.
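As a rough illustration of that dispatch/combine pattern (not vLLM internals, which use collective kernels rather than Python dicts), the TP=4, DP=2, 8-expert case can be simulated like this:

```python
# Toy simulation of MoE all-to-all dispatch/combine across an
# EP group of 8 ranks (TP=4 x DP=2, 1 expert per rank).
# Illustrative sketch only; token ids and routing are made up.

EP_SIZE = 8

# Each rank holds some tokens; the gate assigns each token an expert.
# (token_id, expert_id) pairs, keyed by the rank that produced them.
tokens = {0: [("t0", 3), ("t1", 7)],
          5: [("t2", 0)]}

# Dispatch: the all-to-all sends each token to the rank that owns
# its expert. With 1 expert per rank, owner rank == expert id.
inbox = {r: [] for r in range(EP_SIZE)}
for src, toks in tokens.items():
    for tok, expert in toks:
        inbox[expert].append((tok, src))

# Combine: after expert compute, a second all-to-all returns each
# result to the rank that produced the token.
outbox = {r: [] for r in range(EP_SIZE)}
for dst, items in inbox.items():
    for tok, src in items:
        outbox[src].append(tok)

print(outbox[0])  # rank 0's tokens come back after expert compute
```

Note that rank 0 (DP replica 0) sends a token to rank 7 (DP replica 1), which is the sense in which DP boundaries are not respected during MoE routing.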

Would you like a step-by-step breakdown of how each parallelism mode is mapped in the diagram?
