Clarification: EP All-to-All Communication Across TP×DP — Diagram Validation

I’m trying to understand how vLLM implements Expert Parallelism.

My current understanding:

  • EP group spans TP × DP
  • Experts are globally sharded
  • Tokens are routed via all-to-all across all EP ranks

I created this diagram to validate that mental model.
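Concretely, my mental model is something like this toy sketch (pure Python; the names `owner_rank`, `experts_per_rank`, and the round-robin sharding are my own assumptions, not vLLM APIs):

```python
# Toy model of the EP group arithmetic in the bullets above.
# Assumption: EP spans TP x DP and experts are sharded evenly
# across EP ranks (illustrative only, not vLLM's actual code).

TP, DP, NUM_EXPERTS = 4, 2, 8
EP_SIZE = TP * DP  # EP group spans all GPUs: 8 ranks

# Each EP rank owns an equal slice of the experts.
experts_per_rank = NUM_EXPERTS // EP_SIZE  # 1 expert per rank here

def owner_rank(expert_id: int) -> int:
    """EP rank holding a given expert (hypothetical sharding scheme)."""
    return expert_id // experts_per_rank

# Tokens routed by the gate to experts 0..7 land on all 8 ranks,
# regardless of which DP replica produced them -- i.e. the
# all-to-all crosses DP boundaries.
routed = {e: owner_rank(e) for e in range(NUM_EXPERTS)}
print(routed)
```

If this arithmetic is right, every GPU participates in the all-to-all, which is what my diagram tries to show.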

Questions:

  1. Is the all-to-all correctly shown as spanning all GPUs?
  2. Is it accurate that DP boundaries are not respected during MoE routing?
  3. Am I missing any nuance (e.g., grouping, scheduling, batching)?

Would appreciate corrections.

The diagram’s description matches vLLM’s hybrid parallelism. With DP, TP, and EP combined, attention layers use tensor parallelism (TP) while MoE expert layers use expert parallelism (EP), with tokens dispatched all-to-all across GPUs. For TP=4, DP=2, and 8 experts, vLLM forms an EP group of size 8 (TP×DP) and shards the experts across all GPUs, while attention weights are sharded within each DP replica’s TP group (when TP>1). This matches vLLM’s documented behavior for MoE models with EP enabled, including the all-to-all communication for expert dispatch and output gathering, as described in the Expert Parallel Deployment guide and the API docs.
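As a rough illustration of that dispatch/combine pattern (not vLLM internals, which use collective kernels rather than Python dicts), the TP=4, DP=2, 8-expert case can be simulated like this:

```python
# Toy simulation of MoE all-to-all dispatch/combine across an
# EP group of 8 ranks (TP=4 x DP=2, 1 expert per rank).
# Illustrative sketch only; token ids and routing are made up.

EP_SIZE = 8

# Each rank holds some tokens; the gate assigns each token an expert.
# (token_id, expert_id) pairs, keyed by the rank that produced them.
tokens = {0: [("t0", 3), ("t1", 7)],
          5: [("t2", 0)]}

# Dispatch: the all-to-all sends each token to the rank that owns
# its expert. With 1 expert per rank, owner rank == expert id.
inbox = {r: [] for r in range(EP_SIZE)}
for src, toks in tokens.items():
    for tok, expert in toks:
        inbox[expert].append((tok, src))

# Combine: after expert compute, a second all-to-all returns each
# result to the rank that produced the token.
outbox = {r: [] for r in range(EP_SIZE)}
for dst, items in inbox.items():
    for tok, src in items:
        outbox[src].append(tok)

print(outbox[0])  # rank 0's tokens come back after expert compute
```

Note that rank 0 (DP replica 0) sends a token to rank 7 (DP replica 1), which is the sense in which DP boundaries are not respected during MoE routing.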

Would you like a step-by-step breakdown of how each parallelism mode is mapped in the diagram?
