The diagram’s description matches vLLM’s hybrid parallelism: when DP, TP, and EP are combined, attention layers use tensor parallelism (TP) while MoE expert layers use expert parallelism (EP), with tokens dispatched all-to-all across GPUs. With TP=4, DP=2, and 8 experts, vLLM forms an EP group of size 8 (TP×DP) and shards the experts across all 8 GPUs, while attention weights are sharded across the TP ranks within each DP replica whenever TP>1. This matches vLLM’s documented behavior for MoE models with EP enabled, including the all-to-all communication for expert dispatch and output gathering described in the Expert Parallel Deployment guide and API docs.
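As a rough illustration of that layout, here is a minimal sketch (the helper and its rank-to-expert mapping are hypothetical, not vLLM internals) showing how TP=4, DP=2, and 8 experts yield an EP group of size 8 with one expert per GPU, assuming a simple contiguous expert-sharding scheme:

```python
# Hypothetical sketch of the rank layout; vLLM's actual internals differ.
TP, DP, NUM_EXPERTS = 4, 2, 8
EP = TP * DP                         # EP group spans all GPUs: 4 x 2 = 8
EXPERTS_PER_GPU = NUM_EXPERTS // EP  # 8 // 8 = 1 expert per GPU

def rank_layout(rank: int) -> dict:
    """Map a global GPU rank to its TP/DP coordinates and expert shard."""
    tp_rank = rank % TP   # position inside the attention TP group
    dp_rank = rank // TP  # which data-parallel replica this rank belongs to
    # Contiguous expert assignment across the full EP group (an assumption).
    experts = list(range(rank * EXPERTS_PER_GPU, (rank + 1) * EXPERTS_PER_GPU))
    return {"tp_rank": tp_rank, "dp_rank": dp_rank, "experts": experts}

for r in range(EP):
    print(r, rank_layout(r))
```

Under this mapping, ranks 0–3 form one DP replica’s TP group and ranks 4–7 the other, yet every rank holds a distinct expert shard, which is why expert routing requires all-to-all communication across the whole EP group rather than staying inside a TP group.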
Would you like a step-by-step breakdown of how each parallelism mode is mapped in the diagram?