Expert Parallelism All-to-All Communication without NVLink and DeepEP

Hi, I would like to deploy DeepSeek-V2-Lite on my computer with 2 NVIDIA 3090 GPUs. There is no P2P communication channel between the GPUs. By default, vLLM seems to activate all-reduce/scatter for the MoE layers. However, is there a way to activate NCCL all-to-all communication, where each GPU only communicates the expert data it actually needs?

No, vLLM’s Expert Parallelism (EP) for MoE models like DeepSeek-V2-Lite requires a high-performance interconnect (such as NVLink or InfiniBand with IBGDA) to use the optimized all-to-all communication backends (e.g., DeepEP or PPLX). On systems without P2P (peer-to-peer) GPU communication, vLLM falls back to the “naive” backend, which uses broadcast-based communication rather than efficient NCCL all-to-all collectives. An NCCL-style all-to-all backend is neither available nor recommended for multi-GPU setups lacking direct P2P connectivity; it may hang or crash and is not officially supported in such environments. This limitation is documented in both the official docs and issue discussions.

If you attempt to use DeepEP or PPLX backends without IBGDA, vLLM will either fail to initialize or perform extremely poorly. There is currently no supported way to enable efficient NCCL all-to-all communication for MoE expert routing on consumer GPUs without P2P links; the naive backend is the only fallback, but it is not suitable for production or high-throughput use cases.
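For reference, a minimal sketch of how the backend is typically selected. The environment variable name and flag spellings below should be checked against your vLLM version's documentation; treat this as an illustration, not a guaranteed recipe:

```shell
# Sketch: vLLM selects its MoE all-to-all backend via an environment
# variable (commonly VLLM_ALL2ALL_BACKEND; verify against your version).
# "naive" is the broadcast-based fallback that works without P2P links.
export VLLM_ALL2ALL_BACKEND=naive

# Serve the model across both 3090s with expert parallelism enabled.
vllm serve deepseek-ai/DeepSeek-V2-Lite \
  --tensor-parallel-size 2 \
  --enable-expert-parallel
```

On hardware without P2P, leaving the default (naive) backend in place is the only configuration expected to run; setting a DeepEP or PPLX value here would hit the failure modes described above.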

Would you like more detail on alternative deployment strategies or configuration options?

Hint: Mention @RunLLM in the post for followups.

@RunLLM Will the naive all-reduce feature incur high communication costs even when there is no need for remote experts? Can you explain the mechanism a bit?

Yes, the naive all-reduce (broadcast-based) backend in vLLM incurs high communication costs even if no remote experts are needed. In the naive backend, all expert outputs are broadcast or all-reduced across all GPUs, regardless of whether a particular expert was actually used by a remote rank. As a result, every GPU communicates the full set of activations rather than just the tokens that require a remote expert, leading to unnecessary data transfer and high overhead, especially in multi-GPU setups without efficient P2P links. This mechanism is described in the vLLM expert parallel docs and confirmed in Slack discussions.
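A back-of-the-envelope sketch (not vLLM code; the sizing assumptions are illustrative) makes the cost difference concrete: the broadcast-style exchange moves the full activation tensor between ranks no matter what, while an ideal all-to-all ships only tokens routed to remote experts.

```python
# Illustrative comparison of per-GPU bytes sent per MoE layer.
# "Naive" = broadcast/all-reduce of the full tensor to every other rank.
# "All-to-all" = only tokens whose routed expert lives on another rank move.

def naive_bytes_per_gpu(tokens, hidden, dtype_bytes, world_size):
    # Every rank exchanges the full activation tensor with every other
    # rank, regardless of expert placement.
    return tokens * hidden * dtype_bytes * (world_size - 1)

def all_to_all_bytes_per_gpu(tokens, hidden, dtype_bytes, remote_fraction):
    # Only the fraction of tokens routed to remote experts leaves the GPU.
    return int(tokens * remote_fraction) * hidden * dtype_bytes

if __name__ == "__main__":
    tokens, hidden, fp16 = 4096, 2048, 2  # hypothetical batch/model sizes
    naive = naive_bytes_per_gpu(tokens, hidden, fp16, world_size=2)
    # Even when NO tokens need a remote expert (remote_fraction=0.0),
    # the naive exchange still moves the full tensor:
    ideal = all_to_all_bytes_per_gpu(tokens, hidden, fp16, remote_fraction=0.0)
    print(f"naive: {naive} bytes, ideal all-to-all: {ideal} bytes")
```

With these numbers, the naive backend sends ~16 MB per layer per GPU even when the ideal all-to-all would send nothing, which is why it scales poorly without fast interconnects.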

Would you like a step-by-step breakdown of how the naive backend works?
