Expert Parallelism All-to-All Communication without NVLink and DeepEP

Hi, I would like to deploy DeepSeek-V2-Lite on my computer with 2 NVIDIA 3090 GPUs. There is no P2P communication channel between the GPUs. By default, vLLM seems to activate all-reduce/scatter for the MoE layers. However, is there a way to activate NCCL all-to-all communication, where each GPU only communicates the expert data it actually needs?

No, vLLM’s Expert Parallelism (EP) for MoE models like DeepSeek-V2-Lite requires a high-performance interconnect (such as NVLink or InfiniBand with IBGDA) to use the optimized all-to-all communication backends (e.g., DeepEP or PPLX). On systems without P2P (peer-to-peer) GPU communication, vLLM falls back to the “naive” backend, which uses broadcast-based communication rather than efficient NCCL all-to-all collectives. An NCCL-style all-to-all backend is neither available nor recommended for multi-GPU setups lacking direct P2P connectivity; it may hang or crash and is not officially supported in such environments. This limitation is documented in both the official docs and issue discussions.

If you attempt to use DeepEP or PPLX backends without IBGDA, vLLM will either fail to initialize or perform extremely poorly. There is currently no supported way to enable efficient NCCL all-to-all communication for MoE expert routing on consumer GPUs without P2P links; the naive backend is the only fallback, but it is not suitable for production or high-throughput use cases.
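For reference, a minimal sketch of how the backend is typically selected. The environment variable name and flag spellings below should be checked against your vLLM version's documentation; treat this as an illustration, not a guaranteed recipe:

```shell
# Sketch: vLLM selects its MoE all-to-all backend via an environment
# variable (commonly VLLM_ALL2ALL_BACKEND; verify against your version).
# "naive" is the broadcast-based fallback that works without P2P links.
export VLLM_ALL2ALL_BACKEND=naive

# Serve the model across both 3090s with expert parallelism enabled.
vllm serve deepseek-ai/DeepSeek-V2-Lite \
  --tensor-parallel-size 2 \
  --enable-expert-parallel
```

On hardware without P2P, leaving the default (naive) backend in place is the only configuration expected to run; setting a DeepEP or PPLX value here would hit the failure modes described above.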

Would you like more detail on alternative deployment strategies or configuration options?

Hint: Mention @RunLLM in the post for followups.

@RunLLM Will the naive all-reduce feature incur high communication costs even when there is no need for remote experts? Can you explain the mechanism a bit?

Yes, the naive all-reduce (broadcast-based) backend in vLLM incurs high communication costs even if no remote experts are needed. In the naive backend, all expert outputs are broadcast or all-reduced across all GPUs, regardless of whether a particular expert was actually used by a remote rank. As a result, every GPU communicates the full set of activations rather than just the tokens that require a remote expert, leading to unnecessary data transfer and high overhead, especially in multi-GPU setups without efficient P2P links. This mechanism is described in the vLLM expert parallel docs and confirmed in Slack discussions.
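A back-of-the-envelope sketch (not vLLM code; the sizing assumptions are illustrative) makes the cost difference concrete: the broadcast-style exchange moves the full activation tensor between ranks no matter what, while an ideal all-to-all ships only tokens routed to remote experts.

```python
# Illustrative comparison of per-GPU bytes sent per MoE layer.
# "Naive" = broadcast/all-reduce of the full tensor to every other rank.
# "All-to-all" = only tokens whose routed expert lives on another rank move.

def naive_bytes_per_gpu(tokens, hidden, dtype_bytes, world_size):
    # Every rank exchanges the full activation tensor with every other
    # rank, regardless of expert placement.
    return tokens * hidden * dtype_bytes * (world_size - 1)

def all_to_all_bytes_per_gpu(tokens, hidden, dtype_bytes, remote_fraction):
    # Only the fraction of tokens routed to remote experts leaves the GPU.
    return int(tokens * remote_fraction) * hidden * dtype_bytes

if __name__ == "__main__":
    tokens, hidden, fp16 = 4096, 2048, 2  # hypothetical batch/model sizes
    naive = naive_bytes_per_gpu(tokens, hidden, fp16, world_size=2)
    # Even when NO tokens need a remote expert (remote_fraction=0.0),
    # the naive exchange still moves the full tensor:
    ideal = all_to_all_bytes_per_gpu(tokens, hidden, fp16, remote_fraction=0.0)
    print(f"naive: {naive} bytes, ideal all-to-all: {ideal} bytes")
```

With these numbers, the naive backend sends ~16 MB per layer per GPU even when the ideal all-to-all would send nothing, which is why it scales poorly without fast interconnects.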

Would you like a step-by-step breakdown of how the naive backend works?
