When testing the Qwen3 model on a single-node L20 (PCIe) system, if I want to use EP, can I use the High-Throughput kernel? Is it faster than the native all2all?
vLLM supports Expert Parallelism (EP) for MoE models like Qwen3, and it includes a custom allreduce kernel (sometimes described as the fast or high-throughput allreduce) that can improve performance, especially on NVLink systems. On PCIe-based systems like the L20, however, the gain from the custom allreduce kernel is limited, and in some cases it may not outperform the native NCCL all2all communication. The custom allreduce kernel is mainly a latency optimization and is most effective when direct peer-to-peer (P2P) access is available between GPUs, which is less efficient on PCIe-only setups. You can disable the custom allreduce kernel with the `--disable-custom-all-reduce` flag for benchmarking purposes and compare performance directly on your hardware.
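As a concrete starting point, a comparison could be set up with two server launches that differ only in the allreduce flag. This is a sketch, not a definitive recipe: the model name and tensor-parallel size below are placeholder assumptions for a 4-GPU L20 node, and you should substitute your own values.

```shell
# Run 1: custom allreduce enabled (the default behavior)
vllm serve Qwen/Qwen3-30B-A3B \
  --tensor-parallel-size 4 \
  --enable-expert-parallel

# Run 2: fall back to the NCCL path by disabling the custom allreduce kernel
vllm serve Qwen/Qwen3-30B-A3B \
  --tensor-parallel-size 4 \
  --enable-expert-parallel \
  --disable-custom-all-reduce
```

Keeping every other argument identical between the two runs isolates the communication path as the only variable being measured.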
According to user reports and vLLM contributors, NVLink-based systems see the largest improvements from the custom allreduce kernel, while PCIe systems may see little to no benefit, or even regressions in some scenarios. It is recommended to benchmark both options on your L20 system to determine which is faster for your specific workload and batch size. See the discussion and benchmarks in the "Custom all reduce kernels" PR and related comments for more details.
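For an offline throughput comparison, one option is the `benchmark_throughput.py` script shipped in the vLLM repository, which accepts engine arguments on the command line. The exact flags it forwards can vary between vLLM versions, and the input/output lengths and prompt count below are illustrative assumptions; adjust them to match your workload.

```shell
# Run 1: measure throughput with the custom allreduce kernel enabled (default)
python benchmarks/benchmark_throughput.py \
  --model Qwen/Qwen3-30B-A3B \
  --tensor-parallel-size 4 \
  --input-len 512 --output-len 128 --num-prompts 200

# Run 2: repeat the identical workload with the custom allreduce kernel disabled
python benchmarks/benchmark_throughput.py \
  --model Qwen/Qwen3-30B-A3B \
  --tensor-parallel-size 4 \
  --input-len 512 --output-len 128 --num-prompts 200 \
  --disable-custom-all-reduce
```

Comparing the reported requests/s and tokens/s between the two runs, ideally at several batch sizes, should show directly whether the custom kernel helps on your PCIe topology.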
Would you like a step-by-step guide on how to enable/disable the fast allreduce kernel and benchmark both options?
Sources: