When testing the Qwen3 model on a single-node L20 (PCIe) system, if I want to use EP, can I use the High-Throughput kernel? Is it faster than the native all2all?
vLLM supports Expert Parallelism (EP) for MoE models like Qwen3, and it includes a custom allreduce kernel (sometimes described as the fast or high-throughput allreduce) that can improve performance, especially on NVLink systems. On PCIe-based systems like the L20, however, the gain from the custom allreduce kernel is limited, and in some cases it may not outperform the native NCCL all2all communication. The custom allreduce kernel is mainly a latency optimization and is most effective when direct peer-to-peer (P2P) access is available between GPUs, which is less efficient on PCIe-only setups. You can disable the custom allreduce kernel with the `--disable-custom-all-reduce` flag for benchmarking purposes and compare performance directly on your hardware.
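As a concrete starting point, a comparison could be set up with two server launches that differ only in the allreduce flag. This is a sketch, not a definitive recipe: the model name and tensor-parallel size below are placeholder assumptions for a 4-GPU L20 node, and you should substitute your own values.

```shell
# Run 1: custom allreduce enabled (the default behavior)
vllm serve Qwen/Qwen3-30B-A3B \
  --tensor-parallel-size 4 \
  --enable-expert-parallel

# Run 2: fall back to the NCCL path by disabling the custom allreduce kernel
vllm serve Qwen/Qwen3-30B-A3B \
  --tensor-parallel-size 4 \
  --enable-expert-parallel \
  --disable-custom-all-reduce
```

Keeping every other argument identical between the two runs isolates the communication path as the only variable being measured.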
According to user reports and vLLM contributors, NVLink-based systems see the largest improvements from the custom allreduce kernel, while PCIe systems may see little to no benefit, or even regressions in some scenarios. It is recommended to benchmark both options on your L20 system to determine which is faster for your specific workload and batch size. See the discussion and benchmarks in the "Custom all reduce kernels" PR and related comments for more details.
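For an offline throughput comparison, one option is the `benchmark_throughput.py` script shipped in the vLLM repository, which accepts engine arguments on the command line. The exact flags it forwards can vary between vLLM versions, and the input/output lengths and prompt count below are illustrative assumptions; adjust them to match your workload.

```shell
# Run 1: measure throughput with the custom allreduce kernel enabled (default)
python benchmarks/benchmark_throughput.py \
  --model Qwen/Qwen3-30B-A3B \
  --tensor-parallel-size 4 \
  --input-len 512 --output-len 128 --num-prompts 200

# Run 2: repeat the identical workload with the custom allreduce kernel disabled
python benchmarks/benchmark_throughput.py \
  --model Qwen/Qwen3-30B-A3B \
  --tensor-parallel-size 4 \
  --input-len 512 --output-len 128 --num-prompts 200 \
  --disable-custom-all-reduce
```

Comparing the reported requests/s and tokens/s between the two runs, ideally at several batch sizes, should show directly whether the custom kernel helps on your PCIe topology.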
Would you like a step-by-step guide on how to enable/disable the fast allreduce kernel and benchmark both options?
Sources: