I deployed Qwen3-235B-A22B-FP8 on 8x H20-141GB GPUs. Initially, I used the parallelism strategy --tensor-parallel-size 8 --enable-expert-parallel. Through nsys profiling, I found a severe desynchronization problem in NCCL all-reduce, caused by unbalanced expert routing.
I noticed that vLLM v0.10.1 added EPLB support for Qwen3-MoE. So I tried the parameters --tensor-parallel-size 8 --enable-expert-parallel --enable-eplb --num-redundant-experts 16 --eplb-window-size 100 --eplb-step-interval 100. However, performance deteriorated.
The command is: vllm serve /data/models/Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 --served-model-name "Qwen3-235B-A22B-Instruct-2507-FP8" --max-model-len 16384 --trust-remote-code --gpu-memory-utilization 0.96 --tensor-parallel-size 8 --enable-expert-parallel --enable-eplb --num-redundant-experts 16 --eplb-window-size 100 --eplb-step-interval 100 --port 8112
Could you please advise on how to configure the EPLB parameters in the deployment of Qwen3-MOE to achieve better performance?
For Qwen3-MoE with EPLB, vLLM recommends using a larger --eplb-window-size (e.g., 1000) and a larger --eplb-step-interval (e.g., 3000) to reduce overhead, as frequent rebalancing can degrade performance. Also, for large-scale deployments, set --num-redundant-experts to 32 to ensure popular experts are always available. Your current settings (--eplb-window-size 100 --eplb-step-interval 100) are too aggressive and likely cause excessive synchronization and communication overhead, leading to worse performance.
Try the following configuration for better load balancing and throughput:
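A sketch of such an invocation, adapting the serve command from your question (model path, port, and non-EPLB flags are yours; the EPLB values are illustrative starting points, not a verified optimum):

```shell
vllm serve /data/models/Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 \
  --served-model-name "Qwen3-235B-A22B-Instruct-2507-FP8" \
  --max-model-len 16384 \
  --trust-remote-code \
  --gpu-memory-utilization 0.96 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --enable-eplb \
  --num-redundant-experts 32 \
  --eplb-window-size 1000 \
  --eplb-step-interval 3000 \
  --port 8112
```

Benchmark this against your EPLB-disabled baseline under your real traffic before adopting it.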
Thank you for your reply.
I used --enable-eplb
--num-redundant-experts 16
--eplb-window-size 1000
--eplb-step-interval 3000
The above parameters improved performance compared to before, but still did not surpass the performance with EPLB disabled. This is on the 8x H20-141GB environment. I think --num-redundant-experts 32 is too large for this setup, so I chose 16 instead.
Your observation is valid: while EPLB can improve load balance, it may introduce overhead, especially if the number of redundant experts is too high for your hardware or workload. The optimal value for --num-redundant-experts depends on the model, traffic pattern, and GPU memory; 16 is reasonable if 32 causes memory or performance issues. However, in some cases, disabling EPLB may still yield better throughput due to reduced synchronization and weight transfer overhead, especially on high-bandwidth, low-latency clusters or when expert imbalance is not severe.
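The memory cost of redundant experts can be sanity-checked with quick arithmetic. Assuming 128 routed experts per MoE layer for Qwen3-235B-A22B (verify against the model's config.json) spread across 8 GPUs with expert parallelism:

```shell
# Back-of-envelope expert placement on 8 GPUs.
# NUM_EXPERTS=128 is an assumption for Qwen3-235B-A22B; check config.json.
NUM_EXPERTS=128
NUM_GPUS=8

for REDUNDANT in 0 16 32; do
  PHYSICAL=$((NUM_EXPERTS + REDUNDANT))
  PER_GPU=$((PHYSICAL / NUM_GPUS))
  # Extra expert-weight memory relative to the no-redundancy baseline, percent.
  OVERHEAD=$((100 * REDUNDANT / NUM_EXPERTS))
  echo "redundant=$REDUNDANT physical_experts_per_gpu=$PER_GPU weight_overhead=${OVERHEAD}%"
done
```

Under these assumptions, 16 redundant experts means 18 physical experts per GPU (about 12% more expert-weight memory), while 32 means 20 per GPU (25% more), so preferring 16 on a memory-constrained 8-GPU node is a defensible choice.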
If EPLB does not outperform the baseline, you can try further tuning: reduce --num-redundant-experts (e.g., 8 or 12), increase --eplb-step-interval (e.g., 5000), or leave --eplb-log-balancedness off to avoid logging overhead. Ultimately, the best configuration is workload- and hardware-dependent, and some environments may benefit more from static expert assignment. For more details, see the Expert Parallel Load Balancer (EPLB) documentation.
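Concretely, one such lower-overhead trial might look like the following (only the EPLB-related flags are shown; keep the rest of your original serve command unchanged; the values 8 and 5000 are illustrative, not verified optima):

```shell
# Lower-overhead EPLB trial: fewer redundant experts, less frequent rebalancing.
vllm serve /data/models/Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --enable-eplb \
  --num-redundant-experts 8 \
  --eplb-window-size 1000 \
  --eplb-step-interval 5000
```

Compare each variant against the EPLB-disabled baseline under representative traffic, since EPLB's benefit depends on how skewed your expert routing actually is.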
Would you like a step-by-step breakdown of how to tune these parameters for your specific setup?