Using vllm/vllm-openai:cu130-nightly with 2x RTX 5090, trying to run Qwen/Qwen3.6-35B-A3B on vLLM 0.19.2, I'm getting this:
WARNING 04-19 09:31:42 [fused_moe.py:1090] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=256,device_name=NVIDIA_GeForce_RTX_5090,dtype=fp8_w8a8.json
This warning means vLLM is falling back to a generic (untuned) MoE kernel configuration for your RTX 5090, which can reduce throughput. To resolve it, tune the fused-MoE kernel for your hardware with the benchmark_moe.py script, then set the VLLM_TUNED_CONFIG_FOLDER environment variable to the directory containing the generated JSON config file. vLLM will then load the optimized config at startup, improving throughput and latency (Qwen3-Next Usage Guide, fused_moe.py source).
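A minimal sketch of that workflow, assuming a vLLM source checkout for the benchmark script; the script location, flag names, and output path may differ between vLLM versions, so check `python benchmark_moe.py --help` first:

```shell
# Clone vLLM to get the tuning script (it ships in the repo, not the wheel).
git clone https://github.com/vllm-project/vllm.git
cd vllm/benchmarks/kernels

# Tune the fused-MoE kernel for the shapes in the warning
# (fp8_w8a8 weights, tensor parallel size 2 across the two 5090s).
python benchmark_moe.py \
    --model Qwen/Qwen3.6-35B-A3B \
    --tp-size 2 \
    --dtype fp8_w8a8 \
    --tune

# The run should emit a JSON config named like the file from the warning:
#   E=256,N=256,device_name=NVIDIA_GeForce_RTX_5090,dtype=fp8_w8a8.json
# Point vLLM at the folder containing it, then serve as usual:
export VLLM_TUNED_CONFIG_FOLDER=/path/to/tuned/configs
vllm serve Qwen/Qwen3.6-35B-A3B --tensor-parallel-size 2
```

If the warning no longer appears on startup, the tuned config was picked up.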
Would you like detailed step-by-step instructions for tuning and deploying with the optimized config?