Running Qwen/Qwen3-Coder-30B-A3B-Instruct on an Nvidia GH200, I get the following warning:
WARNING 10-10 07:30:42 [fused_moe.py:798] Using default MoE config. Performance might be sub-optimal! Config file not found at ['/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=128,N=768,device_name=NVIDIA_GH200_144G_HBM3e.json']
It seems likely that the only difference between the GH200 and H200 for vLLM's MoE config is the device name string, not the underlying hardware performance or kernel requirements. According to a recent vLLM pull request, the H200 and H200 NVL (which are also very similar to the GH200) can share the same MoE config; the config file is keyed by device name for lookup convenience, not because the tuning parameters need to differ. The warning appears because the config loader searches for a file with the exact device name (NVIDIA_GH200_144G_HBM3e), does not find it, and does not fall back to the H200 config, even though the settings are likely compatible. There is no evidence in the data sources that a different config is required for GH200 vs H200; the duplication is mainly for file lookup, not for performance reasons.
If you want to avoid the warning and get optimal performance, you could copy or symlink the H200 config file to match the GH200 device name (a minimal sketch follows below), or adjust the config loading logic to fall back to H200 if GH200 is not found, as suggested in the vLLM repo discussions. Would you like more detail on how to do this or on the config file structure itself?
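For a non-containerized vLLM install, a minimal symlink sketch might look like this; the H200 source filename is an assumption, so check which H200 config with matching E and N values actually exists in the configs directory of your vLLM version:

cd /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/
# Assumed H200 filename; reuse its tuning parameters under the GH200 device name
ln -s E=128,N=768,device_name=NVIDIA_H200.json E=128,N=768,device_name=NVIDIA_GH200_144G_HBM3e.json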
@RunLLM any update on this? If it is not handled out of the box, how can I make my own config and use it with only the official vllm-openai Docker image, without installing the Python library directly?
As of now, vLLM does not automatically fall back to the H200 config for GH200 devices; you must provide a config file with the exact device name expected by vLLM. To resolve the warning when using the official vllm-openai Docker image, you can manually copy the H200 config file and rename it to match your GH200 device name (e.g., E=128,N=768,device_name=NVIDIA_GH200_144G_HBM3e.json) inside the container at the path /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/. This workaround is necessary because the config loader matches the device name string exactly and does not handle fallback automatically yet, as discussed in the vLLM repo PR and issues.
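For example, a minimal sketch of that copy against an already-running container; the container name and the H200 source filename are assumptions, so verify which H200 config file with matching E and N values actually ships in your image:

# Open a shell in the running container (container name is illustrative)
docker exec -it vllm-openai bash

# Inside the container: duplicate the H200 config under the GH200 device name
cd /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/
cp E=128,N=768,device_name=NVIDIA_H200.json E=128,N=768,device_name=NVIDIA_GH200_144G_HBM3e.json

Note that this change lives only in the running container and is lost when it is recreated, which is why the Dockerfile and volume-mount approaches below are usually preferable.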
To do this with the Docker image, you can use a custom Dockerfile based on the official image, add a COPY or RUN cp command to duplicate/rename the config, and then build and run your own image. There is no built-in CLI or environment variable to specify a custom MoE config path in the official Docker image; manual file placement is required. A sample Dockerfile sketch is shown below; would you like step-by-step instructions as well?
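A minimal Dockerfile sketch along those lines; the H200 source filename is an assumption, so verify the exact file name present in the image before building:

FROM vllm/vllm-openai:latest
# Reuse the H200 tuning parameters under the GH200 device name so the config loader finds them
RUN cp /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=128,N=768,device_name=NVIDIA_H200.json \
       /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=128,N=768,device_name=NVIDIA_GH200_144G_HBM3e.json

Build it with docker build -t vllm-openai-gh200 . and run it with the same arguments you use for the official image.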
It seems likely that you can use a Docker Compose file to mount your custom MoE config from a host directory into the correct path inside the vllm-openai container. This avoids rebuilding the image. You would first copy the H200 config file, rename it to match your GH200 device name, and then mount it into the container using a volume.
Here's a sample Docker Compose service definition (the service name, ports, model, and host paths are illustrative; adjust them to your setup):
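services:
  vllm:
    image: vllm/vllm-openai:latest
    ipc: host
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      # Mount the renamed MoE config over the exact path the config loader searches
      - ./custom_configs/E=128,N=768,device_name=NVIDIA_GH200_144G_HBM3e.json:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=128,N=768,device_name=NVIDIA_GH200_144G_HBM3e.json:ro
    command: --model Qwen/Qwen3-Coder-30B-A3B-Instruct
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]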
You must place your renamed config file at ./custom_configs/E=128,N=768,device_name=NVIDIA_GH200_144G_HBM3e.json on the host. Would you like more detail on preparing the config file or other Compose options?
To create your own MoE config for Qwen/Qwen3-Coder-Next-FP8 on NVIDIA_GH200_144G_HBM3e, use the benchmark_moe.py script provided by vLLM. This script benchmarks and tunes the MoE kernel for your hardware, generating a config file (e.g., E=128,N=768,device_name=NVIDIA_GH200_144G_HBM3e.json). There is no official step-by-step tutorial, but the process is documented in vLLM issues and PRs.
Example command (run inside a vLLM container or environment with your model and GPUs available; flags may vary by vLLM version, so check the script's --help output):
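python3 benchmarks/kernels/benchmark_moe.py \
    --model Qwen/Qwen3-Coder-Next-FP8 \
    --tp-size 1 \
    --tune \
    --save-dir /tmp/moe-configs/

The resulting file is JSON mapping batch sizes to Triton kernel parameters. The snippet below is only an illustrative sketch of the typical layout, with placeholder values rather than tuned numbers; compare it against an existing file in the configs/ directory of your vLLM version for the exact schema:

{
  "1": {
    "BLOCK_SIZE_M": 16,
    "BLOCK_SIZE_N": 64,
    "BLOCK_SIZE_K": 128,
    "GROUP_SIZE_M": 1,
    "num_warps": 4,
    "num_stages": 3
  },
  "64": {
    "BLOCK_SIZE_M": 64,
    "BLOCK_SIZE_N": 128,
    "BLOCK_SIZE_K": 64,
    "GROUP_SIZE_M": 8,
    "num_warps": 8,
    "num_stages": 4
  }
}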
After completion, copy the generated JSON config to the appropriate directory or mount it into your Docker container as described previously. Would you like more detail on the benchmarking process or troubleshooting tips?
Yes, you can run benchmark_moe.py from inside the official vllm-openai Docker image. The script is included in the image, so you just need to start the container with an interactive shell (overriding the image's default API-server entrypoint), then run the script with the appropriate arguments for your model and hardware.
Example:
# --entrypoint overrides the image's default API-server entrypoint to get a shell;
# the second -v mount keeps the tuned config on the host after the container exits
docker run --gpus all -it --ipc=host \
    --entrypoint bash \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -v "$(pwd)/moe-configs:/tmp/moe-configs" \
    vllm/vllm-openai:latest

# Inside the container:
python3 benchmarks/kernels/benchmark_moe.py \
    --model Qwen/Qwen3-Coder-Next-FP8 \
    --tp-size 1 \
    --tune \
    --save-dir /tmp/moe-configs/
Would you like more detail on mounting output directories or handling dependencies?