Running Qwen/Qwen3-Coder-30B-A3B-Instruct on an Nvidia GH200, I get the following warning:
WARNING 10-10 07:30:42 [fused_moe.py:798] Using default MoE config. Performance might be sub-optimal! Config file not found at ['/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=128,N=768,device_name=NVIDIA_GH200_144G_HBM3e.json']
It seems likely that the only difference between the GH200 and H200 for vLLM's MoE config is the device name string, not the underlying hardware performance or kernel requirements. According to a recent vLLM pull request, the H200 and H200 NVL (which are also very similar to the GH200) can share the same MoE config; the config file is keyed by device name for lookup convenience, not because the tuning parameters need to differ. The warning appears because the config loader searches for a file with the exact device name (NVIDIA_GH200_144G_HBM3e), does not find it, and does not fall back to the H200 config, even though the settings are likely compatible. There is no evidence in the data sources that a different config is required for GH200 vs H200; the duplication is mainly for file lookup, not for performance reasons.
If you want to avoid the warning and get optimal performance, you could copy or symlink the H200 config file to match the GH200 device name (a minimal sketch follows below), or adjust the config loading logic to fall back to H200 if GH200 is not found, as suggested in the vLLM repo discussions. Would you like more detail on how to do this or on the config file structure itself?
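For a non-containerized vLLM install, a minimal symlink sketch might look like this; the H200 source filename is an assumption, so check which H200 config with matching E and N values actually exists in the configs directory of your vLLM version:

cd /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/
# Assumed H200 filename; reuse its tuning parameters under the GH200 device name
ln -s E=128,N=768,device_name=NVIDIA_H200.json E=128,N=768,device_name=NVIDIA_GH200_144G_HBM3e.json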
@RunLLM any update on this? If it is not handled out of the box, how can I make my own config and use it with only the official vllm-openai Docker image, without installing the Python library directly?
As of now, vLLM does not automatically fall back to the H200 config for GH200 devices; you must provide a config file with the exact device name expected by vLLM. To resolve the warning when using the official vllm-openai Docker image, you can manually copy the H200 config file and rename it to match your GH200 device name (e.g., E=128,N=768,device_name=NVIDIA_GH200_144G_HBM3e.json) inside the container at the path /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/. This workaround is necessary because the config loader matches the device name string exactly and does not handle fallback automatically yet, as discussed in the vLLM repo PR and issues.
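For example, a minimal sketch of that copy against an already-running container; the container name and the H200 source filename are assumptions, so verify which H200 config file with matching E and N values actually ships in your image:

# Open a shell in the running container (container name is illustrative)
docker exec -it vllm-openai bash

# Inside the container: duplicate the H200 config under the GH200 device name
cd /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/
cp E=128,N=768,device_name=NVIDIA_H200.json E=128,N=768,device_name=NVIDIA_GH200_144G_HBM3e.json

Note that this change lives only in the running container and is lost when it is recreated, which is why the Dockerfile and volume-mount approaches below are usually preferable.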
To do this with the Docker image, you can use a custom Dockerfile based on the official image, add a COPY or RUN cp command to duplicate/rename the config, and then build and run your own image. There is no built-in CLI or environment variable to specify a custom MoE config path in the official Docker image; manual file placement is required. A sample Dockerfile sketch is shown below; would you like step-by-step instructions as well?
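A minimal Dockerfile sketch along those lines; the H200 source filename is an assumption, so verify the exact file name present in the image before building:

FROM vllm/vllm-openai:latest
# Reuse the H200 tuning parameters under the GH200 device name so the config loader finds them
RUN cp /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=128,N=768,device_name=NVIDIA_H200.json \
       /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=128,N=768,device_name=NVIDIA_GH200_144G_HBM3e.json

Build it with docker build -t vllm-openai-gh200 . and run it with the same arguments you use for the official image.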
It seems likely that you can use a Docker Compose file to mount your custom MoE config from a host directory into the correct path inside the vllm-openai container. This avoids rebuilding the image. You would first copy the H200 config file, rename it to match your GH200 device name, and then mount it into the container using a volume.
Here's a sample Docker Compose service definition (the service name, ports, model, and host paths are illustrative; adjust them to your setup):
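services:
  vllm:
    image: vllm/vllm-openai:latest
    ipc: host
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      # Mount the renamed MoE config over the exact path the config loader searches
      - ./custom_configs/E=128,N=768,device_name=NVIDIA_GH200_144G_HBM3e.json:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=128,N=768,device_name=NVIDIA_GH200_144G_HBM3e.json:ro
    command: --model Qwen/Qwen3-Coder-30B-A3B-Instruct
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]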
You must place your renamed config file at ./custom_configs/E=128,N=768,device_name=NVIDIA_GH200_144G_HBM3e.json on the host. Would you like more detail on preparing the config file or other Compose options?
To create your own MoE config for Qwen/Qwen3-Coder-Next-FP8 on NVIDIA_GH200_144G_HBM3e, use the benchmark_moe.py script provided by vLLM. This script benchmarks and tunes the MoE kernel for your hardware, generating a config file (e.g., E=128,N=768,device_name=NVIDIA_GH200_144G_HBM3e.json). There is no official step-by-step tutorial, but the process is documented in vLLM issues and PRs.
Example command (run inside a vLLM container or environment with your model and GPUs available; flags may vary by vLLM version, so check the script's --help output):
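python3 benchmarks/kernels/benchmark_moe.py \
    --model Qwen/Qwen3-Coder-Next-FP8 \
    --tp-size 1 \
    --tune \
    --save-dir /tmp/moe-configs/

The resulting file is JSON mapping batch sizes to Triton kernel parameters. The snippet below is only an illustrative sketch of the typical layout, with placeholder values rather than tuned numbers; compare it against an existing file in the configs/ directory of your vLLM version for the exact schema:

{
  "1": {
    "BLOCK_SIZE_M": 16,
    "BLOCK_SIZE_N": 64,
    "BLOCK_SIZE_K": 128,
    "GROUP_SIZE_M": 1,
    "num_warps": 4,
    "num_stages": 3
  },
  "64": {
    "BLOCK_SIZE_M": 64,
    "BLOCK_SIZE_N": 128,
    "BLOCK_SIZE_K": 64,
    "GROUP_SIZE_M": 8,
    "num_warps": 8,
    "num_stages": 4
  }
}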
After completion, copy the generated JSON config to the appropriate directory or mount it into your Docker container as described previously. Would you like more detail on the benchmarking process or troubleshooting tips?
Yes, you can run benchmark_moe.py from inside the official vllm-openai Docker image. The script is included in the image, so you just need to start the container with an interactive shell (overriding the image's default API-server entrypoint), then run the script with the appropriate arguments for your model and hardware.
Example:
# --entrypoint overrides the image's default API-server entrypoint to get a shell;
# the second -v mount keeps the tuned config on the host after the container exits
docker run --gpus all -it --ipc=host \
    --entrypoint bash \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -v "$(pwd)/moe-configs:/tmp/moe-configs" \
    vllm/vllm-openai:latest

# Inside the container:
python3 benchmarks/kernels/benchmark_moe.py \
    --model Qwen/Qwen3-Coder-Next-FP8 \
    --tp-size 1 \
    --tune \
    --save-dir /tmp/moe-configs/
Would you like more detail on mounting output directories or handling dependencies?