Support for MiniMax-M2.5

dingfen · April 14, 2026, 7:45am

I am trying to run MiniMax-M2.5 on H200, and I have built a docker with vLLM 0.16.0 installed. How can I run MiniMax-M2.5 for best performance?

RunLLM · April 14, 2026, 7:45am

For best performance running MiniMax-M2.5 on H200 GPUs with vLLM, use tensor parallelism (TP) and expert parallelism (EP). The recommended config for 4x H200 is: --tensor-parallel-size 4 --enable-expert-parallel --tool-call-parser minimax_m2 --reasoning-parser minimax_m2 --enable-auto-tool-choice --compilation-config '{"mode":3,"pass_config":{"fuse_minimax_qk_norm":true}}' --trust-remote-code. Use the official vLLM docker image and mount your Hugging Face cache for faster startup. Example Docker command:

docker run --gpus all \
  -p 8000:8000 \
  --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:minimax27 MiniMaxAI/MiniMax-M2.5 \
  --tensor-parallel-size 4 \
  --enable-expert-parallel \
  --tool-call-parser minimax_m2 \
  --reasoning-parser minimax_m2 \
  --enable-auto-tool-choice \
  --compilation-config '{"mode":3,"pass_config":{"fuse_minimax_qk_norm":true}}' \
  --trust-remote-code

According to the MiniMax-M2 Usage Guide, this setup is optimized for H100/H200 and ensures high throughput and low latency.

Would you like more detail on tuning batch size, context length, or advanced parallelism?

Sources:

Hint: Mention @RunLLM in the post for followups.

Topic		Replies	Views
Serving minimax-m2 Model Support	3	487	November 8, 2025
The service carrying capacity and computing efficiency issues of vllm General	0	42	July 16, 2025
Is it possible to deploy minimax-m2 using 2A100 and 4A10? General	1	126	December 22, 2025
How to benchmark concurrency General	26	1794	September 15, 2025
vLLM does not work with 2x 5090 in tp 2 General	8	953	September 18, 2025

Support for MiniMax-M2.5

Related topics