Serving MiniMax-M2

I downloaded MiniMax-M2 from Hugging Face to run offline. I followed the MiniMax-M2 recipe but cannot get it to work. I ran:

uv venv
source .venv/bin/activate
uv pip install 'triton-kernels @ git+https://github.com/triton-lang/triton.git@v3.5.0#subdirectory=python/triton_kernels' \
   vllm --extra-index-url https://wheels.vllm.ai/nightly --prerelease=allow

and then tried to serve the model:

SAFETENSORS_FAST_GPU=1 vllm serve /home/twhite/MODELS/MiniMaxAI/MiniMax-M2 --enable_expert_parallel --tensor-parallel-size 8 --enable-auto-tool-choice --tool-call-parser minimax_m2 --reasoning-parser minimax_m2_append_think
INFO 11-06 12:05:27 [__init__.py:216] Automatically detected platform cuda.
usage: vllm serve [model_tag] [options]
vllm serve: error: argument --reasoning-parser: invalid choice: 'minimax_m2_append_think' (choose from 'deepseek_r1', 'glm45', 'openai_gptoss', 'granite', 'hunyuan_a13b', 'mistral', 'qwen3', 'seed_oss', 'step3')

I am running 8 L40s, hence the command I am using.

What am I doing wrong?

The error occurs because your installed vLLM build does not register minimax_m2_append_think as a valid --reasoning-parser option. Your vLLM is outdated: it predates the nightly builds that added MiniMax-M2 support. The official MiniMax-M2 usage guide requires a vLLM version in which the minimax_m2_append_think reasoning parser is available.
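For context, the failure is just argparse's choices validation: the flag value is checked against whatever parser names that particular build registered. A minimal sketch of the same failure mode (the parser list here is an illustrative subset, not vLLM's actual registry):

```python
import argparse

# Mimic a CLI whose --reasoning-parser accepts only a fixed set of names,
# the way an older vLLM build would not yet list minimax_m2_append_think.
parser = argparse.ArgumentParser(prog="vllm serve")
parser.add_argument(
    "--reasoning-parser",
    choices=["deepseek_r1", "glm45", "qwen3"],  # illustrative subset
)

rejected = False
try:
    parser.parse_args(["--reasoning-parser", "minimax_m2_append_think"])
except SystemExit:
    # argparse prints "invalid choice: ..." to stderr and exits with status 2
    rejected = True

print("rejected:", rejected)
```

Upgrading to a build that registers the parser makes the same flag value pass validation.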

To fix this, ensure you install the latest vLLM nightly build from the correct index. Use:

uv pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly --extra-index-url https://download.pytorch.org/whl/cu129 --index-strategy unsafe-best-match

Then retry your serve command. This should add minimax_m2_append_think as a valid option. If you still see the error, verify your vLLM version and that the install succeeded. Would you like a step-by-step troubleshooting guide?
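After reinstalling, it's worth confirming which build actually landed in the venv. A small check script (the minimum version below is a placeholder assumption, not the recipe's real cutoff):

```python
# Check the installed vLLM build against a hypothetical minimum version.
import importlib.metadata
import re

def parse_version(v: str) -> tuple:
    """Extract the first three numeric components of a version string,
    so '0.11.1rc2.dev123' compares as (0, 11, 1)."""
    return tuple(int(x) for x in re.findall(r"\d+", v)[:3])

REQUIRED = "0.11.0"  # placeholder: use the version the MiniMax-M2 recipe requires

try:
    installed = importlib.metadata.version("vllm")
    ok = parse_version(installed) >= parse_version(REQUIRED)
    print(f"vllm {installed}: {'new enough' if ok else 'too old, reinstall from nightly'}")
except importlib.metadata.PackageNotFoundError:
    print("vllm is not installed in this environment")
```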

Hint: Mention @RunLLM in the post for followups.

Thank you, that works. But that command is very different from the one in your MiniMax-M2 recipe.

Is it required to have 4x H100 GPUs? When I run on 2x H100, I am getting:

[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
INFO 11-08 05:16:02 [parallel_state.py:1325] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 11-08 05:16:02 [parallel_state.py:1325] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
(Worker_TP0 pid=2134538) INFO 11-08 05:16:03 [gpu_model_runner.py:2944] Starting to load model MiniMaxAI/MiniMax-M2…
(Worker_TP1 pid=2134539) INFO 11-08 05:16:03 [cuda.py:409] Using Flash Attention backend.
(Worker_TP1 pid=2134539) WARNING 11-08 05:16:03 [fp8.py:165] DeepGEMM backend requested but not available.
(Worker_TP1 pid=2134539) INFO 11-08 05:16:03 [fp8.py:180] Using Triton backend for FP8 MoE
(Worker_TP0 pid=2134538) INFO 11-08 05:16:03 [cuda.py:409] Using Flash Attention backend.
(Worker_TP0 pid=2134538) WARNING 11-08 05:16:03 [fp8.py:165] DeepGEMM backend requested but not available.
(Worker_TP0 pid=2134538) INFO 11-08 05:16:03 [fp8.py:180] Using Triton backend for FP8 MoE
(Worker_TP1 pid=2134539) ERROR 11-08 05:16:04 [multiproc_executor.py:646] WorkerProc failed to start.

(Worker_TP1 pid=1970339) ERROR 11-08 03:38:24 [multiproc_executor.py:646] WorkerProc failed to start.
(Worker_TP1 pid=1970339) ERROR 11-08 03:38:24 [multiproc_executor.py:646] Traceback (most recent call last):
(Worker_TP1 pid=1970339) ERROR 11-08 03:38:24 [multiproc_executor.py:646] File "/home/ubuntu/workspace/karthik/minimax-m2/.venv/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 620, in worker_main
(Worker_TP1 pid=1970339) ERROR 11-08 03:38:24 [multiproc_executor.py:646] worker = WorkerProc(*args, **kwargs)
(Worker_TP1 pid=1970339) ERROR 11-08 03:38:24 [multiproc_executor.py:646] File "/home/ubuntu/workspace/karthik/minimax-m2/.venv/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 475, in __init__

resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
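For what it's worth, raw weight memory alone may explain the 2-GPU failure. A back-of-envelope sketch (the parameter count, FP8 storage, and per-GPU overhead below are assumptions, not measured numbers):

```python
# Back-of-envelope check: can N GPUs hold the model weights?
# Assumptions: ~230e9 total parameters stored in FP8 (1 byte each),
# 80 GB per H100, and ~10 GB per GPU reserved for activations,
# KV cache, and CUDA context.
PARAMS = 230e9          # assumed total parameter count for MiniMax-M2
BYTES_PER_PARAM = 1     # FP8 weights
GPU_MEM_GB = 80         # H100 80GB
OVERHEAD_GB = 10        # assumed non-weight usage per GPU

def fits(num_gpus: int) -> bool:
    per_gpu_weights_gb = PARAMS * BYTES_PER_PARAM / num_gpus / 1e9
    return per_gpu_weights_gb + OVERHEAD_GB <= GPU_MEM_GB

for n in (2, 4, 8):
    per_gpu = PARAMS / n / 1e9
    print(f"{n} GPUs: weights/GPU ~ {per_gpu:.0f} GB -> {'fits' if fits(n) else 'does not fit'}")
```

Under these assumptions, 2x 80 GB cards cannot hold the weights, which would make the workers die during model load rather than at startup argument parsing.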

@RunLLM