I have 2×A100 80G and 4×A10 across 2 servers. I want to deploy MiniMax-M2 (AWQ).
Any suggested way to deploy it?
Gemini suggested using PD disaggregation, but it seems that 4×A10 cannot hold all the weights.
You cannot deploy MiniMax-M2 (AWQ) on 4×A10 GPUs if their combined memory is insufficient to hold all model weights, even with prefill/decode (PD) disaggregation. vLLM requires that the total GPU memory across the devices serving a model instance is enough for its weights, and there is no documented support for offloading MiniMax-M2 weights to CPU or disk to bypass this limit. PD disaggregation separates the prefill and decode phases of inference onto different workers; it does not split model weights beyond available GPU memory, so each worker group still needs enough memory for a full copy of the model. The MiniMax-M2 usage guide's deployment examples use 4×A100/A800/H200/H20 and do not mention support for heterogeneous or memory-insufficient clusters.
To make the most of your hardware, you could deploy MiniMax-M2 on the 2×A100 80G (which is sufficient for the AWQ model) and use the 4×A10s for other, smaller models. There is no official documentation or example for combining A100s and A10s in a single vLLM deployment of MiniMax-M2, nor for using CPU offload to compensate for insufficient GPU memory on this model. Would you like more detail on PD disaggregation or alternative deployment strategies?
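As a rough sketch of the single-node option, the AWQ checkpoint could be served on the two A100s alone with tensor parallelism splitting the weights across both GPUs. The checkpoint name and context-length value below are assumptions — substitute the actual AWQ repo you are using:

```shell
# Hypothetical invocation: serve the AWQ build on the 2×A100 node only,
# sharding weights across both GPUs via tensor parallelism.
# "MiniMaxAI/MiniMax-M2" stands in for your actual AWQ checkpoint path.
vllm serve MiniMaxAI/MiniMax-M2 \
    --quantization awq \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 32768
```

With `--tensor-parallel-size 2`, each A100 holds roughly half the quantized weights plus KV cache; lowering `--max-model-len` frees cache memory if the model barely fits.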