MiniMax M3 support

Hi! MiniMax just released M3 (API is live, open weights coming in ~10 days).

The catch is that M3 moves away from full attention to a new mechanism they call
MSA (MiniMax Sparse Attention) — block-level sparse selection, but still on a GQA
backbone and on the real uncompressed K/V (not MLA-style latent compression).
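
To make "block-level sparse selection" concrete, here is a toy sketch of how I picture it. Everything in it is my own assumption for illustration, not MiniMax's published algorithm: the mean-pooled block summary, the dot-product score, and the names `select_blocks`, `block_size`, and `top_k` are all hypothetical. On a GQA backbone I would expect something like this to run once per KV head, shared by the query heads in that group, but that is a guess too.

```python
from typing import List

Vector = List[float]


def mean_pool(block: List[Vector]) -> Vector:
    """Average the key vectors of one block into a single summary vector."""
    dim = len(block[0])
    return [sum(vec[d] for vec in block) / len(block) for d in range(dim)]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def select_blocks(query: Vector, keys: List[Vector],
                  block_size: int, top_k: int) -> List[int]:
    """Return the indices of the top_k key blocks for this query.

    ASSUMPTION: a block is scored by query . mean(keys in block).
    This is a stand-in scoring rule, not the real MSA selector.
    """
    # Split the uncompressed keys into contiguous blocks (last one may be short).
    blocks = [keys[i:i + block_size] for i in range(0, len(keys), block_size)]
    scores = [dot(query, mean_pool(b)) for b in blocks]
    # Rank blocks by score, keep the best top_k, return them in position order.
    ranked = sorted(range(len(blocks)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])
```

Attention would then only read the K/V rows inside the returned blocks, which is where the sparse kernel comes in.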

vLLM gave day-0 support for the M2 series since those were full-attention GQA and
mapped onto the existing FlashAttention kernels. M3 is different because MSA needs
the block-selection step plus their “KV outer gather Q” sparse kernel.
(MiniMax M3: Frontier Coding, 1M Context, Native Multimodality — All in One Model - MiniMax Research | MiniMax)

So mostly wondering:

  • Is M3/MSA on the roadmap? Any tracking issue or PR yet?
  • Rough timeline once the weights drop?
  • Since it stays on GQA, do you expect it to reuse the existing backend with a
    selection pre-pass, or need a dedicated one?
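
To be clear about what I mean by "selection pre-pass" in the last question: gather the selected K/V blocks into a contiguous buffer, then hand that to an ordinary dense attention routine. A minimal single-query sketch follows. The helper names (`gather_blocks`, `dense_attention`, `sparse_attention_via_prepass`) are hypothetical, the dense routine is a plain softmax stand-in rather than any vLLM or FlashAttention API, and `block_ids` is assumed to come from some separate selector.

```python
import math
from typing import List

Vector = List[float]


def gather_blocks(rows: List[Vector], block_ids: List[int],
                  block_size: int) -> List[Vector]:
    """Copy the rows of the selected blocks into one contiguous list."""
    out: List[Vector] = []
    for b in block_ids:
        out.extend(rows[b * block_size:(b + 1) * block_size])
    return out


def dense_attention(query: Vector, keys: List[Vector],
                    values: List[Vector]) -> Vector:
    """Ordinary scaled dot-product attention for a single query."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(logits)  # subtract the max for a numerically stable softmax
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def sparse_attention_via_prepass(query: Vector, keys: List[Vector],
                                 values: List[Vector], block_ids: List[int],
                                 block_size: int) -> Vector:
    """Pre-pass option: gather selected K/V blocks, then run dense attention."""
    k_sel = gather_blocks(keys, block_ids, block_size)
    v_sel = gather_blocks(values, block_ids, block_size)
    return dense_attention(query, k_sel, v_sel)
```

The gather step is an extra copy of the selected K/V for every query, which is the cost I would expect a dedicated sparse kernel to avoid by reading the selected blocks in place.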

Happy to help test on H200 (incl. long-context KV footprint) once weights are out.
Thanks for all the work on vLLM!
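
P.S. For the long-context test, this is the back-of-the-envelope KV footprint formula I would be checking measurements against. Since K/V stay uncompressed, I am assuming sparse selection cuts what attention reads per query but not what has to be stored. The function name and the example config below are placeholders of mine, not M3's actual architecture numbers.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to hold uncompressed K and V for one sequence.

    The leading factor 2 covers the two tensors (K and V); bytes_per_elem=2
    corresponds to a 16-bit dtype such as bf16 or fp16.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# PLACEHOLDER config for illustration only (not M3's real values):
# 60 layers, 8 KV heads, head_dim 128, one 1M-token sequence in 16-bit.
example_bytes = kv_cache_bytes(num_layers=60, num_kv_heads=8,
                               head_dim=128, seq_len=1_000_000)
print(f"{example_bytes / 1e9:.2f} GB")
```

Once the real config is published, swapping in the actual layer count, KV head count, and head dimension gives the number to compare against what vLLM reports per sequence.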

MiniMax M3’s MSA (MiniMax Sparse Attention) is not yet supported in vLLM, but sparse/block attention is an active roadmap item. The vLLM team has discussed per-role and backend-specific attention configuration as a way to support new sparse mechanisms like MSA, and recent RFCs and design docs explicitly mention opening a path for sparse attention backends, including block-level sparse selection and backend-specific parameters.

As of now there is no public tracking issue or merged PR specifically for M3/MSA. Support will likely require a dedicated backend rather than a selection pre-pass on the existing GQA kernels, because MSA needs its own block-selection logic and the “KV outer gather Q” kernel pattern (MiniMax M3 blog, vLLM-Omni RFC, vLLM Roadmap Q2 2025).

The vLLM team is actively working on infrastructure for sparse attention and hybrid memory allocation, which are prerequisites for MSA support. Once M3 weights are released, expect rapid progress, but a new backend implementation will likely be needed for full performance and correctness. Would you like a step-by-step breakdown of the technical requirements and current design proposals for sparse attention in vLLM?
