Why does V1 not support Mamba models?

Hi team,
While reading the documentation, I noticed that V1 currently does not support Mamba models because they use selective state-space mechanisms rather than standard transformer attention. I'm curious — what would be the main technical challenges or limitations in adapting Mamba models (e.g., MambaForCausalLM, JambaForCausalLM) to be compatible with V1?

Here is the excerpt from the documentation:

Mamba Models
Models using selective state-space mechanisms (instead of standard transformer attention) are not yet supported (e.g., MambaForCausalLM, JambaForCausalLM).

Thanks in advance for your insights!

It seems likely that the main technical challenge stems from how these models manage state. vLLM V1 is architected around decoder-only transformers, where each request accumulates a per-token KV cache that grows with the sequence and that V1 manages in fixed-size blocks. Models built on selective state-space mechanisms, such as MambaForCausalLM and JambaForCausalLM, instead carry a constant-size recurrent state per layer that is updated in place at every decoding step, so V1's block-based cache management and scheduling logic do not map onto them directly. Supporting them would require specialized handling of these internal states and likely some architectural changes in V1.
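To make that difference concrete, here is a minimal sketch (plain PyTorch, not vLLM code; the shapes and the "selective" update rule are illustrative only) contrasting what has to be kept per request: a transformer layer appends to a KV cache that grows with every decoded token, while a Mamba-style layer updates a fixed-size recurrent state in place.

```python
import torch

batch, d_model, d_state = 1, 16, 4

# Transformer attention: the KV cache gains one entry per decoded token,
# which is the growth pattern vLLM's paged KV-cache management is built around.
kv_cache_k, kv_cache_v = [], []
for step in range(8):
    k = torch.randn(batch, d_model)
    v = torch.randn(batch, d_model)
    kv_cache_k.append(k)
    kv_cache_v.append(v)
print("KV cache entries after 8 steps:", len(kv_cache_k))  # 8, and growing

# Selective SSM (Mamba-style, heavily simplified): the per-request state has a
# constant size no matter how many tokens have been processed.
ssm_state = torch.zeros(batch, d_model, d_state)
for step in range(8):
    x = torch.randn(batch, d_model)
    # Hypothetical input-dependent ("selective") decay and input gates.
    decay = torch.sigmoid(torch.randn(batch, d_model, d_state))
    gate = torch.randn(batch, d_model, d_state)
    ssm_state = decay * ssm_state + gate * x.unsqueeze(-1)
print("SSM state shape after 8 steps:", tuple(ssm_state.shape))  # constant
```

Because the SSM state never grows, the token-granular paged KV cache that V1 is built around does not apply to these layers as-is.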

My understanding is that the vLLM team is working toward supporting a broader range of models, including those with hybrid or non-standard attention mechanisms, in future updates. This involves developing new memory management and cache handling strategies that can efficiently accommodate the fixed-size recurrent state these models keep alongside, or instead of, a KV cache. For more details, you might want to check the vLLM documentation and the related GitHub issues tracking this work.
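As a rough illustration of what such memory management strategies might involve, here is a hypothetical sketch; the class and method names (PagedKVManager, MambaStateManager, append_token, add_request) are invented for this example and are not vLLM's actual scheduler or cache-manager API. It contrasts block-granular allocation for a growing KV cache with request-granular allocation of a constant-size state slot.

```python
class PagedKVManager:
    """Toy block-granular allocator for a growing transformer KV cache."""

    def __init__(self, num_blocks: int, block_size: int):
        self.free_blocks = list(range(num_blocks))
        self.block_size = block_size
        self.block_tables: dict[int, list[int]] = {}

    def append_token(self, request_id: int, num_tokens_so_far: int) -> None:
        # Allocate a new block whenever the sequence crosses a block boundary.
        table = self.block_tables.setdefault(request_id, [])
        if num_tokens_so_far % self.block_size == 0:
            table.append(self.free_blocks.pop())


class MambaStateManager:
    """Toy request-granular allocator for fixed-size recurrent state."""

    def __init__(self, max_running_requests: int):
        self.free_slots = list(range(max_running_requests))
        self.slot_of: dict[int, int] = {}

    def add_request(self, request_id: int) -> None:
        # One constant-size state slot per request, independent of length.
        self.slot_of[request_id] = self.free_slots.pop()

    def step(self, request_id: int) -> None:
        # Decoding another token updates the state in place; nothing grows.
        pass
```

Decisions V1 currently makes at block granularity (allocation, eviction, prefix caching) would presumably need request-level, in-place-state counterparts for Mamba layers, which is likely part of why support is still pending.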

Would you like more detail on any specific aspect?
