(APIServer pid=22031) WARNING 03-17 22:16:08 [config.py:381] Mamba cache mode is set to 'align' for Qwen3_5ForConditionalGeneration by default when prefix caching is enabled
(APIServer pid=22031) INFO 03-17 22:16:08 [config.py:401] Warning: Prefix caching in Mamba cache 'align' mode is currently enabled. Its support for Mamba layers is experimental. Please report any issues you may observe.

In this log, what is Mamba?
Mamba is a type of neural network layer used in some large language models (LLMs) as an alternative or complement to traditional attention mechanisms. Mamba layers implement state space models (SSMs), which can efficiently model long-range dependencies in sequences, and they are often used in "hybrid" architectures alongside attention layers. In vLLM, Mamba layers require special handling for caching and prefix reuse, which is why you see experimental-support warnings when enabling prefix caching with Mamba-based models such as Qwen3.5. For these models, the "align" cache mode is used by default when prefix caching is enabled, but support is still under development and may not be fully stable or optimized yet. See the MambaBase documentation and the Qwen3.5 configuration tips for more details.
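For intuition, the core of an SSM layer can be sketched as a per-token recurrence over a fixed-size hidden state. This is a toy linear SSM in plain NumPy, not the actual Mamba kernel (real Mamba uses input-dependent "selective" parameters and a hardware-aware parallel scan); it only illustrates why the per-sequence cache is constant-size, unlike attention's growing KV cache:

```python
# Toy linear state-space recurrence (illustrative sketch, not Mamba itself).
import numpy as np

def ssm_scan(x, A, B, C):
    """Compute y_t = C @ h_t where h_t = A @ h_{t-1} + B * x_t."""
    h = np.zeros(A.shape[0])  # fixed-size state: this is all that must be cached
    ys = []
    for x_t in x:             # O(L) sequential scan over the token sequence
        h = A @ h + B * x_t   # state update; state size does not grow with L
        ys.append(C @ h)      # readout for this position
    return np.array(ys)

rng = np.random.default_rng(0)
A = np.diag(rng.uniform(0.5, 0.95, size=4))  # stable per-channel decay
B = rng.normal(size=4)
C = rng.normal(size=4)
y = ssm_scan(rng.normal(size=16), A, B, C)
print(y.shape)  # one output per input token
```

Because the recurrent state `h` is a fixed-size vector rather than a per-token KV cache, reusing a cached prefix means restoring the state at the prefix boundary, which is why prefix caching needs the special handling the warning refers to.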
Would you like a deeper technical explanation of how Mamba layers work or how they interact with caching in vLLM?
Sources:
- https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/mamba/abstract/
- Qwen3.5 configuration tips
Hint: Mention @RunLLM in the post for followups.
Does GLM5 include Mamba layers?