What is a hybrid model?
A hybrid model in vLLM is a neural network architecture that combines attention-based (Transformer) layers with state-space model (SSM) layers such as Mamba, typically alternating or interleaving the two types of layers to leverage the strengths of each within a single network. Examples supported in vLLM include Jamba, Bamba, Nemotron-H, GraniteMoeHybrid, and FalconH1; each implements the IsHybrid interface, which indicates that the model's configuration contains both attention and Mamba (or similar) blocks (interfaces documentation, Bamba, Nemotron-H, GraniteMoeHybrid, FalconH1).
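As a rough illustration of the "alternating or interleaving" layout, here is a minimal sketch of how a hybrid model's layer stack might be described. The function name, the `"attention"`/`"mamba"` labels, and the one-attention-layer-per-N pattern are all illustrative assumptions (models such as Jamba interleave attention among Mamba layers, but the exact pattern varies per model), and this is not vLLM's internal API:

```python
# Hedged sketch: lay out a hybrid layer stack where every
# `attn_period`-th layer is attention and the rest are Mamba (SSM)
# layers. Illustrative only; real model configs differ.

def hybrid_layer_types(num_layers: int, attn_period: int) -> list[str]:
    """Return the layer type ('attention' or 'mamba') for each index."""
    return [
        "attention" if i % attn_period == attn_period - 1 else "mamba"
        for i in range(num_layers)
    ]

# An 8-layer stack with attention every 4th layer:
print(hybrid_layer_types(8, 4))
# → ['mamba', 'mamba', 'mamba', 'attention',
#    'mamba', 'mamba', 'mamba', 'attention']
```

In practice, the per-layer type is what an IsHybrid model exposes through its configuration so that the engine can allocate KV cache for attention layers and SSM state for Mamba layers.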
Would you like more detail on how hybrid models are implemented or used in vLLM?