Let’s say we have a vision-language-style model: the vision model takes multimodal input and generates an embedding, and this multimodal embedding is fed into the language model as cross-attention context.
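For concreteness, here is a minimal PyTorch sketch of that setup (illustrative only, not vLLM internals; the layer and variable names are hypothetical) where the multimodal embedding supplies the keys/values of a decoder layer's cross-attention:

```python
import torch
import torch.nn as nn

class CrossAttentionDecoderLayer(nn.Module):
    """Illustrative decoder layer: text hidden states attend over a
    pre-computed multimodal embedding used as cross-attention context."""

    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(hidden_size)
        self.norm2 = nn.LayerNorm(hidden_size)

    def forward(self, text_hidden: torch.Tensor, mm_embedding: torch.Tensor) -> torch.Tensor:
        # text_hidden:  (batch, seq_len, hidden_size) from the language model
        # mm_embedding: (batch, feature_size, hidden_size) from the vision encoder
        x, _ = self.self_attn(text_hidden, text_hidden, text_hidden)
        x = self.norm1(text_hidden + x)
        # The multimodal embedding provides the keys/values (the cross-attention context).
        y, _ = self.cross_attn(x, mm_embedding, mm_embedding)
        return self.norm2(x + y)
```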
vLLM currently supports passing pre-computed multimodal embeddings (e.g., image, video, or audio embeddings) directly to the language model as input, by providing a tensor of shape (num_items, feature_size, hidden_size) in the appropriate field of the multi-modal data dictionary. However, vLLM does not natively support models where the multimodal embedding is used as cross-attention context in an encoder-decoder setup; current support is limited to decoder-only models, where the embeddings are merged with the text embeddings rather than used as cross-attention keys/values. This limitation is mentioned explicitly in the documentation and codebase, and multimodal encoder-decoder models with cross-attention are not yet supported in vLLM’s inference engine (see here, see here).
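As a reference point for the decoder-only path that is supported, here is a minimal sketch of passing pre-computed image embeddings via `multi_modal_data`. The model name, feature size, and hidden size are illustrative assumptions, and some models require extra metadata alongside the raw embedding tensor, so check the multimodal inputs docs for your model and vLLM version:

```python
import torch
from vllm import LLM, SamplingParams

# Illustrative model choice; any decoder-only VLM that accepts image embeddings works similarly.
llm = LLM(model="llava-hf/llava-1.5-7b-hf")

# Pre-computed embedding of shape (num_items, feature_size, hidden_size);
# 576 patches x 4096 hidden dims matches LLaVA-1.5-7B (assumed here).
image_embeds = torch.rand(1, 576, 4096)

outputs = llm.generate(
    {
        "prompt": "USER: <image>\nWhat is in the image? ASSISTANT:",
        "multi_modal_data": {"image": image_embeds},
    },
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```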
Would you like more detail on how to implement or extend vLLM for this use case?
We support Llama 3.2 and Whisper, which are multimodal encoder-decoder models. However, they are currently only available in the V0 engine.
Correct. vLLM V1 does not yet support multimodal encoder-decoder models like Llama 3.2 and Whisper; these are only available in the V0 engine. V1 is currently optimized for decoder-only transformers, and support for encoder-decoder models with cross-attention is planned but not yet implemented (see docs, see roadmap).
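If you need to run one of these encoder-decoder models today, a minimal sketch is to force the V0 engine before constructing the model. The `VLLM_USE_V1` environment variable toggle is an assumption based on recent vLLM releases, so verify it against your version's docs:

```python
import os

# Force the V0 engine, since encoder-decoder multimodal models (e.g., Whisper,
# Llama 3.2 Vision) are not yet supported on V1. Set this before importing vllm.
os.environ["VLLM_USE_V1"] = "0"

from vllm import LLM

# Whisper's decoder context is capped at 448 tokens; audio is then supplied at
# generation time via multi_modal_data={"audio": ...} in the usual multimodal API.
llm = LLM(model="openai/whisper-large-v3", max_model_len=448)
```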
Would you like more detail on the current limitations or the migration roadmap?