Let’s say we have a vision-language-style model: the vision model takes multimodal input and generates an embedding, and this multimodal embedding is fed into the language model as cross-attention context.
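For concreteness, here is a minimal PyTorch sketch of that setup (illustrative only, not vLLM internals; the layer and variable names are hypothetical) where the multimodal embedding supplies the keys/values of a decoder layer's cross-attention:

```python
import torch
import torch.nn as nn

class CrossAttentionDecoderLayer(nn.Module):
    """Illustrative decoder layer: text hidden states attend over a
    pre-computed multimodal embedding used as cross-attention context."""

    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(hidden_size)
        self.norm2 = nn.LayerNorm(hidden_size)

    def forward(self, text_hidden: torch.Tensor, mm_embedding: torch.Tensor) -> torch.Tensor:
        # text_hidden:  (batch, seq_len, hidden_size) from the language model
        # mm_embedding: (batch, feature_size, hidden_size) from the vision encoder
        x, _ = self.self_attn(text_hidden, text_hidden, text_hidden)
        x = self.norm1(text_hidden + x)
        # The multimodal embedding provides the keys/values (the cross-attention context).
        y, _ = self.cross_attn(x, mm_embedding, mm_embedding)
        return self.norm2(x + y)
```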
vLLM currently supports passing pre-computed multimodal embeddings (e.g., image, video, or audio embeddings) directly to the language model as input, by providing a tensor of shape (num_items, feature_size, hidden_size) in the appropriate field of the multi-modal data dictionary. However, vLLM does not natively support models where the multimodal embedding is used as cross-attention context in an encoder-decoder setup; current support is limited to decoder-only models, where the embeddings are merged with the text embeddings rather than used as cross-attention keys/values. This limitation is mentioned explicitly in the documentation and codebase, and multimodal encoder-decoder models with cross-attention are not yet supported in vLLM’s inference engine (see here, see here).
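As a reference point for the decoder-only path that is supported, here is a minimal sketch of passing pre-computed image embeddings via `multi_modal_data`. The model name, feature size, and hidden size are illustrative assumptions, and some models require extra metadata alongside the raw embedding tensor, so check the multimodal inputs docs for your model and vLLM version:

```python
import torch
from vllm import LLM, SamplingParams

# Illustrative model choice; any decoder-only VLM that accepts image embeddings works similarly.
llm = LLM(model="llava-hf/llava-1.5-7b-hf")

# Pre-computed embedding of shape (num_items, feature_size, hidden_size);
# 576 patches x 4096 hidden dims matches LLaVA-1.5-7B (assumed here).
image_embeds = torch.rand(1, 576, 4096)

outputs = llm.generate(
    {
        "prompt": "USER: <image>\nWhat is in the image? ASSISTANT:",
        "multi_modal_data": {"image": image_embeds},
    },
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```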
Would you like more detail on how to implement or extend vLLM for this use case?
We support Llama 3.2 and Whisper, which are multimodal encoder-decoder models. However, they are currently only available in the V0 engine.
Correct. vLLM V1 does not yet support multimodal encoder-decoder models like Llama 3.2 and Whisper; these are only available in the V0 engine. V1 is currently optimized for decoder-only transformers, and support for encoder-decoder models with cross-attention is planned but not yet implemented (see docs, see roadmap).
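If you need to run one of these encoder-decoder models today, a minimal sketch is to force the V0 engine before constructing the model. The `VLLM_USE_V1` environment variable toggle is an assumption based on recent vLLM releases, so verify it against your version's docs:

```python
import os

# Force the V0 engine, since encoder-decoder multimodal models (e.g., Whisper,
# Llama 3.2 Vision) are not yet supported on V1. Set this before importing vllm.
os.environ["VLLM_USE_V1"] = "0"

from vllm import LLM

# Whisper's decoder context is capped at 448 tokens; audio is then supplied at
# generation time via multi_modal_data={"audio": ...} in the usual multimodal API.
llm = LLM(model="openai/whisper-large-v3", max_model_len=448)
```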
Would you like more detail on the current limitations or the migration roadmap?