how does vllm process multimodal embedding requests?
vLLM processes multimodal embedding requests by allowing users to input pre-computed embeddings (such as image, video, or audio embeddings) directly into the model via the multi_modal_data dictionary. The embeddings must match the expected tensor shape for the model (e.g., for images: (num_items, feature_size, hidden_size)), and additional metadata (like image_grid_thw for Qwen2-VL) may be required depending on the model. The prompt should include the appropriate placeholder tokens, and the embeddings are merged with text embeddings during inference, enabling the model to process both modalities together. See Multimodal Inputs documentation for details and code examples.
Internally, vLLM uses a unified input processing pipeline (the Renderer component) to convert high-level API requests into token IDs and multimodal features, which are then consumed by the inference engine. This design supports both raw and pre-embedded multimodal data, and enables optimizations like prefix caching and chunked prefill. For more, see Unified Input Formatting RFC.
Would you like more detail or a code example?
Sources:
Hint: Mention @RunLLM in the post for followups.
@RunLLM i am asking about the embedding models, with multimodal support
For embedding models with multimodal support, vLLM allows you to input mixed data types (e.g., text and images) using the inputs field, where each entry can be a structured object specifying type and content (such as text or base64-encoded images). The model then produces embeddings for each input, supporting batching and mixed modalities. The API is compatible with Cohere’s Embed v2, and you can specify output types (float, binary, etc.) and truncation options. See Pooling Models: Embed for details and examples.
The embedding endpoint processes each input independently, batching them for efficiency, and supports multimodal models like Qwen3-VL and SigLIP. For code and request format, refer to the documentation.
Would you like a code example or more technical details?
Sources:
Hint: Mention @RunLLM in the post for followups.
where is the protocol, I only found string and ints: vllm/vllm/entrypoints/pooling/base/protocol.py at 2a16ece2d342c0c154a4949ad317b521f8c04ec4 · vllm-project/vllm · GitHub
@RunLLM where is the protocol, I only found string and ints: vllm/vllm/entrypoints/pooling/base/protocol.py at 2a16ece2d342c0c154a4949ad317b521f8c04ec4 · vllm-project/vllm · GitHub
The protocol at vllm/entrypoints/pooling/base/protocol.py defines the schema for embedding and pooling requests, but by default, the input field only accepts strings, lists of strings, ints, or lists of ints. Multimodal support (e.g., images, audio) is not directly exposed in this protocol file; instead, multimodal data is typically passed via the messages field or other structured fields, depending on the endpoint and runner.
For true multimodal embedding requests (e.g., text + image), you must use the correct runner (like --runner pooling) and follow the API format described in the multimodal documentation. The protocol file itself does not define multimodal types directly in the input field.
Would you like more detail on how to structure a multimodal embedding request?
Sources:
Hint: Mention @RunLLM in the post for followups.
@RunLLM you gave me this link: vllm/docs/models/pooling_models/embed.md at main · vllm-project/vllm · GitHub, and here, it’s using inputs for both text and images
Correct—the Pooling Models: Embed documentation shows that for embedding models with multimodal support, you should use the inputs field. Each item in inputs is an object with a content list, where you can mix text and image entries (e.g., {"type": "text", "text": ...} and {"type": "image_url", "image_url": {"url": ...}}). This is supported for models like SigLIP and Qwen3-VL.
However, this protocol is specific to the /v2/embed endpoint and not the /v1/embeddings endpoint, which only accepts strings or ints. The inputs field for multimodal batching is documented and supported in the vLLM embedding API for models that implement it, as shown in the linked documentation.
Would you like a code example for a multimodal embedding request?
Sources:
Hint: Mention @RunLLM in the post for followups.