Compressed multimodal embedding inputs

vLLM supports multimodal inputs through the multi_modal_data argument. This works well, but the argument tends to be quite large: for Qwen3-VL-4B with a 384x384 photo, that is roughly a 2 MB payload (~500k floats).

One idea would be to send a compressed payload (a few KB) and provide a decoder f to decompress it, like this:

outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": {"image": f(image_embeds)},
})

Can vLLM integrate an arbitrary f? One example would be a torch model decoding from discrete codes (e.g. RQ-VAE style).
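
For concreteness, f could be something like the rough sketch below (names and sizes are made up): an RQ-VAE-style decoder that reconstructs the embedding tensor by summing codebook vectors across residual quantization levels, so only the integer codes need to be sent over the wire.

import torch
import torch.nn as nn

class RQDecoder(nn.Module):
    """Hypothetical RQ-VAE-style decoder: integer codes -> embedding tensor."""

    def __init__(self, num_levels: int, codebook_size: int, hidden_size: int):
        super().__init__()
        # one codebook per residual quantization level
        self.codebooks = nn.ModuleList(
            [nn.Embedding(codebook_size, hidden_size) for _ in range(num_levels)]
        )

    def forward(self, codes: torch.LongTensor) -> torch.Tensor:
        # codes: (num_tokens, num_levels) integer indices, only a few KB on the wire
        # reconstruction = sum of the selected codebook vectors across levels
        return sum(cb(codes[:, level]) for level, cb in enumerate(self.codebooks))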

Any recommendations welcome.

vLLM does not natively support passing an arbitrary decoder function f (such as a torch model for decompressing embeddings) as part of the multi_modal_data pipeline. The multi_modal_data field expects pre-processed data in the correct format (e.g., image tensors or embeddings), and there is no documented API or mechanism to register or execute a custom decompression/decoding function server-side during inference. You must decompress or decode your data before passing it to vLLM, as vLLM expects the input to already be in the required tensor format for the model (multimodal inputs docs, embedding inputs).
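
In other words, vLLM only consumes already-expanded data. As a hedged sketch of what the documented embedding-input format looks like (based on the Qwen2-VL example in the embedding inputs docs; the exact fields and shapes for Qwen3-VL should be verified there, and the file names below are placeholders):

import torch

# Pre-computed visual embeddings, already decompressed before reaching vLLM.
image_embeds = torch.load("image_embeds.pt")      # e.g. (num_image_patches, hidden_size)
image_grid_thw = torch.load("image_grid_thw.pt")  # e.g. (1, 3); needed by the Qwen-VL family

outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": {
        "image": {
            "image_embeds": image_embeds,
            "image_grid_thw": image_grid_thw,
        }
    },
})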

If you want to reduce payload size, you would need to handle decompression or decoding outside of vLLM, possibly by running a pre-processing service that expands compressed codes into the expected tensor format before sending the request to vLLM. There is no built-in support for server-side arbitrary decoding functions in the current vLLM API (RFC: Unified Input Formatting and Processing via Renderer).
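
A minimal sketch of that workaround, reusing the hypothetical RQDecoder from the question above (llm, prompt, and codes are assumed to exist; the weights file name is a placeholder): the decoder runs in your own pre-processing step, and only the expanded tensor is handed to vLLM.

# Run the decoder outside of vLLM, in the client or a pre-processing service.
decoder = RQDecoder(num_levels=4, codebook_size=1024, hidden_size=2048)  # made-up sizes
decoder.load_state_dict(torch.load("rq_decoder.pt"))
decoder.eval()

with torch.no_grad():
    image_embeds = decoder(codes)  # codes: the few-KB compressed payload

outputs = llm.generate({
    "prompt": prompt,
    # The exact tensor shape/format expected here is model specific;
    # check the embedding inputs docs for your model.
    "multi_modal_data": {"image": image_embeds},
})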

Would you like more detail on possible workarounds or architectural suggestions?
