Compressed multimodal embedding inputs

vLLM supports multimodal inputs through the multi_modal_data argument. This works well, but the argument tends to be quite large: for Qwen3-VL-4B with a 384x384 photo, that is roughly a 2 MB payload (~500k floats).

One idea would be to send a compressed payload (a few KB) and provide a decoder f to decompress it, like this:

outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": {"image": f(image_embeds)},
})

Can vLLM integrate an arbitrary f? One example would be a torch model decoding from discrete codes (e.g. RQ-VAE style).
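
For concreteness, f could be something like the rough sketch below (names and sizes are made up): an RQ-VAE-style decoder that reconstructs the embedding tensor by summing codebook vectors across residual quantization levels, so only the integer codes need to be sent over the wire.

import torch
import torch.nn as nn

class RQDecoder(nn.Module):
    """Hypothetical RQ-VAE-style decoder: integer codes -> embedding tensor."""

    def __init__(self, num_levels: int, codebook_size: int, hidden_size: int):
        super().__init__()
        # one codebook per residual quantization level
        self.codebooks = nn.ModuleList(
            [nn.Embedding(codebook_size, hidden_size) for _ in range(num_levels)]
        )

    def forward(self, codes: torch.LongTensor) -> torch.Tensor:
        # codes: (num_tokens, num_levels) integer indices, only a few KB on the wire
        # reconstruction = sum of the selected codebook vectors across levels
        return sum(cb(codes[:, level]) for level, cb in enumerate(self.codebooks))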

Any recommendations welcome.

vLLM does not natively support passing an arbitrary decoder function f (such as a torch model for decompressing embeddings) as part of the multi_modal_data pipeline. The multi_modal_data field expects pre-processed data in the correct format (e.g., image tensors or embeddings), and there is no documented API or mechanism to register or execute a custom decompression/decoding function server-side during inference. You must decompress or decode your data before passing it to vLLM, as vLLM expects the input to already be in the required tensor format for the model (multimodal inputs docs, embedding inputs).
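
In other words, vLLM only consumes already-expanded data. As a hedged sketch of what the documented embedding-input format looks like (based on the Qwen2-VL example in the embedding inputs docs; the exact fields and shapes for Qwen3-VL should be verified there, and the file names below are placeholders):

import torch

# Pre-computed visual embeddings, already decompressed before reaching vLLM.
image_embeds = torch.load("image_embeds.pt")      # e.g. (num_image_patches, hidden_size)
image_grid_thw = torch.load("image_grid_thw.pt")  # e.g. (1, 3); needed by the Qwen-VL family

outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": {
        "image": {
            "image_embeds": image_embeds,
            "image_grid_thw": image_grid_thw,
        }
    },
})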

If you want to reduce payload size, you would need to handle decompression or decoding outside of vLLM, possibly by running a pre-processing service that expands compressed codes into the expected tensor format before sending the request to vLLM. There is no built-in support for server-side arbitrary decoding functions in the current vLLM API (RFC: Unified Input Formatting and Processing via Renderer).
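
A minimal sketch of that workaround, reusing the hypothetical RQDecoder from the question above (llm, prompt, and codes are assumed to exist; the weights file name is a placeholder): the decoder runs in your own pre-processing step, and only the expanded tensor is handed to vLLM.

# Run the decoder outside of vLLM, in the client or a pre-processing service.
decoder = RQDecoder(num_levels=4, codebook_size=1024, hidden_size=2048)  # made-up sizes
decoder.load_state_dict(torch.load("rq_decoder.pt"))
decoder.eval()

with torch.no_grad():
    image_embeds = decoder(codes)  # codes: the few-KB compressed payload

outputs = llm.generate({
    "prompt": prompt,
    # The exact tensor shape/format expected here is model specific;
    # check the embedding inputs docs for your model.
    "multi_modal_data": {"image": image_embeds},
})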

Would you like more detail on possible workarounds or architectural suggestions?
