How do I precompute multimodal embeddings?

I want to efficiently generate embeddings with Qwen3-VL-Embedding-8B for a single image using multiple different text instructions. Embedding generation is relatively slow because the model expects instructions in the system prompt (while the image is passed in the user prompt).

Since the system prompt changes with each request, the embedding generation process does not benefit from prefix caching. Instead, each embedding request reruns the computationally expensive vision encoder despite processing the same image (I think).
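The caching problem can be illustrated without any model: prefix caching only reuses KV entries for the longest common token prefix of two requests, and because the system prompt comes first, a changed instruction cuts the shared prefix off before the image tokens ever appear. A toy sketch (plain strings standing in for real tokens, purely for illustration):

```python
def common_prefix_len(a: list[str], b: list[str]) -> int:
    """Length of the shared leading run of two token sequences
    (roughly what prefix caching can reuse)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Toy "token" sequences: the system prompt precedes the image tokens.
req1 = ["<sys>", "Represent", "the", "colors", "</sys>", "<img_tok_1>", "<img_tok_2>"]
req2 = ["<sys>", "Represent", "the", "shapes", "</sys>", "<img_tok_1>", "<img_tok_2>"]

# The shared prefix ends at the word that differs, before the image tokens,
# so the expensive image portion of the KV cache cannot be reused.
print(common_prefix_len(req1, req2))  # → 3
```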

My hope is that precomputing the vision encoder embeddings for the image once, and then passing these embeddings with every subsequent embedding request, would avoid the repeated vision encoder processing. The feature to pass image embeddings is documented here: Multimodal Inputs - vLLM

I want to accomplish something like the code below, except I don’t understand how to precompute the image_embeds and the image_grid_thw.

import requests
import torch
from typing import Any

from vllm.utils.serial_utils import tensor2base64

instructions = [
    "Represent the colors of the image.",
    "Represent the shapes of the image.",
    "Represent the moods of the image.",
]

# Create the precomputed image embeddings - but how?
embeds = {
    "type": "image_embeds",
    "image_embeds": {
        "image_embeds": tensor2base64(torch.zeros(220, 4096)),  # Shape: (image_feature_size, hidden_size)
        "image_grid_thw": tensor2base64(torch.tensor([[1, 22, 40]])),  # Shape: (num_images, 3)
    },
}

# Create messages object by passing image embeddings
def create_messages(instr: str, embeds: dict[str, Any]) -> list[dict[str, Any]]:
    messages=[
        {
            "role": "system",
            "content": instr,
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": ""},
                embeds,
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": ""}],
        },
    ]

    return messages

model = "Qwen/Qwen3-VL-Embedding-8B"
url = "http://localhost:8000/v1/embeddings"

# Generate embeddings
for instr in instructions:
    payload = {
        "messages": create_messages(instr=instr, embeds=embeds),
        "model": model,
    }

    response = requests.post(url, json=payload)
    print(f"Status: {response.status_code}")

I am using online inference (vllm serve ...) so I would love to see a solution to precomputing image embeddings with either the /embeddings or the /pooling endpoint. Is that possible?

You can precompute image embeddings and image_grid_thw for Qwen3-VL-Embedding-8B by running the image through the model’s vision encoder offline, then passing these as base64-encoded tensors in subsequent embedding requests. This avoids redundant vision encoder computation for each instruction. The recommended approach is to use the model’s processor and vision encoder to extract the embeddings and grid_thw, then serialize them as shown in the official vLLM documentation and example scripts.

Here is a code outline for precomputing the embeddings:

from transformers import AutoProcessor, AutoModel
import torch
from PIL import Image
from vllm.utils.serial_utils import tensor2base64

model_name = "Qwen/Qwen3-VL-Embedding-8B"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval().cuda()

image = Image.open("your_image.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt").to("cuda")
with torch.no_grad():
    # Replace with the correct method for your model to get image_embeds and grid_thw
    image_embeds, deepstack_features = model.visual(inputs["pixel_values"], grid_thw=inputs["image_grid_thw"])
    # For Qwen3-VL, you may need to concatenate deepstack features as well
    hidden_states = torch.cat([image_embeds] + deepstack_features, dim=1)
    image_grid_thw = inputs["image_grid_thw"]

base64_image_embeds = tensor2base64(hidden_states.cpu())
base64_image_grid_thw = tensor2base64(image_grid_thw.cpu())

You can then use these base64 strings in your embedding requests as shown in your example. This method is supported and documented for online inference in vLLM, and is the recommended way to avoid redundant vision encoder computation for repeated instructions on the same image (see docs, see example). Would you like a more detailed step-by-step code example?
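To connect the two halves, here is a minimal sketch of wiring the precomputed base64 strings into the request body from the original post. The `build_payload` helper is an assumption for illustration, not a vLLM API; the message structure is taken directly from the thread:

```python
from typing import Any


def build_payload(instr: str, b64_embeds: str, b64_grid_thw: str, model: str) -> dict[str, Any]:
    """Assemble the /v1/embeddings request body, reusing precomputed
    base64-serialized vision tensors instead of raw image data."""
    embeds_content = {
        "type": "image_embeds",
        "image_embeds": {
            "image_embeds": b64_embeds,      # (image_feature_size, hidden_size) tensor, base64-encoded
            "image_grid_thw": b64_grid_thw,  # grid_thw tensor, base64-encoded
        },
    }
    messages = [
        {"role": "system", "content": instr},
        {"role": "user", "content": [{"type": "text", "text": ""}, embeds_content]},
        {"role": "assistant", "content": [{"type": "text", "text": ""}]},
    ]
    return {"messages": messages, "model": model}
```

Each instruction then reuses the same two base64 strings, so only the text side of the prompt changes between requests; POST the returned dict to /v1/embeddings exactly as in the original snippet.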

Hint: Mention @RunLLM in the post for followups.

@RunLLM This example is helpful, but precomputing the embeddings offline does not work for our use case because our vLLM server resides on a remote host. We cannot run transformers on our production infrastructure, so we are limited to vLLM’s online inference API.

Yes, please provide a detailed example that demonstrates how to create the image_embeds and the deepstack_features using only online inference. Please use either the /embeddings or the /pooling endpoint in your example.

You cannot obtain precomputed image_embeds and deepstack_features using only the vLLM online inference API. Neither /embeddings nor /pooling exposes the raw vision encoder outputs (image_embeds, image_grid_thw, or deepstack_features); the online API returns only the final pooled embedding vector, not the intermediate vision features your use case requires. This limitation is noted in the vLLM documentation and examples, which precompute these features offline with HuggingFace Transformers rather than via the vLLM server (see docs, see example).

Therefore, with the current vLLM online API, it is not possible to extract and reuse image_embeds or deepstack_features for repeated requests. Each /embeddings or /pooling call with an image will re-run the vision encoder. If you need this feature, you would need to request it as a new API feature in vLLM or run the vision encoder offline as shown in the documentation.

Would you like more detail or suggestions for possible workarounds?


The vision encoder’s output is cached in an encoder feature cache that is separate from prefix caching. So even when prefix caching misses and the engine needs the vision embeddings, it can still retrieve them from the encoder feature cache instead of re-running the vision encoder.
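Whether the cache is actually being hit can be spot-checked from the server’s Prometheus metrics endpoint (/metrics on a `vllm serve` instance). A hedged sketch; the exact metric names vary across vLLM versions, so this just filters loosely for cache-related counters:

```python
import urllib.request


def cache_metrics(metrics_text: str) -> list[str]:
    """Return the non-comment metric lines that mention a cache counter.
    Metric names differ across vLLM versions, so we filter loosely."""
    return [
        line
        for line in metrics_text.splitlines()
        if "cache" in line and not line.startswith("#")
    ]


def print_server_cache_metrics(url: str = "http://localhost:8000/metrics") -> None:
    # Assumes a running `vllm serve` instance exposing Prometheus metrics;
    # adjust host/port for your deployment.
    with urllib.request.urlopen(url) as resp:
        for line in cache_metrics(resp.read().decode()):
            print(line)
```

Comparing the prefix-cache counters against the multimodal/encoder-cache counters over a batch of requests should show the pattern described above: low prefix reuse, high encoder reuse.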

@DarkLight1337 Thank you! That info is helpful and prevented us from going down a dead-end rabbit hole. Our vLLM logs match your statement: We have low prefix cache hit rates (because of the changing system prompts) but high multi-modal cache hit rates, so it looks like our inference pipeline is already as efficient as our constraints allow.