I want to efficiently generate embeddings with Qwen3-VL-Embedding-8B for a single image combined with multiple different text instructions. Embedding generation is relatively slow because the model expects the instruction in the system prompt (while the image is passed in the user prompt).
Since the system prompt changes with each request, and the instruction precedes the image tokens, the requests never share a cached prefix, so prefix caching does not help. As far as I can tell, each embedding request therefore reruns the computationally expensive vision encoder even though it processes the same image.
My hope is that precomputing the vision encoder output for the image once, and then passing those vision embeddings with every subsequent embedding request, would avoid the repeated vision encoder work. The feature for passing image embeddings is documented here: Multimodal Inputs - vLLM
I want to accomplish something like the code below, but I don’t understand how to precompute the image_embeds and the image_grid_thw.
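For the image_grid_thw part at least, my current understanding is that the grid follows deterministically from the preprocessed image size. Below is a sketch based on Qwen2-VL's smart_resize preprocessing; the patch size of 14 and spatial merge factor of 2 are assumptions carried over from Qwen2-VL and not verified for this checkpoint:

```python
import math

# Assumed Qwen2-VL-style preprocessing geometry (not verified for
# Qwen3-VL-Embedding-8B): the image is resized so both sides are multiples
# of patch_size * merge_size, then split into 14x14 patches.
PATCH_SIZE = 14
MERGE_SIZE = 2
FACTOR = PATCH_SIZE * MERGE_SIZE  # 28

def smart_resize(height: int, width: int,
                 min_pixels: int = 56 * 56,
                 max_pixels: int = 14 * 14 * 4 * 1280) -> tuple[int, int]:
    """Round the image size to multiples of FACTOR within a pixel budget."""
    h = round(height / FACTOR) * FACTOR
    w = round(width / FACTOR) * FACTOR
    if h * w > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h = math.floor(height / beta / FACTOR) * FACTOR
        w = math.floor(width / beta / FACTOR) * FACTOR
    elif h * w < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * beta / FACTOR) * FACTOR
        w = math.ceil(width * beta / FACTOR) * FACTOR
    return h, w

def grid_thw(height: int, width: int) -> tuple[int, int, int]:
    """(temporal, patch rows, patch cols) for a single static image."""
    h, w = smart_resize(height, width)
    return 1, h // PATCH_SIZE, w // PATCH_SIZE

t, h, w = grid_thw(308, 560)               # e.g. a 308x560 input image
feature_size = t * h * w // MERGE_SIZE**2  # tokens after 2x2 spatial merge
print((t, h, w), feature_size)             # (1, 22, 40) 220
```

Under these assumptions, a grid of [1, 22, 40] corresponds to a 308x560 image and 220 merged tokens, which is why the placeholder tensor in the code below has shape (220, 4096). The actual image_embeds values would still have to come from the model's vision tower, which is the part I don't know how to run separately.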
import torch  # needed for the placeholder tensors below
import requests
from typing import Any

from vllm.utils.serial_utils import tensor2base64
instructions = [
    "Represent the colors of the image.",
    "Represent the shapes of the image.",
    "Represent the moods of the image.",
]
# Create the precomputed image embeddings - but how?
embeds = {
    "type": "image_embeds",
    "image_embeds": {
        # Shape: (image_feature_size, hidden_size)
        "image_embeds": tensor2base64(torch.zeros(220, 4096)),
        # Shape: (3,)
        "image_grid_thw": tensor2base64(torch.tensor([1, 22, 40])),
    },
}
# Create messages object by passing image embeddings
def create_messages(instr: str, embeds: dict[str, Any]) -> list[dict[str, Any]]:
    messages = [
        {
            "role": "system",
            "content": instr,
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": ""},
                embeds,
            ],
        },
        {
            "role": "assistant",
            "content": [{"type": "text", "text": ""}],
        },
    ]
    return messages
model = "Qwen/Qwen3-VL-Embedding-8B"
url = "http://localhost:8000/v1/embeddings"
# Generate embeddings
for instr in instructions:
    payload = {
        "messages": create_messages(instr=instr, embeds=embeds),
        "model": model,
    }
    response = requests.post(url, json=payload)
    print(f"Status: {response.status_code}")
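Once a request succeeds, I would expect to read the vector out of an OpenAI-style embeddings response body. The schema here is an assumption on my part, and the sample dict is a stub for illustration, not real server output:

```python
from typing import Any

def extract_embedding(body: dict[str, Any]) -> list[float]:
    # Assumed OpenAI-style embeddings schema: {"data": [{"embedding": [...]}]}
    return body["data"][0]["embedding"]

# Stubbed response body; a real one would come from response.json():
sample = {
    "object": "list",
    "data": [{"object": "embedding", "index": 0, "embedding": [0.1, 0.2, 0.3]}],
}
print(extract_embedding(sample))  # [0.1, 0.2, 0.3]
```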
I am using online inference (vllm serve ...), so I would love to see a solution that precomputes the image embeddings via either the /embeddings or the /pooling endpoint. Is that possible?