Issue: Unable to pass precomputed image embeddings to vLLM

Hi, I’m trying to use precomputed image embeddings with vLLM following the documentation here: https://docs.vllm.ai/en/stable/features/multimodal_inputs/#image-embedding-inputs.
According to the docs, vLLM accepts:

  • A single embedding as a 3D tensor of shape (1, feature_size, hidden_size)

  • Or multiple embeddings as a list of 2D tensors, each (feature_size, hidden_size)
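For reference, the two documented input formats can be sketched like this (shapes are illustrative; I'm using feature_size=416 and hidden_size=1024 to match my setup):

```python
import torch

# Single image: one 3-D tensor of shape (1, feature_size, hidden_size)
single_image = torch.randn(1, 416, 1024)

# Multiple images: a list of 2-D tensors, each (feature_size, hidden_size)
multiple_images = [torch.randn(416, 1024), torch.randn(416, 1024)]

print(single_image.shape)         # torch.Size([1, 416, 1024])
print(multiple_images[0].shape)   # torch.Size([416, 1024])
```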

However, when I send these embeddings (even when stacked to (2, 416, 1024) for two images), vLLM throws:

ValueError: image_embeds has rank 3 but expected 2

This is my code and how I start the server:

import torch
import base64
import io
from openai import OpenAI
from vllm.utils.serial_utils import tensor2base64

def run_multi_image_inference():
    client = OpenAI(
        base_url="http://localhost:8001/v1",
        api_key="EMPTY",
    )

    # 1. Prepare your multimodal data
    # Real image embeddings from the Qwen-VL vision encoder usually have 416 tokens
    img1_embed = torch.randn(416, 1024)
    img1_thw = torch.tensor([1, 32, 52])

    img2_embed = torch.randn(416, 1024)
    img2_thw = torch.tensor([1, 32, 52])

    # 2. Build the messages list
    # Note: interleave the image_embeds blocks with the text prompt
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_embeds",
                    "image_embeds": {
                        "image_embeds": tensor2base64(img1_embed),
                        "image_grid_thw": tensor2base64(img1_thw),
                    },
                },
                {
                    "type": "image_embeds",
                    "image_embeds": {
                        "image_embeds": tensor2base64(img2_embed),
                        "image_grid_thw": tensor2base64(img2_thw),
                    },
                },
                {
                    "type": "text",
                    "text": "Compare these two images and tell me the differences.",
                },
            ],
        }
    ]

    # 3. Request completion
    try:
        response = client.chat.completions.create(
            model="Qwen/Qwen3-VL-2B-Instruct",  # use the exact name from your server
            messages=messages,
            max_tokens=128,
            temperature=0.2,
        )
        print("\n--- Model Response ---\n")
        print(response.choices[0].message.content)
    except Exception as e:
        print(f"Inference failed: {e}")


if __name__ == "__main__":
    run_multi_image_inference()

vllm serve Qwen/Qwen3-VL-2B-Instruct --port 8001 --enable-mm-embeds --limit-mm-per-prompt '{"image": 2}' --max-model-len 8192 --trust-remote-code

Is there a mismatch between:

  • The documentation
    vs

  • The actual implementation of the multimodal embedding parser?

Or is there an additional constraint on how multi-image embeddings must be packaged that isn’t documented?

Yes, there is a subtle but important constraint: for multi-image embedding input via the OpenAI API, vLLM currently supports only one "image_embeds" block per request. Although the documentation suggests that multiple content parts can use {"type": "image_embeds"}, in practice the embeddings for multiple images must be packed together into a single tensor or dict and sent as one "image_embeds" block, not as separate blocks in the message content. If you send multiple "image_embeds" blocks, vLLM will not merge them and will raise a shape or type error.

To pass multiple image embeddings, stack them into a single tensor of shape (num_images, feature_size, hidden_size) and send them as a single "image_embeds" block, e.g.:

embeds = {
    "type": "image_embeds",
    "image_embeds": {
        "image_embeds": tensor2base64(torch.stack([img1_embed, img2_embed])),
        "image_grid_thw": tensor2base64(torch.stack([img1_thw, img2_thw])),
    }
}
# Then use only one such block in your message content.

This matches the expected input for Qwen2/3-VL and avoids the ValueError. Would you like a step-by-step code example for this?

Hint: Mention @RunLLM in the post for followups.

@RunLLM Thank you — that constraint makes sense. However, I have already tried stacking all image embeddings into a single image_embeds block (as a (num_images, F, H) tensor for image_embeds and (num_images, 3) for image_grid_thw), and I still get an internal error from vLLM:

Inference failed: Error code: 500 - {'error': {'message': 'list index out of range', 'type': 'Internal Server Error', 'param': None, 'code': 500}}

You can try this code: [Bug]: When using image_embeds, ImageProcessorItems are used instead of ImageEmbeddingItems, causing an out-of-bounds array error. · Issue #31708 · vllm-project/vllm · GitHub

@DarkLight1337 I tried the exact sample code from this comment but I still hit a fatal error inside vLLM when running Qwen3-VL-2B-Instruct with image embeddings.

RuntimeError: split_with_sizes expects split_sizes to sum exactly to 1024
(input tensor's size at dimension -1), but got split_sizes=[2048, 6144]

(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.13.0) with config: model=‘Qwen/Qwen3-VL-2B-Instruct’, speculative_config=None, tokenizer=‘Qwen/Qwen3-VL-2B-Instruct’, skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend=‘auto’, disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=‘’, reasoning_parser_plugin=‘’, enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False), seed=0, served_model_name=Qwen/Qwen3-VL-2B-Instruct, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={‘level’: None, ‘mode’: <CompilationMode.VLLM_COMPILE: 3>, ‘debug_dump_path’: None, ‘cache_dir’: ‘/home/hoangtd/.cache/vllm/torch_compile_cache/99725bf3c4’, ‘compile_cache_save_format’: ‘binary’, ‘backend’: ‘inductor’, ‘custom_ops’: [‘none’], ‘splitting_ops’: [‘vllm::unified_attention’, ‘vllm::unified_attention_with_output’, ‘vllm::unified_mla_attention’, ‘vllm::unified_mla_attention_with_output’, ‘vllm::mamba_mixer2’, ‘vllm::mamba_mixer’, ‘vllm::short_conv’, ‘vllm::linear_attention’, ‘vllm::plamo2_mamba_mixer’, ‘vllm::gdn_attention_core’, ‘vllm::kda_attention’, ‘vllm::sparse_attn_indexer’], ‘compile_mm_encoder’: False, ‘compile_sizes’: , ‘compile_ranges_split_points’: [2048], ‘inductor_compile_config’: {‘enable_auto_functionalized_v2’: False, 
‘combo_kernels’: True, ‘benchmark_combo_kernel’: True}, ‘inductor_passes’: {}, ‘cudagraph_mode’: <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, ‘cudagraph_num_of_warmups’: 1, ‘cudagraph_capture_sizes’: [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], ‘cudagraph_copy_inputs’: False, ‘cudagraph_specialize_lora’: True, ‘use_inductor_graph_partition’: False, ‘pass_config’: {‘fuse_norm_quant’: False, ‘fuse_act_quant’: False, ‘fuse_attn_quant’: False, ‘eliminate_noops’: True, ‘enable_sp’: False, ‘fuse_gemm_comms’: False, ‘fuse_allreduce_rms’: False}, ‘max_cudagraph_capture_size’: 512, ‘dynamic_shapes_config’: {‘type’: <DynamicShapesType.BACKED: ‘backed’>, ‘evaluate_guards’: False}, ‘local_cache_dir’: ‘/home/hoangtd/.cache/vllm/torch_compile_cache/99725bf3c4/rank_0_0/backbone’},
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=chatcmpl-9973e15133eb53c3,prompt_token_ids_len=428,mm_features=[MultiModalFeatureSpec(data={‘image_grid_thw’: MultiModalFieldElem(modality=‘image’, key=‘image_grid_thw’, data=tensor([ 1, 32, 52]), field=MultiModalBatchedField(keep_on_cpu=True)), ‘image_embeds’: MultiModalFieldElem(modality=‘image’, key=‘image_embeds’, data=tensor([[0., 0., 0., …, 0., 0., 0.],
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [dump_input.py:79] [0., 0., 0., …, 0., 0., 0.],
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [dump_input.py:79] [0., 0., 0., …, 0., 0., 0.],
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [dump_input.py:79] …,
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [dump_input.py:79] [0., 0., 0., …, 0., 0., 0.],
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [dump_input.py:79] [0., 0., 0., …, 0., 0., 0.],
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [dump_input.py:79] [0., 0., 0., …, 0., 0., 0.]]), field=MultiModalFlatField(keep_on_cpu=False, slices=[[slice(0, 1664, None)]], dim=0))}, modality=‘image’, identifier=‘422d83049ac831542d162035c97b33b27331b51dd9c66cc31d8ad99a848f2f62’, mm_position=PlaceholderRange(offset=4, length=416, is_embed=None))],sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, seed=None, stop=, stop_token_ids=[151643], bad_words=, include_stop_str_in_output=False, ignore_eos=False, max_tokens=8179, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, structured_outputs=None, extra_args=None),block_ids=([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27],),num_computed_tokens=0,lora_request=None,prompt_embeds_shape=None)], scheduled_cached_reqs=CachedRequestData(req_ids=, resumed_req_ids=, new_token_ids=, all_token_ids={}, new_block_ids=, num_computed_tokens=, num_output_tokens=), num_scheduled_tokens={chatcmpl-9973e15133eb53c3: 428}, total_num_scheduled_tokens=428, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={chatcmpl-9973e15133eb53c3: [0]}, num_common_prefix_blocks=[27], finished_req_ids=, free_encoder_mm_hashes=, preempted_req_ids=, pending_structured_output_tokens=false, kv_connector_metadata=null, ec_connector_metadata=null)
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.0030477480528275924, prefix_cache_stats=PrefixCacheStats(reset=False, requests=1, queries=428, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, kv_cache_eviction_events=, spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None)
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] Traceback (most recent call last):
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] File “/home/hoangtd/.dev/lib/python3.10/site-packages/vllm/v1/engine/core.py”, line 859, in run_engine_core
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] engine_core.run_busy_loop()
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] File “/home/hoangtd/.dev/lib/python3.10/site-packages/vllm/v1/engine/core.py”, line 886, in run_busy_loop
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] self._process_engine_step()
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] File “/home/hoangtd/.dev/lib/python3.10/site-packages/vllm/v1/engine/core.py”, line 919, in _process_engine_step
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] File “/home/hoangtd/.dev/lib/python3.10/site-packages/vllm/v1/engine/core.py”, line 351, in step
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] model_output = future.result()
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] File “/usr/local/lib/python3.10/concurrent/futures/_base.py”, line 451, in result
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] return self.__get_result()
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] File “/usr/local/lib/python3.10/concurrent/futures/_base.py”, line 403, in __get_result
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] raise self._exception
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] File “/home/hoangtd/.dev/lib/python3.10/site-packages/vllm/v1/executor/uniproc_executor.py”, line 79, in collective_rpc
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] File “/home/hoangtd/.dev/lib/python3.10/site-packages/vllm/v1/serial_utils.py”, line 461, in run_method
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] return func(*args, **kwargs)
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] File “/home/hoangtd/.dev/lib/python3.10/site-packages/vllm/v1/worker/worker_base.py”, line 369, in execute_model
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] return self.worker.execute_model(scheduler_output, *args, **kwargs)
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] File “/home/hoangtd/.dev/lib/python3.10/site-packages/torch/utils/_contextlib.py”, line 120, in decorate_context
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] return func(*args, **kwargs)
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] File “/home/hoangtd/.dev/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py”, line 623, in execute_model
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] output = self.model_runner.execute_model(
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] File “/home/hoangtd/.dev/lib/python3.10/site-packages/torch/utils/_contextlib.py”, line 120, in decorate_context
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] return func(*args, **kwargs)
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] File “/home/hoangtd/.dev/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py”, line 3060, in execute_model
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] ) = self._preprocess(
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] File “/home/hoangtd/.dev/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py”, line 2471, in _preprocess
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] inputs_embeds_scheduled = self.model.embed_input_ids(
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] File “/home/hoangtd/.dev/lib/python3.10/site-packages/vllm/model_executor/models/qwen3_vl.py”, line 1992, in embed_input_ids
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] ) = self._compute_deepstack_embeds(
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] File “/home/hoangtd/.dev/lib/python3.10/site-packages/vllm/model_executor/models/qwen3_vl.py”, line 1939, in _compute_deepstack_embeds
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] ) = torch.split(
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] File “/home/hoangtd/.dev/lib/python3.10/site-packages/torch/functional.py”, line 173, in split
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] return tensor.split(split_size_or_sections, dim)
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] File “/home/hoangtd/.dev/lib/python3.10/site-packages/torch/_tensor.py”, line 1030, in split
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] return torch._VF.split_with_sizes(self, split_size, dim)
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] RuntimeError: split_with_sizes expects split_sizes to sum exactly to 1024 (input tensor’s size at dimension -1), but got split_sizes=[2048, 6144]
(EngineCore_DP0 pid=2053438) Process EngineCore_DP0:
(APIServer pid=2053237) ERROR 01-12 16:57:49 [async_llm.py:538] AsyncLLM output_handler failed.
(APIServer pid=2053237) ERROR 01-12 16:57:49 [async_llm.py:538] Traceback (most recent call last):
(APIServer pid=2053237) ERROR 01-12 16:57:49 [async_llm.py:538] File “/home/hoangtd/.dev/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py”, line 490, in output_handler
(APIServer pid=2053237) ERROR 01-12 16:57:49 [async_llm.py:538] outputs = await engine_core.get_output_async()
(APIServer pid=2053237) ERROR 01-12 16:57:49 [async_llm.py:538] File “/home/hoangtd/.dev/lib/python3.10/site-packages/vllm/v1/engine/core_client.py”, line 895, in get_output_async
(APIServer pid=2053237) ERROR 01-12 16:57:49 [async_llm.py:538] raise self._format_exception(outputs) from None
(APIServer pid=2053237) ERROR 01-12 16:57:49 [async_llm.py:538] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=2053237) INFO: 127.0.0.1:14270 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error

Likely the shape of the embeddings is different for this model. I only tested that code on Qwen2.5-VL.

Thanks — I’ve updated my code to match Qwen3-VL’s embedding structure, and it now works correctly for a single image embedding. So the embedding format itself is not the issue.
However, the problem appears when using multiple image embeddings. I tested both formats: (1) the structure shown in the documentation, which fails with a 400 error: size_per_item should be a 1-D tensor, but found shape: torch.Size([2, 1]); and (2) stacking multiple images with torch.stack(), which avoids the 400 error but causes vLLM to crash with a 500 internal error: list index out of range. Since both approaches fail while single-image input works, it seems vLLM's Qwen3-VL path does not yet handle multiple image embeddings properly.

Can you show your new code?

Here is my latest code

import torch
from openai import OpenAI
import base64
import io

def inference_with_vllm_embeds():
    client = OpenAI(
        base_url="http://localhost:8001/v1",
        api_key="EMPTY",
    )

    model = client.models.list().data[0].id

    prompt = "OCR:"

    # Two identical dummy image embeddings, stacked then flattened to (440, 8192)
    image_embedding = torch.zeros((220, 8192))
    two_image_embeddings = torch.stack([image_embedding, image_embedding])
    two_image_embeddings = two_image_embeddings.view(-1, 8192)

    # Serialize the tensor with torch.save and base64-encode the bytes
    buffer = io.BytesIO()
    torch.save(two_image_embeddings, buffer)
    buffer.seek(0)
    binary_data = buffer.read()
    base64_image_embedding = base64.b64encode(binary_data).decode("utf-8")

    # One (1, 22, 40) grid per image, flattened to shape (2, 3)
    thw_embedding = torch.tensor([[1, 22, 40]])
    two_thw_embeddings = torch.stack([thw_embedding, thw_embedding])
    two_thw_embeddings = two_thw_embeddings.view(-1, 3)

    buffer = io.BytesIO()
    torch.save(two_thw_embeddings, buffer)
    buffer.seek(0)
    binary_data = buffer.read()
    base64_image_grid_thw = base64.b64encode(binary_data).decode("utf-8")

    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_embeds",
                    "image_embeds": {
                        "image_embeds": f"{base64_image_embedding}" ,
                        "image_grid_thw": f"{base64_image_grid_thw}"  
                    },
                },
                {"type": "text", "text": prompt},
            ],
        }
    ]

    response = client.chat.completions.create(
        messages=messages,
        model=model,
    )
    print(response)
    

if __name__ == "__main__":
    inference_with_vllm_embeds()

Can you try concatenating the embeddings (but not the thw tensors) to make (440, 8192)? This model combines embeddings from different items by concatenation instead of stacking.
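Concretely, using the dummy shapes from the code above, the suggested packing would look like this (a sketch, assuming the 220-token, 8192-dim embeddings from that snippet):

```python
import torch

# Dummy per-image embeddings matching the shapes above (220 tokens, hidden size 8192)
img1 = torch.zeros(220, 8192)
img2 = torch.zeros(220, 8192)

# Concatenate along the token dimension instead of stacking -> (440, 8192)
image_embeds = torch.cat([img1, img2], dim=0)

# Keep one grid row per image -> (2, 3)
image_grid_thw = torch.tensor([[1, 22, 40], [1, 22, 40]])

print(image_embeds.shape)    # torch.Size([440, 8192])
print(image_grid_thw.shape)  # torch.Size([2, 3])
```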

I tried concatenating the embeddings exactly as you suggested but the result is still the same:

openai.InternalServerError: Error code: 500 - {'error': {'message': 'list index out of range', 'type': 'Internal Server Error', 'param': None, 'code': 500}}

At this point, is there any official documentation or a working example that shows the correct way to pass multiple image embedding inputs with vLLM?

I don't think so, at least not for this model. Can you open an issue on GitHub so we can track this?

OK, thank you. Let's move to GitHub: [Usage]: Unable to pass precomputed image embeddings to vLLM with Qwen3-VL · Issue #32309 · vllm-project/vllm · GitHub