Issue: Unable to pass precomputed image embeddings to vLLM

Hi, I’m trying to use precomputed image embeddings with vLLM following the documentation here: https://docs.vllm.ai/en/stable/features/multimodal_inputs/#image-embedding-inputs.
According to the docs, vLLM accepts:

  • A single embedding as a 3D tensor of shape (1, feature_size, hidden_size)

  • Or multiple embeddings as a list of 2D tensors, each (feature_size, hidden_size)
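For reference, the two documented input formats can be sketched like this (shapes are illustrative; I'm using feature_size=416 and hidden_size=1024 to match my setup):

```python
import torch

# Single image: one 3-D tensor of shape (1, feature_size, hidden_size)
single_image = torch.randn(1, 416, 1024)

# Multiple images: a list of 2-D tensors, each (feature_size, hidden_size)
multiple_images = [torch.randn(416, 1024), torch.randn(416, 1024)]

print(single_image.shape)         # torch.Size([1, 416, 1024])
print(multiple_images[0].shape)   # torch.Size([416, 1024])
```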

However, when I send these embeddings (even when stacked to (2, 416, 1024) for two images), vLLM throws:

ValueError: image_embeds has rank 3 but expected 2

This is my code and how I start the server:

import torch
import base64
import io
from openai import OpenAI
from vllm.utils.serial_utils import tensor2base64

def run_multi_image_inference():
    client = OpenAI(
        base_url="http://localhost:8001/v1",
        api_key="EMPTY",
    )

    # 1. Prepare your multimodal data
    # Real image embeddings from the Qwen-VL vision encoder usually have 416 tokens
    img1_embed = torch.randn(416, 1024)
    img1_thw = torch.tensor([1, 32, 52])

    img2_embed = torch.randn(416, 1024)
    img2_thw = torch.tensor([1, 32, 52])

    # 2. Build the messages list
    # Note: interleave the image_embeds blocks with the text prompt
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_embeds",
                    "image_embeds": {
                        "image_embeds": tensor2base64(img1_embed),
                        "image_grid_thw": tensor2base64(img1_thw),
                    },
                },
                {
                    "type": "image_embeds",
                    "image_embeds": {
                        "image_embeds": tensor2base64(img2_embed),
                        "image_grid_thw": tensor2base64(img2_thw),
                    },
                },
                {
                    "type": "text",
                    "text": "Compare these two images and tell me the differences.",
                },
            ],
        }
    ]

    # 3. Request completion
    try:
        response = client.chat.completions.create(
            model="Qwen/Qwen3-VL-2B-Instruct",  # use the exact name from your server
            messages=messages,
            max_tokens=128,
            temperature=0.2,
        )
        print("\n--- Model Response ---\n")
        print(response.choices[0].message.content)
    except Exception as e:
        print(f"Inference failed: {e}")


if __name__ == "__main__":
    run_multi_image_inference()

vllm serve Qwen/Qwen3-VL-2B-Instruct --port 8001 --enable-mm-embeds --limit-mm-per-prompt '{"image": 2}' --max-model-len 8192 --trust-remote-code

Is there a mismatch between:

  • The documentation
    vs

  • The actual implementation of the multimodal embedding parser?

Or is there an additional constraint on how multi-image embeddings must be packaged that isn’t documented?

Yes, there is a subtle but important constraint: for multi-image embedding input via the OpenAI API, vLLM currently supports only one "image_embeds" block per request. Although the documentation suggests that multiple content parts can use {"type": "image_embeds"}, in practice the embeddings for multiple images must be packed together into a single tensor or dict and sent as one "image_embeds" block, not as separate blocks in the message content. If you send multiple "image_embeds" blocks, vLLM will not merge them and will raise a shape or type error.

To pass multiple image embeddings, stack them into a single tensor of shape (num_images, feature_size, hidden_size) and send them as a single "image_embeds" block, e.g.:

embeds = {
    "type": "image_embeds",
    "image_embeds": {
        "image_embeds": tensor2base64(torch.stack([img1_embed, img2_embed])),
        "image_grid_thw": tensor2base64(torch.stack([img1_thw, img2_thw])),
    }
}
# Then use only one such block in your message content.

This matches the expected input for Qwen2/3-VL and avoids the ValueError. Would you like a step-by-step code example for this?

Hint: Mention @RunLLM in the post for followups.

@RunLLM Thank you — that constraint makes sense. However, I have already tried stacking all image embeddings into a single image_embeds block (as a (num_images, F, H) tensor for image_embeds and (num_images, 3) for image_grid_thw), and I still get an internal error from vLLM:

Inference failed: Error code: 500 - {'error': {'message': 'list index out of range', 'type': 'Internal Server Error', 'param': None, 'code': 500}}

You can try this code: [Bug]: When using image_embeds, ImageProcessorItems are used instead of ImageEmbeddingItems, causing an out-of-bounds array error. · Issue #31708 · vllm-project/vllm · GitHub

@DarkLight1337 I tried the exact sample code from this comment but I still hit a fatal error inside vLLM when running Qwen3-VL-2B-Instruct with image embeddings.

RuntimeError: split_with_sizes expects split_sizes to sum exactly to 1024
(input tensor's size at dimension -1), but got split_sizes=[2048, 6144]

(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.13.0) with config: model=‘Qwen/Qwen3-VL-2B-Instruct’, speculative_config=None, tokenizer=‘Qwen/Qwen3-VL-2B-Instruct’, skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend=‘auto’, disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=‘’, reasoning_parser_plugin=‘’, enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False), seed=0, served_model_name=Qwen/Qwen3-VL-2B-Instruct, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={‘level’: None, ‘mode’: <CompilationMode.VLLM_COMPILE: 3>, ‘debug_dump_path’: None, ‘cache_dir’: ‘/home/hoangtd/.cache/vllm/torch_compile_cache/99725bf3c4’, ‘compile_cache_save_format’: ‘binary’, ‘backend’: ‘inductor’, ‘custom_ops’: [‘none’], ‘splitting_ops’: [‘vllm::unified_attention’, ‘vllm::unified_attention_with_output’, ‘vllm::unified_mla_attention’, ‘vllm::unified_mla_attention_with_output’, ‘vllm::mamba_mixer2’, ‘vllm::mamba_mixer’, ‘vllm::short_conv’, ‘vllm::linear_attention’, ‘vllm::plamo2_mamba_mixer’, ‘vllm::gdn_attention_core’, ‘vllm::kda_attention’, ‘vllm::sparse_attn_indexer’], ‘compile_mm_encoder’: False, ‘compile_sizes’: , ‘compile_ranges_split_points’: [2048], ‘inductor_compile_config’: {‘enable_auto_functionalized_v2’: False, 
‘combo_kernels’: True, ‘benchmark_combo_kernel’: True}, ‘inductor_passes’: {}, ‘cudagraph_mode’: <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, ‘cudagraph_num_of_warmups’: 1, ‘cudagraph_capture_sizes’: [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], ‘cudagraph_copy_inputs’: False, ‘cudagraph_specialize_lora’: True, ‘use_inductor_graph_partition’: False, ‘pass_config’: {‘fuse_norm_quant’: False, ‘fuse_act_quant’: False, ‘fuse_attn_quant’: False, ‘eliminate_noops’: True, ‘enable_sp’: False, ‘fuse_gemm_comms’: False, ‘fuse_allreduce_rms’: False}, ‘max_cudagraph_capture_size’: 512, ‘dynamic_shapes_config’: {‘type’: <DynamicShapesType.BACKED: ‘backed’>, ‘evaluate_guards’: False}, ‘local_cache_dir’: ‘/home/hoangtd/.cache/vllm/torch_compile_cache/99725bf3c4/rank_0_0/backbone’},
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=chatcmpl-9973e15133eb53c3,prompt_token_ids_len=428,mm_features=[MultiModalFeatureSpec(data={‘image_grid_thw’: MultiModalFieldElem(modality=‘image’, key=‘image_grid_thw’, data=tensor([ 1, 32, 52]), field=MultiModalBatchedField(keep_on_cpu=True)), ‘image_embeds’: MultiModalFieldElem(modality=‘image’, key=‘image_embeds’, data=tensor([[0., 0., 0., …, 0., 0., 0.],
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [dump_input.py:79] [0., 0., 0., …, 0., 0., 0.],
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [dump_input.py:79] [0., 0., 0., …, 0., 0., 0.],
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [dump_input.py:79] …,
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [dump_input.py:79] [0., 0., 0., …, 0., 0., 0.],
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [dump_input.py:79] [0., 0., 0., …, 0., 0., 0.],
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [dump_input.py:79] [0., 0., 0., …, 0., 0., 0.]]), field=MultiModalFlatField(keep_on_cpu=False, slices=[[slice(0, 1664, None)]], dim=0))}, modality=‘image’, identifier=‘422d83049ac831542d162035c97b33b27331b51dd9c66cc31d8ad99a848f2f62’, mm_position=PlaceholderRange(offset=4, length=416, is_embed=None))],sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, seed=None, stop=, stop_token_ids=[151643], bad_words=, include_stop_str_in_output=False, ignore_eos=False, max_tokens=8179, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, structured_outputs=None, extra_args=None),block_ids=([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27],),num_computed_tokens=0,lora_request=None,prompt_embeds_shape=None)], scheduled_cached_reqs=CachedRequestData(req_ids=, resumed_req_ids=, new_token_ids=, all_token_ids={}, new_block_ids=, num_computed_tokens=, num_output_tokens=), num_scheduled_tokens={chatcmpl-9973e15133eb53c3: 428}, total_num_scheduled_tokens=428, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={chatcmpl-9973e15133eb53c3: [0]}, num_common_prefix_blocks=[27], finished_req_ids=, free_encoder_mm_hashes=, preempted_req_ids=, pending_structured_output_tokens=false, kv_connector_metadata=null, ec_connector_metadata=null)
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.0030477480528275924, prefix_cache_stats=PrefixCacheStats(reset=False, requests=1, queries=428, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, kv_cache_eviction_events=, spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None)
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] Traceback (most recent call last):
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] File “/home/hoangtd/.dev/lib/python3.10/site-packages/vllm/v1/engine/core.py”, line 859, in run_engine_core
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] engine_core.run_busy_loop()
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] File “/home/hoangtd/.dev/lib/python3.10/site-packages/vllm/v1/engine/core.py”, line 886, in run_busy_loop
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] self._process_engine_step()
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] File “/home/hoangtd/.dev/lib/python3.10/site-packages/vllm/v1/engine/core.py”, line 919, in _process_engine_step
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] File “/home/hoangtd/.dev/lib/python3.10/site-packages/vllm/v1/engine/core.py”, line 351, in step
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] model_output = future.result()
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] File “/usr/local/lib/python3.10/concurrent/futures/_base.py”, line 451, in result
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] return self.__get_result()
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] File “/usr/local/lib/python3.10/concurrent/futures/_base.py”, line 403, in __get_result
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] raise self._exception
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] File “/home/hoangtd/.dev/lib/python3.10/site-packages/vllm/v1/executor/uniproc_executor.py”, line 79, in collective_rpc
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] File “/home/hoangtd/.dev/lib/python3.10/site-packages/vllm/v1/serial_utils.py”, line 461, in run_method
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] return func(*args, **kwargs)
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] File “/home/hoangtd/.dev/lib/python3.10/site-packages/vllm/v1/worker/worker_base.py”, line 369, in execute_model
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] return self.worker.execute_model(scheduler_output, *args, **kwargs)
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] File “/home/hoangtd/.dev/lib/python3.10/site-packages/torch/utils/_contextlib.py”, line 120, in decorate_context
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] return func(*args, **kwargs)
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] File “/home/hoangtd/.dev/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py”, line 623, in execute_model
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] output = self.model_runner.execute_model(
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] File “/home/hoangtd/.dev/lib/python3.10/site-packages/torch/utils/_contextlib.py”, line 120, in decorate_context
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] return func(*args, **kwargs)
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] File “/home/hoangtd/.dev/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py”, line 3060, in execute_model
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] ) = self._preprocess(
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] File “/home/hoangtd/.dev/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py”, line 2471, in _preprocess
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] inputs_embeds_scheduled = self.model.embed_input_ids(
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] File “/home/hoangtd/.dev/lib/python3.10/site-packages/vllm/model_executor/models/qwen3_vl.py”, line 1992, in embed_input_ids
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] ) = self._compute_deepstack_embeds(
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] File “/home/hoangtd/.dev/lib/python3.10/site-packages/vllm/model_executor/models/qwen3_vl.py”, line 1939, in _compute_deepstack_embeds
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] ) = torch.split(
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] File “/home/hoangtd/.dev/lib/python3.10/site-packages/torch/functional.py”, line 173, in split
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] return tensor.split(split_size_or_sections, dim)
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] File “/home/hoangtd/.dev/lib/python3.10/site-packages/torch/_tensor.py”, line 1030, in split
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] return torch._VF.split_with_sizes(self, split_size, dim)
(EngineCore_DP0 pid=2053438) ERROR 01-12 16:57:49 [core.py:868] RuntimeError: split_with_sizes expects split_sizes to sum exactly to 1024 (input tensor’s size at dimension -1), but got split_sizes=[2048, 6144]
(EngineCore_DP0 pid=2053438) Process EngineCore_DP0:
(APIServer pid=2053237) ERROR 01-12 16:57:49 [async_llm.py:538] AsyncLLM output_handler failed.
(APIServer pid=2053237) ERROR 01-12 16:57:49 [async_llm.py:538] Traceback (most recent call last):
(APIServer pid=2053237) ERROR 01-12 16:57:49 [async_llm.py:538] File “/home/hoangtd/.dev/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py”, line 490, in output_handler
(APIServer pid=2053237) ERROR 01-12 16:57:49 [async_llm.py:538] outputs = await engine_core.get_output_async()
(APIServer pid=2053237) ERROR 01-12 16:57:49 [async_llm.py:538] File “/home/hoangtd/.dev/lib/python3.10/site-packages/vllm/v1/engine/core_client.py”, line 895, in get_output_async
(APIServer pid=2053237) ERROR 01-12 16:57:49 [async_llm.py:538] raise self._format_exception(outputs) from None
(APIServer pid=2053237) ERROR 01-12 16:57:49 [async_llm.py:538] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=2053237) INFO: 127.0.0.1:14270 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error

Likely the shape of the embeddings is different for this model. I only tested that code on Qwen2.5-VL.

Thanks — I’ve updated my code to match Qwen3-VL’s embedding structure, and it now works correctly for a single image embedding. So the embedding format itself is not the issue.
However, the problem appears when using multiple image embeddings. I tested both formats: (1) the structure shown in the documentation, which fails with a 400 error: size_per_item should be a 1-D tensor, but found shape: torch.Size([2, 1]); and (2) stacking multiple images with torch.stack(), which avoids the 400 error but causes vLLM to crash with a 500 internal error: list index out of range. Since both approaches fail while single-image input works, it seems vLLM's Qwen3-VL path does not yet handle multiple image embeddings properly.

Can you show your new code?

Here is my latest code

import torch
from openai import OpenAI
import base64
import io

def inference_with_vllm_embeds():
    client = OpenAI(
        base_url="http://localhost:8001/v1",
        api_key="EMPTY",
    )

    model = client.models.list().data[0].id

    prompt = "OCR:"

    # Two identical dummy image embeddings, stacked then flattened to (440, 8192)
    image_embedding = torch.zeros((220, 8192))
    two_image_embeddings = torch.stack([image_embedding, image_embedding])
    two_image_embeddings = two_image_embeddings.view(-1, 8192)

    # Serialize the tensor with torch.save and base64-encode the bytes
    buffer = io.BytesIO()
    torch.save(two_image_embeddings, buffer)
    buffer.seek(0)
    binary_data = buffer.read()
    base64_image_embedding = base64.b64encode(binary_data).decode("utf-8")

    # One (1, 22, 40) grid per image, flattened to shape (2, 3)
    thw_embedding = torch.tensor([[1, 22, 40]])
    two_thw_embeddings = torch.stack([thw_embedding, thw_embedding])
    two_thw_embeddings = two_thw_embeddings.view(-1, 3)

    buffer = io.BytesIO()
    torch.save(two_thw_embeddings, buffer)
    buffer.seek(0)
    binary_data = buffer.read()
    base64_image_grid_thw = base64.b64encode(binary_data).decode("utf-8")

    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_embeds",
                    "image_embeds": {
                        "image_embeds": f"{base64_image_embedding}" ,
                        "image_grid_thw": f"{base64_image_grid_thw}"  
                    },
                },
                {"type": "text", "text": prompt},
            ],
        }
    ]

    response = client.chat.completions.create(
        messages=messages,
        model=model,
    )
    print(response)
    

if __name__ == "__main__":
    inference_with_vllm_embeds()

Can you try concatenating the embeddings (but not the thw tensors) to make (440, 8192)? This model combines embeddings from different items by concatenation instead of stacking.
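Concretely, using the dummy shapes from the code above, the suggested packing would look like this (a sketch, assuming the 220-token, 8192-dim embeddings from that snippet):

```python
import torch

# Dummy per-image embeddings matching the shapes above (220 tokens, hidden size 8192)
img1 = torch.zeros(220, 8192)
img2 = torch.zeros(220, 8192)

# Concatenate along the token dimension instead of stacking -> (440, 8192)
image_embeds = torch.cat([img1, img2], dim=0)

# Keep one grid row per image -> (2, 3)
image_grid_thw = torch.tensor([[1, 22, 40], [1, 22, 40]])

print(image_embeds.shape)    # torch.Size([440, 8192])
print(image_grid_thw.shape)  # torch.Size([2, 3])
```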

I tried concatenating the embeddings exactly as you suggested but the result is still the same:

openai.InternalServerError: Error code: 500 - {'error': {'message': 'list index out of range', 'type': 'Internal Server Error', 'param': None, 'code': 500}}

At this point, is there any official documentation or a working example that shows the correct way to pass multiple image embedding inputs with vLLM?

I don't think so, at least not for this model. Can you open an issue on GitHub so we can track this?

OK, thank you. Let's move to GitHub: [Usage]: Unable to pass precomputed image embeddings to vLLM with Qwen3-VL · Issue #32309 · vllm-project/vllm · GitHub