I wanted to host Qwen 2.5 VL, but the inference speed is much lower than when using the text modality alone.
Inference for Qwen 2.5 VL (Vision-Language) models is significantly slower than for text-only models, especially when processing images or videos. This is expected, given the additional overhead of multi-modal input processing and the larger input embeddings. For example, users have reported that Qwen2.5-VL models are much slower than their text-only counterparts, with time to first token rising and average tokens per second dropping substantially once vision inputs are included. Performance degrades further with more images or higher image resolutions in the prompt, and multi-modal preprocessing is currently performed on the CPU, which adds latency (source, source).
To mitigate this, you can reduce the number or size of images per prompt (via limit_mm_per_prompt and mm_processor_kwargs), lower max_model_len, or use a quantized model. Even with these optimizations, vision-language models will remain slower than text-only models because of the inherent cost of multi-modal processing (source, source). A minimal configuration sketch is shown below; would you like more detail on any of these options?
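Here is a minimal sketch of those knobs, assuming offline use of vLLM's `LLM` class; the model name, context length, and pixel budgets are illustrative values to tune for your workload, not recommendations:

```python
# Sketch: capping per-request multi-modal cost in vLLM (illustrative values).
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    max_model_len=8192,                # smaller KV cache and cheaper prefill
    limit_mm_per_prompt={"image": 1},  # allow at most one image per prompt
    mm_processor_kwargs={
        # The Qwen2.5-VL processor accepts a pixel budget; lowering max_pixels
        # reduces the number of vision tokens produced per image.
        "min_pixels": 256 * 28 * 28,
        "max_pixels": 1280 * 28 * 28,
    },
)
sampling = SamplingParams(temperature=0.0, max_tokens=256)
```

The same arguments can be passed as CLI flags when launching the OpenAI-compatible server instead of using the offline API.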
Could you expand a bit more about “inference speed” here?
Image + text inference should generally be slower than text-only inference in terms of time-to-first-token due to the additional image processing and encoding in prefilling the request, but decoding performance should be identical.
Yes, “inference speed” in this context refers to both the time-to-first-token (TTFT) and the average tokens generated per second. For Qwen 2.5 VL and similar vision-language models, TTFT is significantly higher for image+text inputs because image preprocessing and encoding are performed before the model can start generating output. This preprocessing is often CPU-bound and can add several seconds of latency (source, source). However, once the initial prefill is complete and decoding begins, the per-token generation speed (decode throughput) is generally similar to text-only inference, since the decoding phase is dominated by the language model’s GPU computation (source).
So, the main slowdown for multi-modal inference is in the prefill (prompt processing) stage, not in the token-by-token decoding. For short outputs, overall latency is therefore much higher for image+text, while for long generations the gap narrows as decoding dominates total time. A rough way to measure the split yourself is sketched below; would you like a deeper breakdown of profiling or optimization tips?
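One rough way to see the prefill/decode split, assuming the model is served behind vLLM's OpenAI-compatible endpoint: time the streamed response and treat everything before the first content chunk as prefill. The URL, model name, and the approximation of one token per streamed chunk are simplifying assumptions.

```python
# Sketch: separating time-to-first-token from decode throughput via streaming.
# Assumes an OpenAI-compatible vLLM server; URL and model name are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def measure(messages, model="Qwen/Qwen2.5-VL-7B-Instruct", max_tokens=256):
    start = time.perf_counter()
    first = None
    chunks = 0
    stream = client.chat.completions.create(
        model=model, messages=messages, max_tokens=max_tokens, stream=True
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first is None:
                first = time.perf_counter()  # prefill roughly ends here
            chunks += 1                      # ~1 token per content chunk
    end = time.perf_counter()
    if first is None:
        return float("nan"), float("nan")
    ttft = first - start
    decode_tps = (chunks - 1) / (end - first) if chunks > 1 else float("nan")
    return ttft, decode_tps

# Run once with a text-only prompt and once with the same prompt plus an image
# attached; the TTFT gap should account for most of the difference.
print(measure([{"role": "user", "content": "Describe a rainy day."}]))
```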
For my fixed setup, I am getting 1.5 rps with a p95 of 10 s on a g5 instance with text-only inference, whereas I am only getting about 0.25 rps with a p95 of 10 s on the same g5 instance with an image included.
I understand that the token count is higher when an image is included (for me it's around 2000 tokens more than text only), but that still doesn't justify the observed drop in throughput.
Your observed drop in throughput (from 1.5 rps to 0.25 rps) when including images is consistent with known bottlenecks in Qwen2.5-VL and similar vision-language models. The main cause is slow image preprocessing and encoding, which is CPU-bound and not efficiently parallelized, so request throughput drops sharply even while GPU utilization stays low (source, source). The increase in token count alone does not account for a drop of this size; the bottleneck is the time spent in the prefill stage, especially for image inputs.
This is a common limitation: even with batching or multiple API workers, the image preprocessing step remains a serial bottleneck. Using a faster image processor (if one is available), reducing image size, or resizing and compressing images client-side before sending them can all help; a sketch of the client-side approach is shown below. That said, a fundamental throughput gap between text-only and image+text inference is expected with current vLLM and Qwen2.5-VL implementations (source). Would you like more detail or profiling suggestions?
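If you want to try the client-side route, a sketch along these lines may help; the helper name, size cap, and JPEG quality are arbitrary example choices, not a vLLM API:

```python
# Sketch: shrink and re-encode an image on the client before sending it,
# so the server-side (CPU-bound) preprocessing has fewer pixels to handle.
import base64
import io

from PIL import Image

def encode_image_for_prompt(path: str, max_side: int = 1024) -> str:
    img = Image.open(path).convert("RGB")
    scale = max_side / max(img.size)
    if scale < 1.0:
        # Downscale so the longest side is at most max_side pixels.
        img = img.resize((round(img.width * scale), round(img.height * scale)))
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=85)
    return "data:image/jpeg;base64," + base64.b64encode(buf.getvalue()).decode()

# Used as an image_url entry in an OpenAI-style chat message:
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe this image."},
        {"type": "image_url",
         "image_url": {"url": encode_image_for_prompt("photo.jpg")}},
    ],
}
```

Smaller images also mean fewer vision tokens after the Qwen2.5-VL processor's patching step, which shortens the prefill itself in addition to the CPU preprocessing.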