I wanted to host Qwen 2.5 VL, but the inference speed is much lower than when using the text modality alone.
Inference for Qwen 2.5 VL (Vision-Language) models is significantly slower than for text-only models, especially when processing images or videos. This is expected, given the additional overhead of multi-modal input processing and the larger input embeddings. For example, users have reported that Qwen2.5-VL models are much slower than their text-only counterparts, with time to first token rising and average tokens per second dropping substantially once vision inputs are included. Performance degrades further with more images or higher image resolutions in the prompt, and multi-modal preprocessing is currently performed on the CPU, which adds latency (source, source).
To mitigate this, you can reduce the number or size of images per prompt (via limit_mm_per_prompt and mm_processor_kwargs), lower max_model_len, or use a quantized model. Even with these optimizations, vision-language models will remain slower than text-only models because of the inherent cost of multi-modal processing (source, source). A minimal configuration sketch is shown below; would you like more detail on any of these options?
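Here is a minimal sketch of those knobs, assuming offline use of vLLM's `LLM` class; the model name, context length, and pixel budgets are illustrative values to tune for your workload, not recommendations:

```python
# Sketch: capping per-request multi-modal cost in vLLM (illustrative values).
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    max_model_len=8192,                # smaller KV cache and cheaper prefill
    limit_mm_per_prompt={"image": 1},  # allow at most one image per prompt
    mm_processor_kwargs={
        # The Qwen2.5-VL processor accepts a pixel budget; lowering max_pixels
        # reduces the number of vision tokens produced per image.
        "min_pixels": 256 * 28 * 28,
        "max_pixels": 1280 * 28 * 28,
    },
)
sampling = SamplingParams(temperature=0.0, max_tokens=256)
```

The same arguments can be passed as CLI flags when launching the OpenAI-compatible server instead of using the offline API.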
Could you expand a bit more about “inference speed” here?
Image + text inference should generally be slower than text-only inference in terms of time-to-first-token due to the additional image processing and encoding in prefilling the request, but decoding performance should be identical.
Yes, “inference speed” in this context refers to both the time-to-first-token (TTFT) and the average tokens generated per second. For Qwen 2.5 VL and similar vision-language models, TTFT is significantly higher for image+text inputs because image preprocessing and encoding are performed before the model can start generating output. This preprocessing is often CPU-bound and can add several seconds of latency (source, source). However, once the initial prefill is complete and decoding begins, the per-token generation speed (decode throughput) is generally similar to text-only inference, since the decoding phase is dominated by the language model’s GPU computation (source).
So, the main slowdown for multi-modal inference is in the prefill (prompt processing) stage, not in the token-by-token decoding. For short outputs, overall latency is therefore much higher for image+text, while for long generations the gap narrows as decoding dominates total time. A rough way to measure the split yourself is sketched below; would you like a deeper breakdown of profiling or optimization tips?
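One rough way to see the prefill/decode split, assuming the model is served behind vLLM's OpenAI-compatible endpoint: time the streamed response and treat everything before the first content chunk as prefill. The URL, model name, and the approximation of one token per streamed chunk are simplifying assumptions.

```python
# Sketch: separating time-to-first-token from decode throughput via streaming.
# Assumes an OpenAI-compatible vLLM server; URL and model name are placeholders.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def measure(messages, model="Qwen/Qwen2.5-VL-7B-Instruct", max_tokens=256):
    start = time.perf_counter()
    first = None
    chunks = 0
    stream = client.chat.completions.create(
        model=model, messages=messages, max_tokens=max_tokens, stream=True
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first is None:
                first = time.perf_counter()  # prefill roughly ends here
            chunks += 1                      # ~1 token per content chunk
    end = time.perf_counter()
    if first is None:
        return float("nan"), float("nan")
    ttft = first - start
    decode_tps = (chunks - 1) / (end - first) if chunks > 1 else float("nan")
    return ttft, decode_tps

# Run once with a text-only prompt and once with the same prompt plus an image
# attached; the TTFT gap should account for most of the difference.
print(measure([{"role": "user", "content": "Describe a rainy day."}]))
```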
For my fixed setup, I am getting 1.5 rps with a p95 of 10 s on a g5 instance with text-only inference, whereas I am only getting about 0.25 rps with a p95 of 10 s on the same g5 instance with an image included.
I understand that the token count is higher when an image is included (for me it's around 2000 tokens more than text only), but that still doesn't justify the observed drop in throughput.
Your observed drop in throughput (from 1.5 rps to 0.25 rps) when including images is consistent with known bottlenecks in Qwen2.5-VL and similar vision-language models. The main cause is slow image preprocessing and encoding, which is CPU-bound and not efficiently parallelized, so request throughput drops sharply even while GPU utilization stays low (source, source). The increase in token count alone does not account for a drop of this size; the bottleneck is the time spent in the prefill stage, especially for image inputs.
This is a common limitation: even with batching or multiple API workers, the image preprocessing step remains a serial bottleneck. Using a faster image processor (if one is available), reducing image size, or resizing and compressing images client-side before sending them can all help; a sketch of the client-side approach is shown below. That said, a fundamental throughput gap between text-only and image+text inference is expected with current vLLM and Qwen2.5-VL implementations (source). Would you like more detail or profiling suggestions?
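If you want to try the client-side route, a sketch along these lines may help; the helper name, size cap, and JPEG quality are arbitrary example choices, not a vLLM API:

```python
# Sketch: shrink and re-encode an image on the client before sending it,
# so the server-side (CPU-bound) preprocessing has fewer pixels to handle.
import base64
import io

from PIL import Image

def encode_image_for_prompt(path: str, max_side: int = 1024) -> str:
    img = Image.open(path).convert("RGB")
    scale = max_side / max(img.size)
    if scale < 1.0:
        # Downscale so the longest side is at most max_side pixels.
        img = img.resize((round(img.width * scale), round(img.height * scale)))
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=85)
    return "data:image/jpeg;base64," + base64.b64encode(buf.getvalue()).decode()

# Used as an image_url entry in an OpenAI-style chat message:
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe this image."},
        {"type": "image_url",
         "image_url": {"url": encode_image_for_prompt("photo.jpg")}},
    ],
}
```

Smaller images also mean fewer vision tokens after the Qwen2.5-VL processor's patching step, which shortens the prefill itself in addition to the CPU preprocessing.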