If possible, you could also try resizing your images so they are smaller. This may not be feasible for OCR, but Qwen VL uses more prompt tokens for larger images, so a smaller image can have a noticeable impact.
Yes, resizing images to smaller dimensions can significantly reduce the number of vision tokens Qwen VL uses, improving both speed and memory usage. This is especially impactful for OCR tasks, as larger images generate more tokens and increase inference time (performance discussion, docs). You can control this in vLLM with the mm_processor_kwargs parameter, e.g. mm_processor_kwargs={"max_pixels": 768*768}.
Would you like a code example for setting image size limits in vLLM?
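For instance, here is a minimal sketch of capping image size at model load time; max_pixels comes from the setting mentioned above, while the checkpoint name is an illustrative assumption:

```python
from vllm import LLM

# Cap how many pixels the Qwen VL image processor keeps per image.
# Larger images are downscaled first, so each request produces fewer
# vision tokens, which reduces prompt length and inference time.
llm = LLM(
    model="Qwen/Qwen2.5-VL-7B-Instruct",            # assumed checkpoint name
    mm_processor_kwargs={"max_pixels": 768 * 768},  # limit from the snippet above
)
```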
Question: What's Qwen2.5-VL-7B's throughput on 1x H100 in the batch_size = 1 setup?
It seems likely that Qwen2.5-VL-7B achieves around 130–200 output tokens/s on a single H100 GPU with batch_size=1, based on recent vLLM benchmarks (benchmark example). Actual throughput may vary depending on prompt length, image size, and model settings.
Would you like more detailed benchmark numbers or configuration advice?
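If you want to verify on your own hardware, a rough sketch of a batch_size = 1 throughput measurement with vLLM is below; the model name, image path, and chat template are assumptions for illustration:

```python
import time
from PIL import Image
from vllm import LLM, SamplingParams

# Assumed model and test image; swap in your own checkpoint and input.
llm = LLM(model="Qwen/Qwen2.5-VL-7B-Instruct", max_model_len=8192)
image = Image.open("sample.png")

# Assumed Qwen VL chat template with an image placeholder.
prompt = (
    "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
    "Describe this image.<|im_end|>\n<|im_start|>assistant\n"
)
params = SamplingParams(max_tokens=512, temperature=0.0)

# Single request (batch_size = 1), timed end to end.
start = time.perf_counter()
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    params,
)
elapsed = time.perf_counter() - start

n_out = len(outputs[0].outputs[0].token_ids)
print(f"{n_out} output tokens in {elapsed:.2f}s -> {n_out / elapsed:.1f} tok/s")
```

Note that this measures end-to-end decode throughput for one request, so prompt length and image resolution will shift the numbers.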