If possible, you could also try resizing your images so they are smaller. This may not be feasible for OCR, but Qwen VL uses more prompt tokens for larger images, so a smaller image can have a noticeable impact.
Yes, resizing images to smaller dimensions can significantly reduce the number of vision tokens Qwen VL uses, improving both speed and memory usage. This is especially impactful for OCR tasks, as larger images generate more tokens and increase inference time (performance discussion, docs). You can control this in vLLM with the mm_processor_kwargs parameter, e.g. mm_processor_kwargs={"max_pixels": 768*768}.
Would you like a code example for setting image size limits in vLLM?
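For instance, here is a minimal sketch of capping image size at model load time; max_pixels comes from the setting mentioned above, while the checkpoint name is an illustrative assumption:

```python
from vllm import LLM

# Cap how many pixels the Qwen VL image processor keeps per image.
# Larger images are downscaled first, so each request produces fewer
# vision tokens, which reduces prompt length and inference time.
llm = LLM(
    model="Qwen/Qwen2.5-VL-7B-Instruct",            # assumed checkpoint name
    mm_processor_kwargs={"max_pixels": 768 * 768},  # limit from the snippet above
)
```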
Question: What's Qwen2.5-VL-7B's throughput on 1x H100 in the batch_size = 1 setup?
It seems likely that Qwen2.5-VL-7B achieves around 130–200 output tokens/s on a single H100 GPU with batch_size=1, based on recent vLLM benchmarks (benchmark example). Actual throughput may vary depending on prompt length, image size, and model settings.
Would you like more detailed benchmark numbers or configuration advice?
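If you want to verify on your own hardware, a rough sketch of a batch_size = 1 throughput measurement with vLLM is below; the model name, image path, and chat template are assumptions for illustration:

```python
import time
from PIL import Image
from vllm import LLM, SamplingParams

# Assumed model and test image; swap in your own checkpoint and input.
llm = LLM(model="Qwen/Qwen2.5-VL-7B-Instruct", max_model_len=8192)
image = Image.open("sample.png")

# Assumed Qwen VL chat template with an image placeholder.
prompt = (
    "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
    "Describe this image.<|im_end|>\n<|im_start|>assistant\n"
)
params = SamplingParams(max_tokens=512, temperature=0.0)

# Single request (batch_size = 1), timed end to end.
start = time.perf_counter()
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    params,
)
elapsed = time.perf_counter() - start

n_out = len(outputs[0].outputs[0].token_ids)
print(f"{n_out} output tokens in {elapsed:.2f}s -> {n_out / elapsed:.1f} tok/s")
```

Note that this measures end-to-end decode throughput for one request, so prompt length and image resolution will shift the numbers.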