Speeding up vllm inference for Qwen2.5-VL

sickerin · May 30, 2025, 9:03am

Will vllm work on a machine with Cuda 12.2 (driver 5.3.5) with 6 Tesla T4 GPU

vllm serve Qwen/Qwen2.5-VL-7B-Instruct --tensor-parallel-size 4 --enable-chunked-prefill --enable-prefix-caching --host 0.0.0.0 --port 6161 --dtype half

Topic		Replies	Views
Multimodal inference guideline? General	59	2675	August 6, 2025
Why is inference for Qwen 2.5 VL so slow when we send an image? General	5	1685	August 20, 2025
what is the recommended method to start up the vllm server engine for inferencing for InternVL3_5-8B, getting 2 qps? General	1	123	September 14, 2025
Problem with Qwen2.5-VL-7b General	3	1367	July 9, 2025
Qwen 2.5 VL for videos General	1	981	August 26, 2025

Speeding up vllm inference for Qwen2.5-VL

Related topics