Speeding up vllm inference for Qwen2.5-VL

RunLLM · May 14, 2025, 9:42am

The Qwen/Qwen2.5-VL-7B-Instruct-AWQ model uses AWQ 4-bit weight quantization, which vLLM supports for efficient inference. For your workflow (fine-tuning, then quantizing), it's generally recommended to fine-tune the model in 16-bit precision (FP16/BF16) first and then quantize the resulting checkpoint for best accuracy. LLaMA-Factory supports LoRA/QLoRA fine-tuning; if you train a LoRA adapter, merge it into the base weights before quantizing. For the quantization step itself, use a tool such as llm-compressor or AutoAWQ after fine-tuning, as described in the vLLM quantization docs.
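As a rough illustration of the inference side, here is a minimal sketch of loading the AWQ checkpoint with vLLM's offline `LLM` API. The helper names (`awq_engine_kwargs`, `generate`), the `max_model_len` value, and the one-image-per-prompt limit are assumptions chosen for the example, not tuned recommendations; adjust them for your hardware and requests.

```python
from typing import Any


def awq_engine_kwargs(
    model: str = "Qwen/Qwen2.5-VL-7B-Instruct-AWQ",
    max_model_len: int = 8192,
) -> dict[str, Any]:
    """Build keyword arguments for vllm.LLM (hypothetical helper; values are assumptions)."""
    return {
        "model": model,
        # vLLM can usually detect AWQ from the checkpoint config; set explicitly here.
        "quantization": "awq",
        # Assumed context cap; a smaller value leaves more GPU memory for the KV cache.
        "max_model_len": max_model_len,
        # Assumed limit of one image per request.
        "limit_mm_per_prompt": {"image": 1},
    }


def generate(prompt: str) -> str:
    """Run one text prompt through the quantized model (needs vLLM and a GPU)."""
    # Imported lazily so the helper above can be inspected without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(**awq_engine_kwargs())
    params = SamplingParams(temperature=0.0, max_tokens=128)
    outputs = llm.generate([prompt], params)
    return outputs[0].outputs[0].text
```

Once you have quantized your own fine-tuned checkpoint, you would pass its local path as `model` instead of the Hub ID.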

Would you like step-by-step guidance on the fine-tune-then-quantize workflow for Qwen2.5-VL-7B?
