The Qwen/Qwen2.5-VL-7B-Instruct-AWQ model uses AWQ 4-bit quantization, which vLLM supports for efficient inference. For a fine-tune-then-quantize workflow, it's generally recommended to fine-tune the model in full precision (FP16/BF16) and then quantize the resulting checkpoint, since this gives the best accuracy. LLaMA-Factory supports LoRA/QLoRA fine-tuning; for the quantization step, use a tool such as llm-compressor or AutoAWQ after fine-tuning (and, if you trained a LoRA adapter, after merging it into the base weights), as described in the vLLM quantization docs.
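As a rough illustration of the two ends of that workflow, here is a minimal sketch, not a definitive recipe. It assumes a LoRA adapter was trained against the full-precision Qwen/Qwen2.5-VL-7B-Instruct base and that `peft`, `transformers`, and `vllm` are installed. The function names, directory arguments, and the `max_model_len` value are hypothetical choices made for this example; the quantization step itself is left to llm-compressor or AutoAWQ and is not shown.

```python
def merge_lora_adapter(base_id: str, adapter_dir: str, out_dir: str) -> None:
    """Fold a LoRA adapter into the BF16 base weights so the result can be quantized.

    Paths are placeholders; heavy imports are kept inside the function.
    """
    import torch
    from peft import PeftModel
    from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

    base = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        base_id, torch_dtype=torch.bfloat16
    )
    # merge_and_unload() returns the base model with the adapter deltas applied.
    merged = PeftModel.from_pretrained(base, adapter_dir).merge_and_unload()
    merged.save_pretrained(out_dir)
    # Save the processor alongside so the checkpoint directory is self-contained.
    AutoProcessor.from_pretrained(base_id).save_pretrained(out_dir)


def awq_engine_kwargs(model_path: str, max_model_len: int = 8192) -> dict:
    """Keyword arguments for vllm.LLM when loading an AWQ checkpoint.

    max_model_len=8192 is an arbitrary example value, not a recommendation.
    """
    return {
        "model": model_path,
        "quantization": "awq",
        "dtype": "float16",
        "max_model_len": max_model_len,
    }


def run_demo(model_path: str = "Qwen/Qwen2.5-VL-7B-Instruct-AWQ") -> str:
    """Load an AWQ checkpoint in vLLM and generate from a text-only prompt."""
    from vllm import LLM, SamplingParams

    llm = LLM(**awq_engine_kwargs(model_path))
    outputs = llm.generate(
        ["Describe AWQ quantization in one sentence."],
        SamplingParams(max_tokens=64),
    )
    return outputs[0].outputs[0].text
```

On a GPU machine, `run_demo()` should load the published AWQ checkpoint; after fine-tuning, the same call can point at the directory produced by quantizing the merged model. vLLM can usually detect the quantization method from the checkpoint config, so passing `quantization="awq"` explicitly is optional.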
Would you like step-by-step guidance on the fine-tune-then-quantize workflow for Qwen2.5-VL-7B?
Sources: