No throughput improvement for quantized Qwen-2.5-7B-Instruct?

I performed LoRA-based fine-tuning on the Qwen-2.5-7B-Instruct model using A100 GPUs, and I'm also running inference on an A100. To improve tokens per second, I quantized the model with AWQ, but throughput is slightly lower than with the non-quantized model. Any suggestions?
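For reference, my comparison looks roughly like the sketch below, using vLLM's offline `LLM` API; the model paths and prompts are placeholders, not my exact benchmark.

```python
import time
from typing import Optional

from vllm import LLM, SamplingParams

# Placeholder paths for the merged LoRA checkpoint and its AWQ-quantized copy.
FP16_MODEL = "path/to/qwen2.5-7b-instruct-merged"
AWQ_MODEL = "path/to/qwen2.5-7b-instruct-awq"

# A large, uniform batch to keep the GPU busy during decoding.
prompts = ["Summarize the benefits of unit testing."] * 256
sampling_params = SamplingParams(temperature=0.0, max_tokens=256)


def measure_throughput(model_path: str, quantization: Optional[str] = None) -> float:
    """Return generated tokens per second for one batched generate() call."""
    llm = LLM(model=model_path, quantization=quantization)
    start = time.perf_counter()
    outputs = llm.generate(prompts, sampling_params)
    elapsed = time.perf_counter() - start
    generated = sum(len(out.outputs[0].token_ids) for out in outputs)
    return generated / elapsed


# In practice I run each model in a separate process so GPU memory is fully
# released between runs; they are shown back to back here only for brevity.
print("FP16 tok/s:", measure_throughput(FP16_MODEL))
print("AWQ  tok/s:", measure_throughput(AWQ_MODEL, quantization="awq"))
```

Both runs use the same prompts and greedy decoding so the generated token counts are comparable.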

Quantizing with AWQ typically reduces memory usage, but it does not necessarily improve throughput and can even reduce it, especially on A100 GPUs at large batch sizes. AWQ stores weights in 4-bit form and must dequantize them to FP16 before each matrix multiplication. At small batch sizes or with short prompts, decoding is memory-bandwidth-bound, so reading smaller weights is a win; at large batch sizes or with long contexts, the workload is already compute-bound, and the dequantization overhead outweighs the bandwidth savings, leaving throughput similar to or lower than the FP16 model. This is a known limitation and is discussed in detail in the vLLM community and documentation.

For higher throughput, consider INT8 (W8A8) quantization, which allows native INT8 computation on the tensor cores and is better optimized for throughput on A100 and newer GPUs. See the INT8 W8A8 documentation, the quantization performance discussions, the AWQ performance issues, and the quantization method comparison.
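If you want to try the W8A8 route, a rough sketch with llm-compressor looks like the following. Treat it as a starting point rather than a verified recipe: the import paths, calibration dataset, and modifier arguments vary between llm-compressor versions, and the model and output paths below are placeholders.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Import paths may differ between llm-compressor versions; check your install.
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

MODEL_ID = "path/to/qwen2.5-7b-instruct-merged"  # merged LoRA checkpoint (placeholder)
SAVE_DIR = "qwen2.5-7b-instruct-w8a8"            # output directory (placeholder)

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# SmoothQuant migrates activation outliers into the weights so that both weights
# and activations can be quantized to INT8; GPTQ then quantizes the linear weights.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

# One-shot calibration; the dataset and sample counts here are illustrative.
oneshot(
    model=model,
    dataset="open_platypus",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

The saved checkpoint can then be loaded as usual (`LLM(model=SAVE_DIR)` or `vllm serve`); vLLM reads the compressed-tensors config and runs INT8 kernels directly on the A100 tensor cores, which is where the throughput gain over AWQ comes from.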

Would you like a fuller step-by-step guide on switching to INT8 quantization, or more detail on the AWQ performance bottlenecks?

Hint: Mention @RunLLM in the post for followups.

Thanks! This is very helpful! I used to have a LoRA fine-tuned Mistral-7B-0.2 model that was trained on V100 GPUs, and I then ran inference on an A100 with vLLM. In that case, the quantized version achieved better throughput than the non-quantized version, as expected. Why did it work there, given that inference was also on an A100? Does the GPU used for fine-tuning matter, or does Mistral have a different architecture that makes quantization more effective?

@RunLLM I added a follow-up question above.

Could you clarify your follow-up question? Are you asking for a step-by-step guide to use INT8 quantization with vLLM, or do you want a deeper explanation of why AWQ quantization can reduce throughput on A100 GPUs? Please specify your focus so I can provide the most relevant details.
