No throughput improvement for quantized Qwen-2.5-7B-Instruct?

I performed LoRA-based fine-tuning on the Qwen-2.5-7B-Instruct model using A100 GPUs, and I'm also running inference on an A100. To improve tokens per second, I quantized the model with AWQ, but throughput is slightly lower than with the non-quantized model. Any suggestions?
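For reference, my comparison looks roughly like the sketch below, using vLLM's offline `LLM` API; the model paths and prompts are placeholders, not my exact benchmark.

```python
import time
from typing import Optional

from vllm import LLM, SamplingParams

# Placeholder paths for the merged LoRA checkpoint and its AWQ-quantized copy.
FP16_MODEL = "path/to/qwen2.5-7b-instruct-merged"
AWQ_MODEL = "path/to/qwen2.5-7b-instruct-awq"

# A large, uniform batch to keep the GPU busy during decoding.
prompts = ["Summarize the benefits of unit testing."] * 256
sampling_params = SamplingParams(temperature=0.0, max_tokens=256)


def measure_throughput(model_path: str, quantization: Optional[str] = None) -> float:
    """Return generated tokens per second for one batched generate() call."""
    llm = LLM(model=model_path, quantization=quantization)
    start = time.perf_counter()
    outputs = llm.generate(prompts, sampling_params)
    elapsed = time.perf_counter() - start
    generated = sum(len(out.outputs[0].token_ids) for out in outputs)
    return generated / elapsed


# In practice I run each model in a separate process so GPU memory is fully
# released between runs; they are shown back to back here only for brevity.
print("FP16 tok/s:", measure_throughput(FP16_MODEL))
print("AWQ  tok/s:", measure_throughput(AWQ_MODEL, quantization="awq"))
```

Both runs use the same prompts and greedy decoding so the generated token counts are comparable.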

Quantizing with AWQ typically reduces memory usage, but it does not necessarily improve throughput and can even reduce it, especially on A100 GPUs at large batch sizes. AWQ stores weights in 4-bit form and must dequantize them to FP16 before each matrix multiplication. At small batch sizes or with short prompts, decoding is memory-bandwidth-bound, so reading smaller weights is a win; at large batch sizes or with long contexts, the workload is already compute-bound, and the dequantization overhead outweighs the bandwidth savings, leaving throughput similar to or lower than the FP16 model. This is a known limitation and is discussed in detail in the vLLM community and documentation.

For higher throughput, consider INT8 (W8A8) quantization, which allows native INT8 computation on the tensor cores and is better optimized for throughput on A100 and newer GPUs. See the INT8 W8A8 documentation, the quantization performance discussions, the AWQ performance issues, and the quantization method comparison.
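If you want to try the W8A8 route, a rough sketch with llm-compressor looks like the following. Treat it as a starting point rather than a verified recipe: the import paths, calibration dataset, and modifier arguments vary between llm-compressor versions, and the model and output paths below are placeholders.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Import paths may differ between llm-compressor versions; check your install.
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

MODEL_ID = "path/to/qwen2.5-7b-instruct-merged"  # merged LoRA checkpoint (placeholder)
SAVE_DIR = "qwen2.5-7b-instruct-w8a8"            # output directory (placeholder)

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# SmoothQuant migrates activation outliers into the weights so that both weights
# and activations can be quantized to INT8; GPTQ then quantizes the linear weights.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

# One-shot calibration; the dataset and sample counts here are illustrative.
oneshot(
    model=model,
    dataset="open_platypus",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

The saved checkpoint can then be loaded as usual (`LLM(model=SAVE_DIR)` or `vllm serve`); vLLM reads the compressed-tensors config and runs INT8 kernels directly on the A100 tensor cores, which is where the throughput gain over AWQ comes from.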

Would you like a fuller step-by-step guide on switching to INT8 quantization, or more detail on the AWQ performance bottlenecks?

Hint: Mention @RunLLM in the post for followups.

Thanks! This is very helpful! I used to have a LoRA fine-tuned Mistral-7B-0.2 model that was trained on V100 GPUs, and I then ran inference on an A100 with vLLM. In that case, the quantized version achieved better throughput than the non-quantized version, as expected. Why did it work there, given that inference was also on an A100? Does the GPU used for fine-tuning matter, or does Mistral have a different architecture that makes quantization more effective?

@RunLLM I added a follow-up question above.

Could you clarify your follow-up question? Are you asking for a step-by-step guide to use INT8 quantization with vLLM, or do you want a deeper explanation of why AWQ quantization can reduce throughput on A100 GPUs? Please specify your focus so I can provide the most relevant details.
