I’m currently deploying the Qwen2.5-VL-32B-Instruct-AWQ model using vLLM on an RTX A6000 GPU with CUDA 12.7. The typical workload includes:
- Average input tokens: ~2000
- Average output tokens: ~100
- Batch size: 1
- VRAM usage: ~44GB (measured as shown in the sketch after this list)
- Response time: around 5–9 seconds per request
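The VRAM number above is read straight off the GPU with a quick NVML query; a minimal sketch of how I do it (assuming the nvidia-ml-py / pynvml package is installed and the A6000 is device 0):

```python
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumes the A6000 is GPU 0
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"GPU memory used: {mem.used / 1024**3:.1f} GiB of {mem.total / 1024**3:.1f} GiB")
pynvml.nvmlShutdown()
```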
The vLLM launch command I’m using:
```bash
vllm serve Qwen/Qwen2.5-VL-32B-Instruct-AWQ \
  --port 8000 --host 0.0.0.0 \
  --dtype bfloat16 \
  --api-key cbpai \
  --limit-mm-per-prompt image=5,video=5 \
  --max_model_len 70000
```
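For context, this is roughly how I measure the per-request latency reported above, through the OpenAI-compatible endpoint that this command exposes (a minimal sketch; it assumes the `openai` Python package is installed, and `test.jpg` is a placeholder path for one of my input images):

```python
import base64
import time

from openai import OpenAI  # pip install openai

# Points at the local vLLM server started with the command above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="cbpai")

with open("test.jpg", "rb") as f:  # placeholder image path
    image_b64 = base64.b64encode(f.read()).decode()

start = time.perf_counter()
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-32B-Instruct-AWQ",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                {"type": "text", "text": "Describe this image in detail."},
            ],
        }
    ],
    max_tokens=100,  # matches the ~100 output tokens in my workload
)
elapsed = time.perf_counter() - start

print(f"Latency: {elapsed:.2f}s")
print(f"Token usage: {response.usage}")
```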
However, when I switch to the following configuration:
```bash
vllm serve Qwen/Qwen2.5-VL-32B-Instruct-AWQ \
  --port 8000 --host 0.0.0.0 \
  --dtype float16 \
  --api-key cbpai \
  --limit-mm-per-prompt image=5,video=5 \
  --max_model_len 70000 \
  --gpu_memory_utilization 0.95 \
  --quantization awq
```
the VRAM usage increases to ~47GB, and the response time slows down significantly, to 30–40 seconds per request.
My questions are:
- Is there an issue with the second configuration that’s causing such a performance drop?
- When I try to run `--quantization=awq` with `--dtype=bfloat16`, vLLM throws an error stating that bfloat16 is not supported with AWQ. Could you clarify why this combination is not allowed?
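For reference, this is how I check which dtype and quantization settings the checkpoint itself declares in its `config.json` (a minimal sketch using `huggingface_hub`; it assumes network access to the Hugging Face Hub):

```python
import json

from huggingface_hub import hf_hub_download  # pip install huggingface_hub

# Download only the config file of the AWQ checkpoint.
config_path = hf_hub_download("Qwen/Qwen2.5-VL-32B-Instruct-AWQ", "config.json")
with open(config_path) as f:
    config = json.load(f)

# The dtype and quantization settings the checkpoint declares.
print("torch_dtype:", config.get("torch_dtype"))
print("quantization_config:", json.dumps(config.get("quantization_config"), indent=2))
```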