Performance Degradation and Compatibility Issues with AWQ Quantization in vLLM (Qwen2.5-VL-32B)

I’m currently deploying the Qwen2.5-VL-32B-Instruct-AWQ model using vLLM on an RTX A6000 GPU with CUDA 12.7. The typical workload includes:

  • Average input tokens: ~2000
  • Average output tokens: ~100
  • Batch size: 1
  • VRAM usage: ~44GB
  • Response time: around 5–9 seconds per request

The vLLM launch command I’m using:

vllm serve Qwen/Qwen2.5-VL-32B-Instruct-AWQ \
  --port 8000 --host 0.0.0.0 \
  --dtype bfloat16 \
  --api-key cbpai \
  --limit-mm-per-prompt image=5,video=5 \
  --max_model_len 70000

However, when I switch to the following configuration:

vllm serve Qwen/Qwen2.5-VL-32B-Instruct-AWQ \
  --port 8000 --host 0.0.0.0 \
  --dtype float16 \
  --api-key cbpai \
  --limit-mm-per-prompt image=5,video=5 \
  --max_model_len 70000 \
  --gpu_memory_utilization 0.95 \
  --quantization awq

the VRAM usage increases to ~47GB and the response time slows significantly, to 30–40 seconds per request.

My questions are:

  1. Is there an issue with the second configuration that’s causing such a performance drop?
  2. When I try to run --quantization=awq with --dtype=bfloat16, vLLM throws an error stating that bfloat16 is not supported with awq. Could you clarify why this combination is not allowed?

vLLM’s original AWQ GEMM kernels only support float16, so combining --quantization=awq with --dtype=bfloat16 is rejected with an error; this is a kernel limitation, not a misconfiguration. The slowdown in the second configuration comes from the same flag: when --quantization is omitted, vLLM detects the AWQ checkpoint and, on Ampere-or-newer GPUs such as the RTX A6000, selects the faster Marlin-based kernels (awq_marlin), which also accept bfloat16 (which is why your first command runs). Passing --quantization awq explicitly forces the older, less optimized AWQ kernels, which are slower than the Marlin path and can even be slower than unquantized inference on some hardware. The higher VRAM usage is expected: raising --gpu_memory_utilization to 0.95 tells vLLM to pre-allocate a larger fraction of GPU memory for weights plus KV cache, so reported usage grows regardless of kernel choice.
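
If the goal is simply the fastest AWQ serving on this card, a reasonable sketch is to keep your original bfloat16 command, add the memory-utilization flag, and let vLLM pick the kernel itself. The spelling below assumes a recent vLLM release; --quantization awq_marlin can also be passed explicitly if your version lists it and you want to pin the kernel:

vllm serve Qwen/Qwen2.5-VL-32B-Instruct-AWQ \
  --port 8000 --host 0.0.0.0 \
  --dtype bfloat16 \
  --api-key cbpai \
  --limit-mm-per-prompt image=5,video=5 \
  --max_model_len 70000 \
  --gpu_memory_utilization 0.95

This is essentially your first configuration plus --gpu_memory_utilization 0.95; the quantization method is read from the checkpoint’s config, so no --quantization flag is needed.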
