I’m currently deploying the Qwen2.5-VL-32B-Instruct-AWQ model using vLLM on an RTX A6000 GPU with CUDA 12.7. The typical workload includes:
- Average input tokens: ~2000
- Average output tokens: ~100
- Batch size: 1
- VRAM usage: ~44GB (measured as shown in the sketch after this list)
- Response time: around 5–9 seconds per request
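The VRAM number above is read straight off the GPU with a quick NVML query; a minimal sketch of how I do it (assuming the nvidia-ml-py / pynvml package is installed and the A6000 is device 0):

```python
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumes the A6000 is GPU 0
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(f"GPU memory used: {mem.used / 1024**3:.1f} GiB of {mem.total / 1024**3:.1f} GiB")
pynvml.nvmlShutdown()
```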
The vLLM launch command I’m using:
```bash
vllm serve Qwen/Qwen2.5-VL-32B-Instruct-AWQ \
  --port 8000 --host 0.0.0.0 \
  --dtype bfloat16 \
  --api-key cbpai \
  --limit-mm-per-prompt image=5,video=5 \
  --max_model_len 70000
```
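For context, this is roughly how I measure the per-request latency reported above, through the OpenAI-compatible endpoint that this command exposes (a minimal sketch; it assumes the `openai` Python package is installed, and `test.jpg` is a placeholder path for one of my input images):

```python
import base64
import time

from openai import OpenAI  # pip install openai

# Points at the local vLLM server started with the command above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="cbpai")

with open("test.jpg", "rb") as f:  # placeholder image path
    image_b64 = base64.b64encode(f.read()).decode()

start = time.perf_counter()
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-32B-Instruct-AWQ",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                {"type": "text", "text": "Describe this image in detail."},
            ],
        }
    ],
    max_tokens=100,  # matches the ~100 output tokens in my workload
)
elapsed = time.perf_counter() - start

print(f"Latency: {elapsed:.2f}s")
print(f"Token usage: {response.usage}")
```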
However, when I switch to the following configuration:
```bash
vllm serve Qwen/Qwen2.5-VL-32B-Instruct-AWQ \
  --port 8000 --host 0.0.0.0 \
  --dtype float16 \
  --api-key cbpai \
  --limit-mm-per-prompt image=5,video=5 \
  --max_model_len 70000 \
  --gpu_memory_utilization 0.95 \
  --quantization awq
```
the VRAM usage increases to ~47GB, and the response time slows down significantly, to 30–40 seconds per request.
My questions are:
- Is there an issue with the second configuration that’s causing such a performance drop?
- When I try to run `--quantization=awq` with `--dtype=bfloat16`, vLLM throws an error stating that bfloat16 is not supported with AWQ. Could you clarify why this combination is not allowed?
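For reference, this is how I check which dtype and quantization settings the checkpoint itself declares in its `config.json` (a minimal sketch using `huggingface_hub`; it assumes network access to the Hugging Face Hub):

```python
import json

from huggingface_hub import hf_hub_download  # pip install huggingface_hub

# Download only the config file of the AWQ checkpoint.
config_path = hf_hub_download("Qwen/Qwen2.5-VL-32B-Instruct-AWQ", "config.json")
with open(config_path) as f:
    config = json.load(f)

# The dtype and quantization settings the checkpoint declares.
print("torch_dtype:", config.get("torch_dtype"))
print("quantization_config:", json.dumps(config.get("quantization_config"), indent=2))
```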