[Field Report] AWQ on RTX 5060 Ti (SM_120 / Blackwell) — awq_marlin + TRITON_ATTN working

After a lot of trial and error I finally got AWQ models running stably on my RTX 5060 Ti in WSL2. Sharing this because I couldn’t find any documentation on this specific combination anywhere. Hope it helps the team and other Blackwell users.


Setup:

  • GPU: NVIDIA GeForce RTX 5060 Ti (compute capability 12.0 / SM_120 / Blackwell)
  • OS: Windows 11 + WSL2 (Ubuntu)
  • PyTorch: 2.10.0+cu130
  • vLLM: 0.17.2rc1.dev45+g761e0aa7a
  • Frontend: Chatbox on Windows → http://localhost:8000/v1

Root cause

vLLM forces Blackwell GPUs (SM_120) to bfloat16. Standard AWQ requires float16, so it crashes immediately with a pydantic ValidationError. FlashAttention has no SM_120 support yet either.

Confirmed NOT working on SM_120:

  • --quantization awq → crashes (requires float16, SM_120 forces bfloat16)
  • --quantization gptq → broken
  • BitsAndBytes → garbage/corrupt output
  • FlashAttention → not supported on SM_120
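
To make the selection logic above concrete, here is a small hypothetical helper (not part of vLLM, just an illustration of the rule from this report): plain `awq` needs float16, which SM_120 refuses, so Blackwell falls back to the Marlin kernel.

```python
def pick_awq_quantization(cc_major: int, cc_minor: int) -> str:
    """Pick a vLLM --quantization value for an AWQ checkpoint.

    Hypothetical helper summarizing this report: on SM_120
    (compute capability 12.0, Blackwell) the plain 'awq' path
    requires float16, which these GPUs are not allowed to use,
    so route to the Marlin kernel instead.
    """
    if (cc_major, cc_minor) >= (12, 0):
        return "awq_marlin"
    return "awq"
```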

Working solution — two flags:

vllm serve <model> \
  --host 0.0.0.0 \
  --port 8000 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 4096 \
  --quantization awq_marlin \
  --attention-backend TRITON_ATTN
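
Once the server is up, any OpenAI-compatible client can talk to it at http://localhost:8000/v1 (that is how Chatbox connects in my setup). A minimal sketch of the request payload; the model name and message text are just placeholders:

```python
import json

# OpenAI-compatible chat request for the vLLM server started above.
# Model name and content are illustrative, not required values.
payload = {
    "model": "Qwen/Qwen2.5-14B-Instruct-AWQ",
    "messages": [{"role": "user", "content": "Hello from WSL2!"}],
    "max_tokens": 128,
}

# POST this as JSON to http://localhost:8000/v1/chat/completions,
# e.g. with urllib.request or any OpenAI SDK pointed at that base URL.
print(json.dumps(payload, indent=2))
```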

Confirmed working — three architectures, three companies:

Model                                              | Family       | Size | First token latency
hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 | Meta / Llama | 8B   | 338 ms
casperhansen/mistral-nemo-instruct-2407-awq        | Mistral      | 12B  | 437 ms
Qwen/Qwen2.5-14B-Instruct-AWQ                      | Qwen         | 14B  | 520 ms

Pattern: larger model = higher first-token latency; all three are stable on the same two flags.

Performance on Qwen 2.5 14B AWQ:

  • Generation throughput: ~30 tokens/s (peak)
  • GPU KV cache usage: 1.5%
  • VRAM: 16 GB

Note on Gemma 2:
Gemma 2 AWQ loads fine with awq_marlin + TRITON_ATTN, but Gemma 2's chat template does not support the system role. Leave the system prompt empty in your frontend to avoid “System role not supported” errors — this is a Gemma 2 limitation, not a vLLM issue.
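
If your frontend insists on a system prompt, a common workaround (assumed here, not from the Gemma docs) is to fold it into the first user turn so no system-role message is ever sent:

```python
def gemma2_messages(system_prompt: str, user_text: str) -> list[dict]:
    """Build a Gemma 2-safe message list: never emit a 'system' role.

    Sketch of a workaround: if a system prompt is given, prepend it
    to the first user message instead of sending a separate system turn.
    """
    content = f"{system_prompt}\n\n{user_text}" if system_prompt else user_text
    return [{"role": "user", "content": content}]
```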


Hope this is useful for SM_120 / Blackwell support going forward. Happy to provide more data or test specific models if helpful.

Thank you for sharing your findings. Your experience aligns with current vLLM support: on Blackwell (SM_120) GPUs, standard AWQ quantization fails due to enforced bfloat16, but the Marlin kernel (--quantization awq_marlin) works and is the recommended path for AWQ models on these GPUs. Using --attention-backend TRITON_ATTN is also correct, as FlashAttention does not yet support SM_120. Your reported performance and model compatibility are consistent with recent vLLM updates and user reports, and your workaround is the officially suggested method for AWQ on Blackwell hardware as of vLLM 0.17.x and CUDA 13+.

For others: always use awq_marlin quantization and avoid FlashAttention on Blackwell. If you encounter issues with chat templates (e.g., Gemma 2), these are model-specific, not vLLM bugs. For more details or troubleshooting steps, let me know.
