[Field Report] AWQ on RTX 5060 Ti (SM_120 / Blackwell) — awq_marlin + TRITON_ATTN working

After a lot of trial and error I finally got AWQ models running stably on my RTX 5060 Ti in WSL2. Sharing this because I couldn’t find any documentation on this specific combination anywhere. Hope it helps the team and other Blackwell users.


Setup:

  • GPU: NVIDIA GeForce RTX 5060 Ti (compute capability 12.0 / SM_120 / Blackwell)
  • OS: Windows 11 + WSL2 (Ubuntu)
  • PyTorch: 2.10.0+cu130
  • vLLM: 0.17.2rc1.dev45+g761e0aa7a
  • Frontend: Chatbox on Windows → http://localhost:8000/v1

Root cause

vLLM forces Blackwell GPUs (SM_120) to bfloat16. Standard AWQ requires float16, so it crashes immediately with a pydantic ValidationError. FlashAttention has no SM_120 support yet either.

Confirmed NOT working on SM_120:

  • --quantization awq → crashes (requires float16, SM_120 forces bfloat16)
  • --quantization gptq → broken
  • BitsAndBytes → garbage/corrupt output
  • FlashAttention → not supported on SM_120
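
To make the selection logic above concrete, here is a small hypothetical helper (not part of vLLM, just an illustration of the rule from this report): plain `awq` needs float16, which SM_120 refuses, so Blackwell falls back to the Marlin kernel.

```python
def pick_awq_quantization(cc_major: int, cc_minor: int) -> str:
    """Pick a vLLM --quantization value for an AWQ checkpoint.

    Hypothetical helper summarizing this report: on SM_120
    (compute capability 12.0, Blackwell) the plain 'awq' path
    requires float16, which these GPUs are not allowed to use,
    so route to the Marlin kernel instead.
    """
    if (cc_major, cc_minor) >= (12, 0):
        return "awq_marlin"
    return "awq"
```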

Working solution — two flags:

vllm serve <model> \
  --host 0.0.0.0 \
  --port 8000 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 4096 \
  --quantization awq_marlin \
  --attention-backend TRITON_ATTN
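
Once the server is up, any OpenAI-compatible client can talk to it at http://localhost:8000/v1 (that is how Chatbox connects in my setup). A minimal sketch of the request payload; the model name and message text are just placeholders:

```python
import json

# OpenAI-compatible chat request for the vLLM server started above.
# Model name and content are illustrative, not required values.
payload = {
    "model": "Qwen/Qwen2.5-14B-Instruct-AWQ",
    "messages": [{"role": "user", "content": "Hello from WSL2!"}],
    "max_tokens": 128,
}

# POST this as JSON to http://localhost:8000/v1/chat/completions,
# e.g. with urllib.request or any OpenAI SDK pointed at that base URL.
print(json.dumps(payload, indent=2))
```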

Confirmed working — three architectures, three companies:

Model                                              | Family       | Size | First token latency
hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 | Meta / Llama | 8B   | 338 ms
casperhansen/mistral-nemo-instruct-2407-awq        | Mistral      | 12B  | 437 ms
Qwen/Qwen2.5-14B-Instruct-AWQ                      | Qwen         | 14B  | 520 ms

Pattern: larger model = higher first-token latency; all three are stable on the same two flags.

Performance on Qwen 2.5 14B AWQ:

  • Generation throughput: ~30 tokens/s (peak)
  • GPU KV cache usage: 1.5%
  • VRAM: 16 GB

Note on Gemma 2:
Gemma 2 AWQ loads fine with awq_marlin + TRITON_ATTN, but Gemma 2's chat template does not support the system role. Leave the system prompt empty in your frontend to avoid “System role not supported” errors — this is a Gemma 2 limitation, not a vLLM issue.
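
If your frontend insists on a system prompt, a common workaround (assumed here, not from the Gemma docs) is to fold it into the first user turn so no system-role message is ever sent:

```python
def gemma2_messages(system_prompt: str, user_text: str) -> list[dict]:
    """Build a Gemma 2-safe message list: never emit a 'system' role.

    Sketch of a workaround: if a system prompt is given, prepend it
    to the first user message instead of sending a separate system turn.
    """
    content = f"{system_prompt}\n\n{user_text}" if system_prompt else user_text
    return [{"role": "user", "content": content}]
```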


Hope this is useful for SM_120 / Blackwell support going forward. Happy to provide more data or test specific models if helpful.

Thank you for sharing your findings. Your experience aligns with current vLLM support: on Blackwell (SM_120) GPUs, standard AWQ quantization fails due to enforced bfloat16, but the Marlin kernel (--quantization awq_marlin) works and is the recommended path for AWQ models on these GPUs. Using --attention-backend TRITON_ATTN is also correct, as FlashAttention does not yet support SM_120. Your reported performance and model compatibility are consistent with recent vLLM updates and user reports, and your workaround is the officially suggested method for AWQ on Blackwell hardware as of vLLM 0.17.x and CUDA 13+.

For others: always use awq_marlin quantization and avoid FlashAttention on Blackwell. If you encounter issues with chat templates (e.g., Gemma 2), these are model-specific, not vLLM bugs. For more details or troubleshooting steps, let me know.
