A bit of frustration with Quantization

RunLLM · October 13, 2025, 8:29pm

vLLM is usable for smaller models and consumer GPUs, including Ada (RTX 4500) and Ampere cards, but quantization support depends on both the hardware and the quantization method. For INT4 (W4A16) quantization, vLLM officially supports NVIDIA GPUs with compute capability 8.0 or higher (Ampere, Ada, Hopper, Blackwell), and you can quantize and run models using the llm-compressor workflow. Blackwell (RTX 6000) support for quantization is still limited, however, and some quantization methods (e.g., FP8) are not yet fully supported on that architecture, as you observed. AutoAWQ is no longer actively maintained; llm-compressor is the recommended tool for INT4 quantization with vLLM.
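To make the hardware requirement concrete, here is a minimal sketch of a compute-capability check. It assumes the threshold is inclusive (8.0 or higher), since Ampere's A100 reports compute capability 8.0. The helper names `supports_w4a16` and `describe` are hypothetical and are not part of vLLM or llm-compressor.

```python
from typing import Tuple

# Minimum compute capability for INT4 (W4A16) as described in the post.
# Assumption: the bound is inclusive, because Ampere starts at 8.0.
MIN_W4A16_CAPABILITY: Tuple[int, int] = (8, 0)


def supports_w4a16(capability: Tuple[int, int]) -> bool:
    """Return True if a (major, minor) capability meets the W4A16 minimum."""
    # Tuples compare element by element, so (8, 9) >= (8, 0) and (7, 5) < (8, 0).
    return tuple(capability) >= MIN_W4A16_CAPABILITY


def describe(name: str, capability: Tuple[int, int]) -> str:
    """Build a one-line, human-readable verdict for a GPU (illustrative helper)."""
    major, minor = capability
    verdict = "meets" if supports_w4a16(capability) else "is below"
    return f"{name} (SM {major}.{minor}) {verdict} the W4A16 minimum of 8.0"
```

On a machine with PyTorch and a CUDA device, `torch.cuda.get_device_capability(0)` returns a `(major, minor)` tuple that can be passed straight to `supports_w4a16`. Passing this check only covers the INT4 requirement quoted above; it says nothing about whether other methods such as FP8 or INT8 work on a given architecture.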

For smaller models, you can run quantized models in vLLM on consumer GPUs (e.g., INT4 or INT8 formats, or AWQ and GPTQ checkpoints), provided the quantization method is one your hardware supports. The process is documented and stable for INT4/INT8 on Ada, Ampere, and Hopper, but the ecosystem is still evolving, and some tools may introduce breaking changes or offer only limited support for the latest GPUs. If you share your build scripts and error logs, the community may be able to help debug specific issues. For step-by-step quantization and inference instructions, see the official vLLM INT4 guide.
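As a rough illustration of the inference side, the sketch below loads an already-quantized checkpoint with vLLM's offline `LLM` API and generates text. It assumes the checkpoint directory was produced beforehand (for example with llm-compressor) and that vLLM picks up the quantization settings from the checkpoint's own config. The path `./my-model-W4A16`, the helper names, and the default values are placeholders, not recommendations.

```python
from typing import Any, Dict, List


def engine_kwargs(model_path: str, max_model_len: int = 4096) -> Dict[str, Any]:
    """Collect engine arguments for vllm.LLM (values here are illustrative)."""
    return {
        "model": model_path,             # local dir or Hub ID of the quantized checkpoint
        "max_model_len": max_model_len,  # cap the context length to save KV-cache memory
        "gpu_memory_utilization": 0.90,  # fraction of GPU memory vLLM may reserve
    }


def generate(model_path: str, prompts: List[str], max_tokens: int = 64) -> List[str]:
    """Run greedy generation on a quantized checkpoint and return the texts."""
    # Imported lazily so this module can be loaded on machines without vLLM or a GPU.
    from vllm import LLM, SamplingParams

    llm = LLM(**engine_kwargs(model_path))
    params = SamplingParams(temperature=0.0, max_tokens=max_tokens)
    outputs = llm.generate(prompts, params)
    # Each RequestOutput carries one or more completions; take the first.
    return [out.outputs[0].text for out in outputs]


if __name__ == "__main__":
    # Hypothetical path: replace with your own quantized checkpoint directory.
    for text in generate("./my-model-W4A16", ["Explain INT4 quantization briefly."]):
        print(text)
```

If loading fails on a particular GPU, the error message together with these engine arguments is the kind of detail worth posting when asking for help.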

Would you like a detailed step-by-step example for quantizing and running a small model on your RTX 4500?

