vLLM works with smaller models on consumer and workstation GPUs, including Ada (RTX 4500) and Ampere cards, but quantization support depends on both the hardware and the method. For INT4 (W4A16) quantization, vLLM officially supports NVIDIA GPUs with compute capability 8.0 or higher (Ampere, Ada, Hopper, Blackwell), and you can quantize and run models using the llm-compressor workflow. However, quantization support on Blackwell (RTX 6000) is still limited, and some methods (e.g., FP8) are not yet fully supported on that architecture, as you observed. AutoAWQ is no longer actively maintained; llm-compressor is the recommended tool for INT4 quantization with vLLM.
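To check whether a given card clears the compute-capability floor mentioned above, a small script along these lines may help. It is only a sketch: `supports_w4a16`, `local_capability`, and `MIN_W4A16_CAPABILITY` are hypothetical names, and it assumes the 8.0 floor is inclusive. It reads the capability through PyTorch's `torch.cuda.get_device_capability` when PyTorch and a CUDA device are available.

```python
from typing import Optional, Tuple

# Assumption: the 8.0 floor is inclusive, since Ampere data-center parts report 8.0.
MIN_W4A16_CAPABILITY: Tuple[int, int] = (8, 0)


def supports_w4a16(major: int, minor: int) -> bool:
    """Return True if (major, minor) meets the assumed W4A16 floor."""
    return (major, minor) >= MIN_W4A16_CAPABILITY


def local_capability() -> Optional[Tuple[int, int]]:
    """Return the compute capability of GPU 0, or None if it cannot be queried."""
    try:
        import torch  # optional dependency; only needed to query a real device
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


if __name__ == "__main__":
    cap = local_capability()
    if cap is None:
        print("No CUDA device visible (or PyTorch is not installed).")
    else:
        ok = supports_w4a16(*cap)
        print(f"Compute capability {cap[0]}.{cap[1]}: W4A16 supported = {ok}")
```

An Ada card such as the RTX 4500 reports 8.9 and passes this check, while a Turing card (7.5) does not. Passing the check only says the hardware is in the supported range; it does not guarantee that every quantization method works on that architecture.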
For smaller models, you can run quantized checkpoints (e.g., INT4, INT8, AWQ, GPTQ) with vLLM on consumer GPUs, provided the quantization method matches what your hardware supports. The INT4/INT8 process is documented and stable on Ampere, Ada, and Hopper, but the ecosystem is still evolving, so some tools may introduce breaking changes or lag in support for the newest GPUs. If you share your build scripts and error logs, the community may be able to help debug specific issues. For step-by-step quantization and inference instructions, see the official vLLM INT4 guide.
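As a rough illustration of the llm-compressor workflow, the sketch below quantizes a small model to W4A16 with `GPTQModifier` and `oneshot`, then loads the result in vLLM. Treat it as a starting point, not the official recipe. The model ID, the `open_platypus` calibration dataset, the output directory, and the function names `quantize_w4a16` and `generate_once` are all illustrative assumptions, and the import path for `oneshot` has moved between llm-compressor releases, so check it against your installed version.

```python
def quantize_w4a16(
    model_id: str = "TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # illustrative small model
    output_dir: str = "TinyLlama-1.1B-Chat-v1.0-W4A16",    # illustrative output path
    dataset: str = "open_platypus",                        # illustrative calibration set
) -> str:
    """Quantize model_id to W4A16 and save it to output_dir; returns output_dir."""
    # Imported lazily so this file can be loaded without llm-compressor installed.
    # Assumption: recent releases export oneshot at the package top level.
    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import GPTQModifier

    # 4-bit weights, 16-bit activations on all Linear layers except the LM head.
    recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

    oneshot(
        model=model_id,
        dataset=dataset,
        recipe=recipe,
        output_dir=output_dir,
        max_seq_length=2048,
        num_calibration_samples=512,
    )
    return output_dir


def generate_once(model_dir: str, prompt: str, max_tokens: int = 64) -> str:
    """Load a quantized checkpoint in vLLM and return one completion."""
    # Imported lazily so this file can be loaded without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model_dir)
    outputs = llm.generate([prompt], SamplingParams(max_tokens=max_tokens))
    return outputs[0].outputs[0].text
```

Calling `quantize_w4a16()` and passing its return value to `generate_once(...)` runs the two stages back to back. Both need a supported GPU, and calibration can take a while even for small models.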
Would you like a detailed step-by-step example for quantizing and running a small model on your RTX 4500?