A bit of frustration with Quantization

For almost a week now I have been trying to build some LLMs with a quantization equal or similar to W4A16, e.g. with AutoAWQ but also with LLM Compressor. I tried it on an Ada NVIDIA card (RTX 4500) and on a new Blackwell RTX 6000. I gave up on the latter, since Blackwell support is not there yet.

However, it is very frustrating that I was not successful: all my attempts to get this going on the RTX 4500 failed as well. LLM Compressor seems to be experimental, and AutoAWQ is unfortunately no longer maintained and has version conflicts.

So why am I trying this:

  • Because Ollama/llama.cpp and LM Studio are not a solution for me, yet all the smaller models (those that would fit on a 24 GB GPU) are quantized to formats NOT supported by vLLM, only by those two products.
  • Because I cannot afford to buy one or two H100 or H200 Hopper chips for 70K or more.

So is vLLM usable for smaller models, and can one build such a model, or is this software only for large enterprises that can spend 250K on Hopper machines?
I currently have the impression that everything is changing and nothing around quantization is really stable.

If anybody is interested I can paste my build scripts here (shell, Python and a Dockerfile). I build on a Rocky Linux machine with a Ryzen 9900, 128 GB RAM, a 2 TB SSD and an RTX 4500 with 24 GB VRAM. I also have an RTX 6000, but that one seems to be totally unsupported by vLLM.

Any help is greatly appreciated…

vLLM is usable for smaller models and consumer GPUs, including Ada (RTX 4500) and Ampere cards, but quantization support is hardware- and method-dependent. For INT4 (W4A16) quantization, vLLM officially supports NVIDIA GPUs with compute capability > 8.0 (Ampere, Ada, Hopper, Blackwell), and you can quantize and run models using the llm-compressor workflow. However, Blackwell (RTX 6000) quantization support is still limited, and some quantization methods (e.g., FP8) are not yet fully supported on that architecture, as you observed. AutoAWQ is no longer actively maintained; llm-compressor is the recommended tool for INT4 quantization with vLLM.

For smaller models, you can use vLLM with quantized models (e.g., INT4, INT8, AWQ, GPTQ) on consumer GPUs, provided you use the correct quantization method and hardware. The process is documented and stable for INT4/INT8 on Ada/Ampere/Hopper, but the ecosystem is evolving, and some tools may have breaking changes or limited support for the latest GPUs. If you share your build scripts and error logs, the community may help debug specific issues. For step-by-step quantization and inference instructions, see the official vLLM INT4 guide.
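For instance, serving an already-quantized W4A16 checkpoint on a 24 GB card can look roughly like the sketch below; the local path ./apertus8b-w4a16 is a placeholder for whatever directory your quantization run produced, and vLLM should pick up the quantization scheme from the checkpoint's config:

# Minimal sketch: offline inference with a locally quantized W4A16 checkpoint.
# "./apertus8b-w4a16" is a placeholder path; point it at your output directory.
from vllm import LLM, SamplingParams

llm = LLM(
    model="./apertus8b-w4a16",    # quantized checkpoint produced earlier
    max_model_len=4096,           # a smaller context keeps the KV cache modest
    gpu_memory_utilization=0.90,  # lower this if you co-host several models
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain AWQ quantization in one paragraph."], params)
print(outputs[0].outputs[0].text)

Reducing max_model_len (and, if necessary, gpu_memory_utilization) is usually what makes an 8B quantized model comfortable on a 24 GB card.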

Would you like a detailed step-by-step example for quantizing and running a small model on your RTX 4500?

Hint: Mention @RunLLM in the post for followups.

OK, I'll paste one sample here:
I tried to build a 4-bit quantized model from the small original model 'Apertus-8B'. I used the files published on Hugging Face under 'swiss-ai/Apertus-8B-Instruct-2509', and the following code to build the quantization image that I run on my Rocky Linux machine:

# Dockerfile.build_awq_container
FROM docker.io/library/python:3.10-slim

ENV DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y git && rm -rf /var/lib/apt/lists/*

# Install AutoAWQ and dependencies
RUN pip install --upgrade pip \
 && pip install autoawq torch transformers accelerate huggingface_hub

WORKDIR /app
COPY build_autoawq_model.py /app/build_model.py

CMD ["python", "/app/build_model.py"]

Then the Python file that does the build:

# build_autoawq_model.py
import os
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_id = os.environ["MODEL_ID"]
output_dir = os.environ["OUTPUT_DIR"]

print(f"Downloading and quantizing model: {model_id}", flush=True)

# Load the model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Quantize to 4-bit weights with group size 128 (AutoAWQ takes a quant_config dict)
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
model.quantize(tokenizer, quant_config=quant_config)

# Save the quantized model and the tokenizer
model.save_quantized(output_dir)
tokenizer.save_pretrained(output_dir)

print(f"✅ Model quantized and saved to {output_dir}", flush=True)

And the shell script that starts it:

#!/usr/bin/env bash
# build_apertus8b_awq_image.sh

MODEL_ID="swiss-ai/Apertus-8B-Instruct-2509"
MODEL_NAME="apertus8b"
BUILD_CONTAINER_NAME="autoawq-builder-${MODEL_NAME}"
BUILD_IMAGE_NAME="autoawq-builder:${MODEL_NAME}"
BUILD_LOG_DIR="buildlogs"
OUTPUT_DIR="${MODEL_NAME}_AWQ_CACHE"

mkdir -p "$BUILD_LOG_DIR"
mkdir -p "$OUTPUT_DIR"

NOW=$(date '+%Y%m%d-%H%M%S')
LOG_FILE="$BUILD_LOG_DIR/build_${MODEL_NAME}_autoawq_$NOW.log"

# Build the image
podman build -f Dockerfile.build_awq_container -t "$BUILD_IMAGE_NAME" .

# Run the container
stdbuf -oL -eL podman run \
  --rm \
  --name "$BUILD_CONTAINER_NAME" \
  --volume "$(pwd)/$OUTPUT_DIR:/output:Z" \
  --env MODEL_ID="$MODEL_ID" \
  --env OUTPUT_DIR="/output" \
  "$BUILD_IMAGE_NAME" \
  2>&1 | tee "$LOG_FILE"

I got as a result:

Successfully tagged localhost/autoawq-builder:apertus8b
0b418bbdbbdea06b85ca491094091f55ff6a25be8330218785342d96131986af
/usr/local/lib/python3.10/site-packages/awq/__init__.py:21: DeprecationWarning:
I have left this message as the final dev message to help you transition.

Important Notice:
- AutoAWQ is officially deprecated and will no longer be maintained.
- The last tested configuration used Torch 2.6.0 and Transformers 4.51.3.
- If future versions of Transformers break AutoAWQ compatibility, please report the issue to the Transformers project.

Alternative:
- AutoAWQ has been adopted by the vLLM Project: https://github.com/vllm-project/llm-compressor

For further inquiries, feel free to reach out:
- X: https://x.com/casper_hansen_
- LinkedIn: https://www.linkedin.com/in/casper-hansen-804005170/

  warnings.warn(_FINAL_DEV_MESSAGE, category=DeprecationWarning, stacklevel=1)
Traceback (most recent call last):
  File "/app/build_model.py", line 2, in <module>
    from awq import AutoAWQForCausalLM
  File "/usr/local/lib/python3.10/site-packages/awq/__init__.py", line 24, in <module>
    from awq.models.auto import AutoAWQForCausalLM
  File "/usr/local/lib/python3.10/site-packages/awq/models/__init__.py", line 1, in <module>
    from .mpt import MptAWQForCausalLM
  File "/usr/local/lib/python3.10/site-packages/awq/models/mpt.py", line 1, in <module>
    from .base import BaseAWQForCausalLM
  File "/usr/local/lib/python3.10/site-packages/awq/models/base.py", line 49, in <module>
    from awq.quantize.quantizer import AwqQuantizer
  File "/usr/local/lib/python3.10/site-packages/awq/quantize/quantizer.py", line 11, in <module>
    from awq.quantize.scale import apply_scale, apply_clip
  File "/usr/local/lib/python3.10/site-packages/awq/quantize/scale.py", line 12, in <module>
    from transformers.activations import NewGELUActivation, PytorchGELUTanh, GELUActivation
ImportError: cannot import name 'PytorchGELUTanh' from 'transformers.activations' (/usr/local/lib/python3.10/site-packages/transformers/activations.py)

So I tried exactly the same thing with LLM Compressor. I tried several things, but nothing worked. The errors were always of the same type:

  • Library not found
  • Naming conflicts
  • Version conflicts

Hi @ThommyB, please check out the AWQ examples in llm-compressor here. While functionally it operates very similarly to AutoAWQ, the API to run a compression is completely different, as is the resultant model checkpoint. If you encounter errors with those examples, please raise an issue with your environment, a sample script, and the stack trace you're seeing. One constraint of AWQ is that it requires the user to specify mappings, so we may have to add them for Apertus. Raise an issue and we can go from there.
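For orientation, the W4A16 AWQ flow in those examples looks roughly like the sketch below. It is a sketch, not a verified recipe: the scheme string, the built-in "open_platypus" calibration dataset and the oneshot arguments are taken from the published examples and may change between releases:

# Sketch of W4A16 AWQ quantization with llm-compressor, modeled on the
# project's AWQ examples; argument names may differ in the release you install.
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

MODEL_ID = "swiss-ai/Apertus-8B-Instruct-2509"
SAVE_DIR = "Apertus-8B-Instruct-2509-W4A16-AWQ"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Quantize every Linear layer to 4-bit weights / 16-bit activations,
# keeping the output head in full precision.
recipe = [AWQModifier(targets=["Linear"], scheme="W4A16", ignore=["lm_head"])]

oneshot(
    model=model,
    dataset="open_platypus",       # small built-in calibration dataset
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=256,
)

model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)

The result is saved in the compressed-tensors format that vLLM loads directly, which is one of the differences from an AutoAWQ checkpoint mentioned above.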

Hi Brian,
Thanks for the quick reply. I am just a bit confused by all these compression/quantization algorithms and by what is supported and what is not. If I have understood the many documents I read correctly, some W4A16 compression would be ideal for my two GPU families (Ada currently, and Blackwell in the (hopefully) near future). I don't know whether you can confirm that or whether I have understood things wrong. (I am currently not interested in datacenter-grade GPUs, but I can imagine a machine running up to four of the above-mentioned GPUs.)

So the question is which quantization to use. AutoAWQ looked like a good solution, but unfortunately it is discontinued. It seems LLM Compressor is its successor, and that it is only worth using this library and forgetting the others. Is that correct?

I can rerun my builds and paste all the errors here, including the scripts I used, if that helps anybody. I tried the Apertus-8B model; I can also try other models such as 'StarCoder2' and 'Mixtral7B'.

I am not an 'LLM coder'; I use these things as my foundation and build on top of them. I just want to build a couple of containers I can run. If the GPU is big enough I should be able to load two or three models, and vLLM should be exactly the right tool for that.

So I have no idea what the dependencies between a model and its quantization are. This may be very naive (my lack of knowledge in this area), but if you could give me a hint about which other models I could try, it would be appreciated.

So I will try to build Apertus8B again tomorrow and paste the results here.

Hi @ThommyB, check out the examples linked above and make sure you can run one of those first. There is a coder example as well.

As I mentioned above, mappings have to be added to AWQ for each model that doesn't follow the Llama architecture. You can raise a ticket if whichever model you're trying fails due to missing or incorrect mappings.
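To illustrate what those mappings are: each one ties a "smoothing" layer (typically a norm or a projection) to the linear layers whose input scales it balances, expressed as module-name regexes. The sketch below spells out Llama-style defaults explicitly; the AWQMapping import and its field names come from my reading of the llm-compressor source and should be checked against the version you install:

# Illustrative only: explicit AWQ mappings for a Llama-style decoder block.
# AWQMapping and the field names smooth_layer / balance_layers are assumptions
# based on the llm-compressor source and may differ between releases.
from llmcompressor.modifiers.awq import AWQMapping, AWQModifier

llama_style_mappings = [
    # the layernorm before attention balances the q/k/v projections
    AWQMapping(smooth_layer="re:.*input_layernorm",
               balance_layers=["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"]),
    # v_proj output feeds o_proj
    AWQMapping(smooth_layer="re:.*v_proj",
               balance_layers=["re:.*o_proj"]),
    # the layernorm before the MLP balances the gate/up projections
    AWQMapping(smooth_layer="re:.*post_attention_layernorm",
               balance_layers=["re:.*gate_proj", "re:.*up_proj"]),
    # up_proj output feeds down_proj
    AWQMapping(smooth_layer="re:.*up_proj",
               balance_layers=["re:.*down_proj"]),
]

# A model whose module names differ (as Apertus might) would need its own list:
recipe = [AWQModifier(targets=["Linear"], scheme="W4A16",
                      ignore=["lm_head"], mappings=llama_style_mappings)]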

If you are interested in Apertus, you can also try one of our published quantized models here.