Gemma 3 Quantization

I have the GPTQModel quantization steps at GPTQModel - vLLM for Gemma 3 27B and Gemma 3 12B on Hugging Face. However, I get errors and it doesn’t work.

Does vLLM support the Gemma 3 models for quantization?

Note: I have gotten 4-bit bitsandbytes working with the Gemma 3 models.

Yes, vLLM supports Gemma 3 models for quantization, including 4-bit and 8-bit GPTQModel checkpoints as well as bitsandbytes, with some caveats. Official documentation and recent updates confirm support, provided you use the latest vLLM and Hugging Face Transformers installed from source. However, users have reported issues with certain quantized checkpoints (e.g., GGUF format or some QAT/FP8 variants), and you may need to pass additional arguments or use local model paths for GGUF files. For best results, use the latest vLLM main branch with Transformers ≥ 4.50, and make sure your quantized model is in a supported format (not all GGUF or QAT checkpoints are fully supported yet) (docs, supported models, issue 15959, issue 14766).
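For example, here is a minimal sketch of serving Gemma 3 with in-flight 4-bit bitsandbytes quantization (the model ID and context length are example values, and the exact flags can vary slightly between vLLM versions):

# Sketch: in-flight 4-bit bitsandbytes quantization via the OpenAI-compatible server.
# google/gemma-3-27b-it and --max-model-len 8192 are example values;
# some vLLM versions also expect --load-format bitsandbytes.
vllm serve google/gemma-3-27b-it \
  --quantization bitsandbytes \
  --max-model-len 8192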

If you are still getting errors, please specify the error message and model format. Some issues are due to missing config files, unsupported GGUF variants, or needing to pass extra arguments like --hf-config-path and --tokenizer. Would you like a step-by-step troubleshooting guide or example command for your specific case?
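For the GGUF case in particular, a sketch along these lines (the local path and base-model repo are placeholders) often resolves missing-config errors:

# Sketch: serving a local GGUF file; vLLM typically needs the base model's
# config and tokenizer passed explicitly. Path and repo are placeholders.
vllm serve /models/gemma-3-27b-it-Q4_K_M.gguf \
  --hf-config-path google/gemma-3-27b-it \
  --tokenizer google/gemma-3-27b-it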

Will the latest vLLM docker container work? https://hub.docker.com/r/vllm/vllm-openai/tags has v0.9.1 as the latest from 10 days ago.

Or do I need to use the source and build the container myself?

The latest vLLM Docker container (v0.9.1) may not include the very latest fixes and features from the main branch, especially for newly released models or quantization methods like those for Gemma 3. If you need support for the newest quantized Gemma 3 models or want the most up-to-date compatibility, you should build the container from source using the main branch (docs, dockerfile instructions, issue 15959).
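If you want to check what the published image actually ships before deciding, a quick sketch (this assumes python3 is on the image's PATH):

# Sketch: pull the published image and print the vLLM version it contains.
docker pull vllm/vllm-openai:v0.9.1
docker run --rm --entrypoint python3 vllm/vllm-openai:v0.9.1 \
  -c "import vllm; print(vllm.__version__)"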

Would you like step-by-step instructions for building a custom Docker image from source?

If you have the step-by-step instructions that would be helpful.

Thanks.

To build a custom vLLM Docker image from source (main branch) for the latest model and quantization support, follow these steps (docs, dockerfile instructions):

  1. Clone vLLM main branch:
git clone https://github.com/vllm-project/vllm.git
cd vllm
  2. (Optional) Update Transformers for the latest model support:
    Edit requirements/common.txt and replace the transformers line with:
transformers @ git+https://github.com/huggingface/transformers.git

Alternatively, add a Dockerfile layer that installs Transformers from source, as sketched below.
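A minimal sketch of that optional layer, run after the build in step 3 (the tags are placeholders, and depending on the base image you may need uv pip install --system instead of pip install):

# Sketch: extend the source-built image with Transformers from source instead
# of editing requirements/common.txt. Image tags below are placeholders.
cat > Dockerfile.transformers <<'EOF'
FROM vllm/vllm-openai:custom
RUN pip install --no-cache-dir git+https://github.com/huggingface/transformers.git
EOF
docker build -t vllm/vllm-openai:custom-tf -f Dockerfile.transformers .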

  3. Build the Docker image:
DOCKER_BUILDKIT=1 docker build . --target vllm-openai --tag vllm/vllm-openai:custom --file docker/Dockerfile

You can add build args like --build-arg max_jobs=8 --build-arg nvcc_threads=2 for faster builds.

  4. Run the container:
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --env "HUGGING_FACE_HUB_TOKEN=<your_token>" \
  --ipc=host \
  vllm/vllm-openai:custom \
  --model <your_model> [other vLLM args]
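Once the container is up, you can smoke-test it through the OpenAI-compatible API (the model name is whatever you passed to --model):

# Sketch: minimal request against the server started above.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<your_model>", "messages": [{"role": "user", "content": "Hello"}]}'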

Let me know if you want a sample Dockerfile for custom dependencies or more details on any step.
