Which Gemma-3 27B for 32GB VRAM

RunLLM · September 17, 2025, 9:27pm

For a 32GB VRAM GPU, the best option for running Google Gemma-3 27B with vLLM is a quantized model (such as GPTQ or GGUF in 4-bit or 8-bit), since the unquantized 16-bit weights alone exceed 32GB. An 8-bit variant leaves little room for the KV cache, so 4-bit is the safer choice. vLLM supports GGUF models, but that support is highly experimental and may require extra steps, such as providing the original Hugging Face config and tokenizer and installing the latest vLLM and transformers from source. Not all GGUF variants are fully supported, especially vision or multi-modal ones; text-only GGUF models are more suitable. For best stability and performance, GPTQ-quantized models (e.g., ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g) are commonly used and have been reported to work on 24–32GB GPUs with proper settings (e.g., --max-model-len 8192, --gpu-memory-utilization 0.99) (source, source, source).
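To make the memory argument concrete, here is a rough weight-only estimate in Python. It assumes a nominal 27e9 parameters and treats the card as 32 GiB; both are simplifications. Real quantized checkpoints add overhead for scales and zero points, may keep some layers unquantized, and still need room for the KV cache and activations, so treat the numbers as a lower bound rather than a measurement.

```python
# Rough weight-only memory estimate for a dense model at several bit widths.
# Assumption: a nominal 27e9 parameters; actual checkpoints differ slightly,
# and quantized formats add overhead that is not modeled here.

def weight_gib(num_params: float, bits_per_weight: float) -> float:
    """Return the approximate weight footprint in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


NOMINAL_PARAMS = 27e9  # assumed parameter count for a "27B" model
VRAM_GIB = 32          # assumed usable capacity of a 32GB card

# Weight footprint at 16-bit (unquantized), 8-bit, and 4-bit precision.
estimates = {bits: weight_gib(NOMINAL_PARAMS, bits) for bits in (16, 8, 4)}

for bits, gib in estimates.items():
    headroom = VRAM_GIB - gib  # what is left for KV cache, activations, etc.
    print(f"{bits:>2}-bit weights: {gib:5.1f} GiB, headroom: {headroom:5.1f} GiB")
```

Under these assumptions only the 4-bit weights leave a comfortable margin, which is consistent with the reports of 4-bit GPTQ working on 24–32GB GPUs.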

GGUF models can be run with vLLM, but you must use the tokenizer from the base model, and you may need to pass --hf-config-path and --tokenizer explicitly. GGUF support for Gemma-3 is still being finalized, and some users report errors or empty outputs, especially with larger variants or on certain GPUs. For production or otherwise reliability-sensitive use, GPTQ or AWQ quantized models are currently a better fit than GGUF (source, source).

Would you like more detail or example launch commands?
