On a GPU with 32GB of VRAM, the most practical way to run Google Gemma-3 27B with vLLM is a quantized checkpoint (such as GPTQ or GGUF), because the unquantized 16-bit weights alone need roughly 54GB. A 4-bit quantization leaves the most headroom for the KV cache; 8-bit weights take about 27GB on their own and leave little room for context. vLLM can load GGUF models, but that support is highly experimental and may require additional steps, such as providing the original Hugging Face config and tokenizer and installing the latest vLLM and transformers from source. Not all GGUF variants are fully supported, especially vision or multi-modal ones, so text-only GGUF models are the safer choice. For best stability and performance, GPTQ-quantized models (e.g., ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g) are commonly used and have been reported to work on 24–32GB GPUs with suitable settings (e.g., --max-model-len 8192, --gpu-memory-utilization 0.99) (source, source, source).
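To make the sizing concrete, here is a rough sketch that estimates the weight footprint at different bit widths and assembles a `vllm serve` command using the settings mentioned above. It assumes a nominal 27 billion parameters and counts weights only (no KV cache, activations, or quantization metadata), so treat the numbers as a lower bound. The helper names `weight_gib` and `gptq_serve_command` are made up for illustration and are not part of vLLM.

```python
import shlex


def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GiB.

    Ignores KV cache, activations, and quantization overhead
    (scales, zero points), so real usage will be higher.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def gptq_serve_command(
    model: str = "ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g",
    max_model_len: int = 8192,
    gpu_memory_utilization: float = 0.99,
) -> list[str]:
    """Build the argv for `vllm serve` with the settings quoted above."""
    return [
        "vllm", "serve", model,
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]


if __name__ == "__main__":
    # Nominal 27B parameters is an assumption; the real count differs slightly.
    for bits in (16, 8, 4):
        print(f"{bits}-bit weights: ~{weight_gib(27, bits):.1f} GiB")

    # Print a copy-pasteable shell command.
    print(shlex.join(gptq_serve_command()))
```

If the server runs out of memory at startup, lowering `--max-model-len` is usually the first thing to try, since it shrinks the KV cache that vLLM has to reserve.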
GGUF models can be run with vLLM, but you must use the tokenizer from the base model and may need to pass --hf-config-path and --tokenizer explicitly. GGUF support for Gemma-3 is still being finalized, and some users report issues or empty outputs, especially with larger variants or on certain GPUs. For production or any setup where reliability matters, GPTQ or AWQ quantized models are currently a better choice than GGUF (source, source).
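A GGUF launch along those lines might look like the sketch below. The local file name `./gemma-3-27b-it-Q4_K_M.gguf` is a hypothetical placeholder, and using `google/gemma-3-27b-it` as the source of the tokenizer and config is an assumption about which base model your GGUF file was converted from. The helper `gguf_serve_command` is illustrative only. Given the experimental state of GGUF support, this may still fail or produce empty output.

```python
import shlex


def gguf_serve_command(gguf_path: str, base_model: str) -> list[str]:
    """Build the argv for serving a GGUF file with vLLM.

    The tokenizer and Hugging Face config are taken from the base
    model, because the GGUF file alone may not be enough for vLLM.
    """
    return [
        "vllm", "serve", gguf_path,
        "--tokenizer", base_model,
        "--hf-config-path", base_model,
    ]


if __name__ == "__main__":
    cmd = gguf_serve_command(
        "./gemma-3-27b-it-Q4_K_M.gguf",  # hypothetical local file
        "google/gemma-3-27b-it",         # assumed base model repo
    )
    print(shlex.join(cmd))
```

The memory-related flags from the GPTQ case (--max-model-len, --gpu-memory-utilization) can be appended to the same command if needed.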
Would you like more detail or example launch commands?
Sources: