Which is the best google gemma-3 27B model for vllm when having 32gb vram? Can vllm run gguf or which models are more suitable?
For a 32GB VRAM GPU, the best option for running Google Gemma-3 27B with vLLM is a quantized model (such as GPTQ, or GGUF in 4-bit or 8-bit) so it fits within memory constraints. vLLM can load GGUF models, but GGUF support is highly experimental and may require additional steps, such as providing the original Hugging Face config and tokenizer and using the latest vLLM and transformers built from source. Not all GGUF variants (especially vision or multi-modal ones) are fully supported; text-only GGUF models are more suitable. For the best stability and performance, GPTQ-quantized models (e.g., ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g) are commonly used and have been reported to work on 24–32GB GPUs with appropriate settings (e.g., --max-model-len 8192, --gpu-memory-utilization 0.99) (source, source, source).
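For example, a launch command along these lines has been reported to work on a 32GB card (a sketch only; adjust the values to your GPU and vLLM version):
# Serve the 4-bit GPTQ build of Gemma-3 27B on a single 32GB GPU
vllm serve ISTA-DASLab/gemma-3-27b-it-GPTQ-4b-128g \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.99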
GGUF models can be run with vLLM, but you must use the tokenizer from the base model and may need to specify --hf-config-path and --tokenizer explicitly. GGUF support for Gemma-3 is still being finalized, and some users report issues or empty outputs, especially with larger variants or on certain GPUs. For production or more reliable use, GPTQ or AWQ quantized models are currently more suitable than GGUF (source, source).
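If you do want to try GGUF, a minimal sketch looks like the following, assuming a recent vLLM build with GGUF support; the local .gguf path is a placeholder, and the config and tokenizer are borrowed from the original base model repo:
# Experimental: serve a text-only Gemma-3 27B GGUF file (placeholder path),
# reusing the Hugging Face config and tokenizer from the base model
vllm serve /path/to/gemma-3-27b-it-Q4_K_M.gguf \
  --hf-config-path google/gemma-3-27b-it \
  --tokenizer google/gemma-3-27b-it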
Would you like more detail on these launch commands?
Which is faster in concurrent requests, an unquantized model or a quantized model (GPTQ)? At least comparing GPTQ to the unquantized model, there seems to be a short lag before the answer; with the unquantized model the answer from the LLM is instant. Why is that?
Quantized GPTQ models are often slower than unquantized models for concurrent requests in vLLM. This is due to additional overhead from dequantization and less optimized kernels for quantized inference, especially during first-token generation and with larger batch sizes. Users consistently report that unquantized models respond more quickly and that GPTQ models introduce a noticeable initial lag (source, source, source).
The lag occurs because quantized models require extra computation to convert (dequantize) weights back to higher precision during each forward pass, and current quantized kernels are not as efficient as those for unquantized (FP16/BF16) models. This overhead is most pronounced at the start of generation and when handling many concurrent requests (source, source).
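If you want to see the difference yourself, you can probe time-to-first-token against a running server. A rough sketch using curl's timing output, assuming a default vLLM OpenAI-compatible server on localhost:8000 and a model name matching whatever you served:
# Rough time-to-first-token probe: %{time_starttransfer} is seconds until the first response byte
curl -s -o /dev/null -w 'TTFT ~ %{time_starttransfer}s\n' \
  http://localhost:8000/v1/completions \
  -H 'Content-Type: application/json' \
  -d '{"model": "google/gemma-3-27b-it", "prompt": "Hello", "max_tokens": 32, "stream": true}'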
Would you like a deeper technical breakdown or performance benchmarks?
So what is the benefit of using quantized models if they are slower, when the goal is to serve as many concurrent requests as possible? Is the original version then best, if there is enough VRAM?
The main benefit of quantized models is reduced memory usage, allowing you to fit larger models or serve longer context lengths on limited VRAM. However, if your goal is maximum throughput and lowest latency for many concurrent requests—and you have enough VRAM—unquantized (FP16/BF16) models are usually faster and more efficient for concurrent serving (source, source).
Quantization is most useful when VRAM is a bottleneck. If memory is not a constraint, the original (unquantized) model is generally best for high concurrency and speed (source).
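As a back-of-the-envelope check (weights only, ignoring KV cache and activations), this is why an unquantized 27B model cannot fit in 32GB:
# ~27e9 params x 2 bytes (BF16) ≈ 54 GB of weights alone, before KV cache and activations
python3 -c "print(f'{27e9 * 2 / 1e9:.0f} GB of BF16 weights')"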
Would you like more detail on when quantization is recommended?
Alright then, can I run the unquantized version of Gemma3-27B with two 5090s?
Yes, you can run the unquantized version of Gemma3-27B on two NVIDIA 5090 GPUs using vLLM by leveraging tensor parallelism. Set --tensor-parallel-size 2 to split the model across both GPUs; this is the recommended approach when a model does not fit on a single GPU (source, source).
Make sure your combined VRAM is sufficient (unquantized Gemma3-27B typically requires more than 48GB in total), and set the environment variable CUDA_VISIBLE_DEVICES=0,1 to select the GPUs. Example command:
CUDA_VISIBLE_DEVICES=0,1 vllm serve google/gemma-3-27b-it --tensor-parallel-size 2
Would you like more details on configuration or memory requirements?
Yes, quantization affects concurrency, especially if that particular quantization method is not natively supported by your GPU. You can probably play around with vLLM parameters like this:
vllm serve provider/modelname --max-model-len 4096 --gpu-memory-utilization 0.9 --max-num-seqs 8
- --max-model-len 4096: Limits the context window to 4K tokens
- --gpu-memory-utilization 0.9: Uses 90% of available GPU memory
- --max-num-seqs 8: Limits concurrent sequences for stable performance
If you have multiple GPUs, you can also use tensor parallelism: Parallelism and Scaling - vLLM
If nothing else works, quantization is probably the only way to run this 27B model on a GPU with 32GB VRAM.
Also, GGUF quantized models are primarily suitable for serving from CPU.
You can check out the int4 quantized model: gaunernst/gemma-3-27b-it-int4-awq · Hugging Face
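Putting that together, a single-GPU launch for the int4 AWQ build might look like this (a sketch only; the flag values are the illustrative ones from above, not tuned recommendations):
# Serve the int4 AWQ build of Gemma-3 27B on a single 32GB GPU (illustrative values)
vllm serve gaunernst/gemma-3-27b-it-int4-awq \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.9 \
  --max-num-seqs 8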
Your summary is correct: quantization is often necessary to run large models like Gemma-3 27B on a 32GB GPU, as unquantized versions typically exceed this memory limit. Adjusting parameters like --max-model-len, --gpu-memory-utilization, and --max-num-seqs can help, but may not be enough for unquantized 27B models (source). Tensor parallelism can split the model across multiple GPUs if available (source).
GGUF models are mainly optimized for CPU inference and are not as performant on GPU with vLLM. For GPU serving on 32GB VRAM, int4 quantized models like AWQ or GPTQ are the most practical choice (source).
Would you like more detail on parameter tuning or quantized model selection?