Hello everyone,
I just got access to multiple T4 GPUs to test Gemma 3 12B for my local users. Does anyone have experience running Gemma 3 on this card?
I have already checked the supported hardware and models, as well as some GitHub issues where people were unsuccessful in running it. GGUF support for Gemma 3 is also still unavailable. But maybe someone has a trick that could help.
Thanks.
Running Gemma 3 12B on T4 GPUs is challenging due to high VRAM requirements and numerical stability issues with float16 (the only low-precision float type the T4 supports). Users report that Gemma 3 models often generate empty outputs or fail to run on T4: Gemma 3 is unstable in float16 and needs bfloat16 or float32 for reliable inference, neither of which works on a T4 (bfloat16 is unsupported in hardware, and float32 hits out-of-memory errors since memory usage doubles compared to float16). There are no confirmed workarounds to make Gemma 3 12B run reliably on T4 GPUs, and GGUF support for Gemma 3 is still experimental and not fully available in vLLM as of now. See discussions in vllm/issues/15097, vllm/issues/16489, and vllm/pull/14766.
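A quick back-of-envelope calculation shows why float32 is a non-starter on a 16 GB T4 (this counts weights only; the KV cache and activations need additional headroom on top):

```python
# Rough VRAM estimate for the weights of a 12B-parameter model.
# Excludes KV cache and activation memory, so real usage is higher.
PARAMS = 12e9      # 12 billion parameters
T4_VRAM_GB = 16    # per-card memory on a T4

fp16_gb = PARAMS * 2 / 1e9  # 2 bytes per param in float16
fp32_gb = PARAMS * 4 / 1e9  # 4 bytes per param in float32

print(f"float16 weights: ~{fp16_gb:.0f} GB")   # ~24 GB, already over one T4
print(f"float32 weights: ~{fp32_gb:.0f} GB")   # ~48 GB
print(f"T4 VRAM: {T4_VRAM_GB} GB per card")
```

Even sharded across several T4s, float32 weights alone consume three cards' worth of memory before any KV cache is allocated.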
If you need to run Gemma3 12B, consider using GPUs with bfloat16 support (Ampere or newer) or use a smaller model variant. Would you like more detail on possible configuration tweaks or alternative models?
T4 is too old to run Gemma 3 unfortunately afaik.
Sliding window attention requires at least the Ampere generation of GPUs here.
Correct: the T4 (Turing architecture, compute capability 7.5) does not support bfloat16, which is required for stable Gemma 3 inference. Gemma 3's sliding window attention and numerical stability are only reliable on Ampere (compute capability 8.0+) or newer GPUs. On T4, float16 leads to empty or unstable outputs, and float32 exceeds memory limits for 12B models. See vllm/issues/15097 and vllm/issues/16489.
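The capability cutoff above can be checked programmatically. A minimal sketch (the helper function name is mine, not a vLLM API; with PyTorch installed you would feed it `torch.cuda.get_device_capability()`):

```python
def supports_bf16(major: int, minor: int) -> bool:
    """Hardware bfloat16 requires compute capability 8.0 (Ampere) or newer."""
    return (major, minor) >= (8, 0)

# T4 is Turing, compute capability 7.5
print(supports_bf16(7, 5))  # False
# A100 is Ampere, compute capability 8.0
print(supports_bf16(8, 0))  # True
```

On a live machine, `major, minor = torch.cuda.get_device_capability()` gives the tuple to pass in.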
Would you like more technical details or suggestions for alternative models?
Thank you. I will try Llama 3.1 or another older model that may fit this hardware.
That’s a good approach. Llama 3.1 8B and similar models are known to run successfully on T4 GPUs in float16 (half precision), as confirmed by multiple users (vllm/issues/11569, vllm/issues/1400). Just set dtype="half" and tune max_model_len and gpu_memory_utilization to avoid OOM errors.
Would you like a sample configuration or more tuning tips for T4?
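For reference, a T4-friendly vLLM setup might look like the sketch below. This is untested configuration guidance, not a verified recipe: the model name, context length, and memory fraction are assumptions to tune for your workload.

```python
from vllm import LLM, SamplingParams

# Sketch of a T4-oriented configuration (values are assumptions, tune as needed).
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    dtype="half",                 # T4 has no bfloat16; float16 is the option
    max_model_len=4096,           # shorter context keeps the KV cache small
    gpu_memory_utilization=0.90,  # leave some headroom to avoid OOM
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Hello!"], params)
print(outputs[0].outputs[0].text)
```

Lowering max_model_len is usually the most effective lever on a 16 GB card, since the KV cache scales linearly with context length.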