Hello everyone,
I just got access to multiple T4 GPUs to test Gemma 3 12B for my local users. Does anyone have experience running Gemma 3 on this card?
I have already checked the supported hardware and models, as well as some GitHub issues where people were unsuccessful in running it. GGUF support for Gemma 3 is also still unavailable. But maybe someone has a trick that could help.
Thanks.
Running Gemma 3 12B on T4 GPUs is challenging due to high VRAM requirements and numerical stability issues with float16 (the only low-precision float type the T4 supports). Users report that Gemma 3 models often generate empty outputs or fail to run on T4: Gemma 3 is unstable in float16 and needs bfloat16 or float32 for reliable inference, neither of which works on a T4 (bfloat16 is unsupported in hardware, and float32 hits out-of-memory errors since memory usage doubles compared to float16). There are no confirmed workarounds to make Gemma 3 12B run reliably on T4 GPUs, and GGUF support for Gemma 3 is still experimental and not fully available in vLLM as of now. See discussions in vllm/issues/15097, vllm/issues/16489, and vllm/pull/14766.
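A quick back-of-envelope calculation shows why float32 is a non-starter on a 16 GB T4 (this counts weights only; the KV cache and activations need additional headroom on top):

```python
# Rough VRAM estimate for the weights of a 12B-parameter model.
# Excludes KV cache and activation memory, so real usage is higher.
PARAMS = 12e9      # 12 billion parameters
T4_VRAM_GB = 16    # per-card memory on a T4

fp16_gb = PARAMS * 2 / 1e9  # 2 bytes per param in float16
fp32_gb = PARAMS * 4 / 1e9  # 4 bytes per param in float32

print(f"float16 weights: ~{fp16_gb:.0f} GB")   # ~24 GB, already over one T4
print(f"float32 weights: ~{fp32_gb:.0f} GB")   # ~48 GB
print(f"T4 VRAM: {T4_VRAM_GB} GB per card")
```

Even sharded across several T4s, float32 weights alone consume three cards' worth of memory before any KV cache is allocated.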
If you need to run Gemma3 12B, consider using GPUs with bfloat16 support (Ampere or newer) or use a smaller model variant. Would you like more detail on possible configuration tweaks or alternative models?
T4 is too old to run Gemma 3 unfortunately afaik.
Sliding window attention requires at least the Ampere generation of GPUs here.
Correct: the T4 (Turing architecture, compute capability 7.5) does not support bfloat16, which is required for stable Gemma 3 inference. Gemma 3's sliding window attention and numerical stability are only reliable on Ampere (compute capability 8.0+) or newer GPUs. On T4, float16 leads to empty or unstable outputs, and float32 exceeds memory limits for 12B models. See vllm/issues/15097 and vllm/issues/16489.
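The capability cutoff above can be checked programmatically. A minimal sketch (the helper function name is mine, not a vLLM API; with PyTorch installed you would feed it `torch.cuda.get_device_capability()`):

```python
def supports_bf16(major: int, minor: int) -> bool:
    """Hardware bfloat16 requires compute capability 8.0 (Ampere) or newer."""
    return (major, minor) >= (8, 0)

# T4 is Turing, compute capability 7.5
print(supports_bf16(7, 5))  # False
# A100 is Ampere, compute capability 8.0
print(supports_bf16(8, 0))  # True
```

On a live machine, `major, minor = torch.cuda.get_device_capability()` gives the tuple to pass in.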
Would you like more technical details or suggestions for alternative models?
Thank you. I will try Llama 3.1 or another older model that may fit this hardware.
That’s a good approach. Llama 3.1 8B and similar models are known to run successfully on T4 GPUs in float16 (half precision), as confirmed by multiple users (vllm/issues/11569, vllm/issues/1400). Just set dtype="half" and tune max_model_len and gpu_memory_utilization to avoid OOM errors.
Would you like a sample configuration or more tuning tips for T4?
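For reference, a T4-friendly vLLM setup might look like the sketch below. This is untested configuration guidance, not a verified recipe: the model name, context length, and memory fraction are assumptions to tune for your workload.

```python
from vllm import LLM, SamplingParams

# Sketch of a T4-oriented configuration (values are assumptions, tune as needed).
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    dtype="half",                 # T4 has no bfloat16; float16 is the option
    max_model_len=4096,           # shorter context keeps the KV cache small
    gpu_memory_utilization=0.90,  # leave some headroom to avoid OOM
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Hello!"], params)
print(outputs[0].outputs[0].text)
```

Lowering max_model_len is usually the most effective lever on a 16 GB card, since the KV cache scales linearly with context length.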