Gemma 4 26B + vLLM + FP8 on H100: TTFT good, E2E latency bad

adijindal2796 · May 23, 2026, 6:19am

I recently fine-tuned a Gemma 4 26B model, and I’m seeing surprisingly high end-to-end latency even though the effective inference footprint is much smaller (~4B-ish behavior during serving).

Current setup:

Model: Gemma 4 26B (fine-tuned)
Engine: vLLM
Quantization: FP8
Hardware: H100

Observed latency:

TTFT: ~100–300 ms
E2E latency: ~3–5 seconds

The TTFT seems reasonable, but the overall generation latency feels disproportionately high for the effective serving size.

I already experimented with vLLM’s n-gram speculative decoding, but honestly didn’t see meaningful gains.

Now I’m considering more serious speculative decoding approaches:

EAGLE / Medusa-style methods
Draft model based speculative decoding
Possibly training a smaller Gemma draft model

Curious to hear from others who’ve worked with Gemma 4 or large distilled/fine-tuned models:

Is this kind of latency expected?
What actually moved the needle for you?
Any bottlenecks I should investigate first before going deeper into speculative decoding?

Would love to hear experiences, benchmarks, or even horror stories :))

RunLLM · May 23, 2026, 6:20am

High end-to-end latency with Gemma 4 26B on vLLM, even with FP8 quantization on an H100, is not unusual given the model’s architecture and the current state of vLLM’s optimizations. A TTFT of 100–300 ms is typical, but an E2E latency of 3–5 seconds is on the high side for “4B-ish” effective compute. A likely cause is that Gemma 4’s heterogeneous attention heads force vLLM to use slower Triton attention kernels instead of FlashAttention, which is a known bottleneck (see issue). N-gram speculative decoding provides only modest speedups and is not effective for high-throughput or large-batch scenarios (speculative decoding docs, ngram performance discussion).
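Before blaming any one kernel, it may help to work out what the reported numbers imply per token. This is a minimal sketch using the usual decomposition E2E ≈ TTFT + (output tokens − 1) × time per output token. The thread does not state the output length, so the token counts below are hypothetical, and `implied_tpot_ms` is an illustrative helper, not a vLLM function.

```python
def implied_tpot_ms(e2e_ms: float, ttft_ms: float, output_tokens: int) -> float:
    """Average time per output token after the first one, in milliseconds.

    Uses E2E ~= TTFT + (output_tokens - 1) * TPOT, so everything that is
    not TTFT is attributed to the decode phase.
    """
    if output_tokens < 2:
        raise ValueError("need at least 2 output tokens to measure decode time")
    return (e2e_ms - ttft_ms) / (output_tokens - 1)


if __name__ == "__main__":
    # Mid-range values from the post: TTFT ~200 ms, E2E ~4000 ms.
    # Output lengths are NOT given in the thread; these are assumptions.
    for n_tokens in (128, 256, 512):
        tpot = implied_tpot_ms(e2e_ms=4000, ttft_ms=200, output_tokens=n_tokens)
        print(f"{n_tokens:4d} tokens -> {tpot:5.1f} ms/token "
              f"({1000 / tpot:5.0f} tok/s)")
```

If the implied per-token time looks reasonable for your real output length, the 3–5 s is mostly a function of how many tokens are generated. If it looks high, decode-side bottlenecks such as the attention backend are the place to dig.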

To reduce latency, the most impactful approaches are model-based speculative decoding methods like MTP (Multi-Token Prediction) with Gemma 4 assistant models, or EAGLE/Medusa-style methods. Gemma 4 supports MTP speculative decoding with assistant checkpoints, which can yield 1.5–3x speedup in generation throughput and lower E2E latency (Gemma 4 MTP guide, MTP docs). Before deeper speculative decoding work, check if you’re hitting the Triton attention fallback, and benchmark with MTP using the official assistant model for your Gemma 4 variant. Also, ensure chunked prefill and prefix caching are tuned for your workload (performance tuning). If you need further detail or code examples, let me know.
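To get a feel for why n-gram drafting may not help while a model-based drafter can, here is a rough sketch of the standard idealized speculative decoding estimate. It assumes each draft token is accepted independently with probability `alpha`, that `k` tokens are drafted per step, and that one draft step costs `draft_cost` relative to one target-model step. The acceptance rates and costs in the example are made-up illustration values, not measurements for Gemma 4, and the function names are hypothetical.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model verification step.

    Idealized model: each of the k draft tokens is accepted independently
    with probability alpha; the target always contributes one token.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Idealized wall-clock speedup over plain autoregressive decoding.

    draft_cost is the cost of one draft step as a fraction of one target
    step (close to 0 for n-gram lookup, larger for a draft model).
    """
    return expected_tokens_per_step(alpha, k) / (k * draft_cost + 1.0)


if __name__ == "__main__":
    # Hypothetical scenarios, for illustration only.
    scenarios = [
        ("n-gram, low acceptance", 0.2, 4, 0.0),
        ("draft model, high acceptance", 0.8, 4, 0.1),
    ]
    for name, alpha, k, cost in scenarios:
        print(f"{name}: ~{estimated_speedup(alpha, k, cost):.2f}x")
```

The point of the sketch is that the acceptance rate dominates: a nearly free drafter with low acceptance gains little, which would be consistent with the n-gram result described above. Measuring the actual acceptance rate on your own prompts is the number to look at before training a custom draft model.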
