Gemma 4 26B + vLLM + FP8 on H100: TTFT good, E2E latency bad

I recently fine-tuned a Gemma 4 26B model, and I’m seeing surprisingly high end-to-end latency even though the effective inference footprint is much smaller (it behaves like a ~4B model during serving).

Current setup:

  • Model: Gemma 4 26B (fine-tuned)

  • Engine: vLLM

  • Quantization: FP8

  • Hardware: H100

Observed latency:

  • TTFT: ~100–300 ms

  • E2E latency: ~3–5 seconds

The TTFT seems reasonable, but the overall generation latency feels disproportionately high for the effective serving size.

I already experimented with vLLM’s n-gram speculative decoding, but honestly didn’t see meaningful gains.

Now I’m considering more serious speculative decoding approaches:

  • EAGLE / Medusa-style methods

  • Draft-model-based speculative decoding

  • Possibly training a smaller Gemma draft model

Curious to hear from others who’ve worked with Gemma 4 or large distilled/fine-tuned models:

  • Is this kind of latency expected?

  • What actually moved the needle for you?

  • Any bottlenecks I should investigate first before going deeper into speculative decoding?

Would love to hear experiences, benchmarks, or even horror stories :))

High end-to-end latency with Gemma 4 26B on vLLM can be expected even with FP8 quantization on an H100, because of the model’s architecture and the current state of vLLM’s optimizations. A TTFT of 100–300 ms is typical, but an E2E latency of 3–5 seconds is on the high side for “4B-ish” effective compute. The likely cause is that Gemma 4’s heterogeneous attention heads force vLLM to use slower Triton attention kernels instead of FlashAttention, which is a known bottleneck (see issue). N-gram speculative decoding provides only modest speedups and is not effective for high-throughput or large-batch scenarios (speculative decoding docs, ngram performance discussion).
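Whether 3–5 seconds is actually slow depends on how many tokens each response generates, so it helps to split E2E latency into its prefill and decode parts. Below is a minimal sketch, assuming E2E ≈ TTFT + (output tokens − 1) × time per output token. The helper name `decode_breakdown` and the 400-token output length are illustrative assumptions (the post does not state an output length); this is not a vLLM API.

```python
def decode_breakdown(ttft_s: float, e2e_s: float, output_tokens: int) -> dict:
    """Split end-to-end latency into prefill (TTFT) and decode phases.

    Assumes the first token arrives at TTFT and the remaining
    (output_tokens - 1) tokens are produced at a roughly constant rate.
    """
    if output_tokens < 2:
        raise ValueError("need at least 2 output tokens to estimate decode speed")
    if e2e_s <= ttft_s:
        raise ValueError("E2E latency must be larger than TTFT")

    decode_s = e2e_s - ttft_s                 # time spent after the first token
    tpot_s = decode_s / (output_tokens - 1)   # time per output token
    return {
        "decode_s": decode_s,
        "tpot_ms": tpot_s * 1000.0,
        "tokens_per_s": 1.0 / tpot_s,
    }


if __name__ == "__main__":
    # Hypothetical request: 200 ms TTFT, 4 s E2E, 400 generated tokens.
    stats = decode_breakdown(ttft_s=0.2, e2e_s=4.0, output_tokens=400)
    for name, value in stats.items():
        print(f"{name}: {value:.2f}")
```

If the per-token time computed this way is already low, the E2E number is mostly a function of response length, and shortening outputs will help more than changing kernels. If it is high, the decode path (attention backend, batch size, speculative decoding) is the place to look.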

To reduce latency, the most impactful approaches are model-based speculative decoding methods like MTP (Multi-Token Prediction) with Gemma 4 assistant models, or EAGLE/Medusa-style methods. Gemma 4 supports MTP speculative decoding with assistant checkpoints, which can yield 1.5–3x speedup in generation throughput and lower E2E latency (Gemma 4 MTP guide, MTP docs). Before deeper speculative decoding work, check if you’re hitting the Triton attention fallback, and benchmark with MTP using the official assistant model for your Gemma 4 variant. Also, ensure chunked prefill and prefix caching are tuned for your workload (performance tuning). If you need further detail or code examples, let me know.
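To get a feel for why n-gram drafting gave little gain while a model-based drafter might help, here is a rough back-of-the-envelope estimate. It is a sketch under simplifying assumptions, not a measurement: each draft token is accepted independently with the same probability, and verifying the drafted tokens costs about the same as one normal decode step (which is reasonable at small batch sizes and tends to break down at large ones). The function names and all example numbers are hypothetical.

```python
def expected_tokens_per_step(acceptance_rate: float, num_draft_tokens: int) -> float:
    """Expected tokens emitted per target-model verification step.

    With acceptance probability a and k drafted tokens, the expected count
    is 1 + a + a^2 + ... + a^k = (1 - a^(k+1)) / (1 - a).
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if num_draft_tokens < 0:
        raise ValueError("num_draft_tokens must be non-negative")
    if acceptance_rate == 1.0:
        return float(num_draft_tokens + 1)
    return (1.0 - acceptance_rate ** (num_draft_tokens + 1)) / (1.0 - acceptance_rate)


def estimated_speedup(acceptance_rate: float, num_draft_tokens: int,
                      draft_cost_ratio: float) -> float:
    """Estimated decode speedup over plain autoregressive decoding.

    draft_cost_ratio is the cost of producing one draft token relative to
    one target-model step (close to 0 for n-gram lookup, larger for a
    draft model).
    """
    if draft_cost_ratio < 0.0:
        raise ValueError("draft_cost_ratio must be non-negative")
    tokens = expected_tokens_per_step(acceptance_rate, num_draft_tokens)
    step_cost = 1.0 + num_draft_tokens * draft_cost_ratio
    return tokens / step_cost


if __name__ == "__main__":
    # Hypothetical: n-gram drafts rarely match free-form output.
    print(f"low acceptance:  {estimated_speedup(0.2, 5, 0.0):.2f}x")
    # Hypothetical: a well-matched, cheap model-based drafter.
    print(f"high acceptance: {estimated_speedup(0.8, 4, 0.05):.2f}x")
```

With these made-up inputs, a 20% acceptance rate yields only about 1.25x even when drafting is free, while an 80% acceptance rate with a cheap drafter yields roughly 2.8x. Measuring the real acceptance rate on your own traffic is therefore a useful first step before training a custom draft model.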
