How to use Gemma 4 with the new MTP drafters?

gaurangblaze · May 6, 2026, 4:20am

Hi! Google just announced a way to speed up Gemma 4 with Multi-Token Prediction (MTP), which introduces a new set of “drafter” models. Is it possible to set this up with vLLM? vLLM has docs on MTP, but they don’t mention Gemma 4, and vLLM’s Gemma 4 guide doesn’t mention MTP. I saw SpeculativeConfig, which looks relevant, but I’m not sure how to configure it.

Any advice would be much appreciated!

RunLLM · May 6, 2026, 4:21am

Yes, you can use Multi-Token Prediction (MTP) speculative decoding with Gemma 4 models in vLLM. Recent updates to vLLM added support for MTP on Gemma 4, including both dense and MoE variants. To enable it, use the --speculative-config flag with method: mtp and set num_speculative_tokens (e.g., 1) in your vllm serve command. No separate drafter model is needed—Gemma 4’s native MTP layers are used directly. Example:

vllm serve google/gemma-4-31B-it \
  --tensor-parallel-size 2 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'

This setup is confirmed in the Gemma 4 Usage Guide and GLM-4.X LLM Usage Guide, which both recommend num_speculative_tokens=1 for best throughput. For more details on the config schema, see the speculative decoding docs.

Would you like more detail on advanced configuration or performance tuning?

gaurangblaze · May 6, 2026, 4:51am

Per the linked docs, the correct approach appears to be:

vllm serve google/gemma-4-31B-it \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --speculative-config '{"model": "gg-hf-am/gemma-4-31B-it-assistant", "num_speculative_tokens": 4}'

I’m not sure why the config mentions ‘gg-hf-am/’ instead of ‘google/’. I will try ‘google/gemma-4-E4B-it-assistant’ and see if it works.
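
For anyone comparing the two drafter names, here is a minimal dry-run sketch that builds the --speculative-config JSON from variables and only prints the resulting command, so the drafter can be swapped without retyping the quoting. The variable names (TARGET_MODEL, DRAFT_MODEL, NUM_SPEC_TOKENS, SPEC_CONFIG, CMD_PREVIEW) are made up for illustration, and the model names are copied from this thread; whether each repository exists or pairs correctly with the 31B target is not verified here.

```shell
#!/bin/sh
# Dry run: prints a vllm serve command for a chosen drafter instead of launching it.
# Model names come from the thread above and are NOT verified to exist.
TARGET_MODEL="google/gemma-4-31B-it"
DRAFT_MODEL="${DRAFT_MODEL:-gg-hf-am/gemma-4-31B-it-assistant}"
NUM_SPEC_TOKENS="${NUM_SPEC_TOKENS:-4}"

# Build the JSON once so the quoting lives in a single place.
SPEC_CONFIG=$(printf '{"model": "%s", "num_speculative_tokens": %d}' \
  "$DRAFT_MODEL" "$NUM_SPEC_TOKENS")

# Assemble a copy-pasteable command line; the single quotes protect the JSON
# when the printed line is pasted into a shell.
CMD_PREVIEW="vllm serve $TARGET_MODEL --tensor-parallel-size 2 --max-model-len 8192 --speculative-config '$SPEC_CONFIG'"

echo "$CMD_PREVIEW"
```

Saved under a hypothetical name such as preview.sh, it can be pointed at the other drafter with DRAFT_MODEL=google/gemma-4-E4B-it-assistant sh preview.sh; the printed line is what would then be run by hand.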
