How to use Gemma 4 with the new MTP drafters?

Hi! Google just announced a way to speed up Gemma 4 with Multi-Token Prediction (MTP), which introduces a new "drafter" set of models. Is it possible to set this up with vLLM? vLLM's docs cover MTP but don't mention Gemma 4, and conversely vLLM's Gemma 4 guide doesn't mention MTP. I saw SpeculativeConfig, which looks relevant, but I'm not sure how to configure it.

Any advice would be much appreciated! :blush:

Yes, you can use Multi-Token Prediction (MTP) speculative decoding with Gemma 4 models in vLLM. Recent vLLM updates added MTP support for Gemma 4, covering both the dense and MoE variants. To enable it, pass the --speculative-config flag with "method": "mtp" and set num_speculative_tokens (e.g., 1) in your vllm serve command. No separate drafter model is needed; Gemma 4's native MTP layers are used directly. Example:

vllm serve google/gemma-4-31B-it \
  --tensor-parallel-size 2 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'
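One common failure mode with --speculative-config is passing malformed JSON through the shell. A quick way to sanity-check the config string before launching (a minimal sketch; the "method" and "num_speculative_tokens" field names are taken from the command above):

```python
import json

# Speculative decoding settings for Gemma 4's native MTP layers.
# Field names follow the vllm serve command above.
spec_config = {"method": "mtp", "num_speculative_tokens": 1}

# This is exactly the string to wrap in single quotes after --speculative-config.
arg = json.dumps(spec_config)
print(arg)  # {"method": "mtp", "num_speculative_tokens": 1}
```

If json.loads(arg) round-trips without error, the shell argument is at least syntactically valid JSON.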

This setup is confirmed in the Gemma 4 Usage Guide and the GLM-4.X LLM Usage Guide, both of which recommend num_speculative_tokens=1 for best throughput. For the full config schema, see the speculative decoding docs.

Would you like more detail on advanced configuration or performance tuning?


Looks like the correct approach, per the linked docs, is:

vllm serve google/gemma-4-31B-it \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --speculative-config '{"model": "gg-hf-am/gemma-4-31B-it-assistant", "num_speculative_tokens": 4}'
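Since this variant embeds a Hugging Face repo id inside the JSON, the quoting is easy to get wrong when assembling the command programmatically. A small sketch of building the same command with the JSON argument quoted safely (the repo ids are copied verbatim from the command above; whether they resolve is a separate question):

```python
import json
import shlex

# Draft-model speculative config from the linked docs; the "gg-hf-am/" repo id
# is taken as-is from the command above, not verified.
spec = {"model": "gg-hf-am/gemma-4-31B-it-assistant", "num_speculative_tokens": 4}

cmd = [
    "vllm", "serve", "google/gemma-4-31B-it",
    "--tensor-parallel-size", "2",
    "--max-model-len", "8192",
    "--speculative-config", json.dumps(spec),
]

# shlex.join adds the single quotes around the JSON argument for you.
print(shlex.join(cmd))
```

Passing json.dumps(spec) as a single argv element (or letting shlex.join quote it) avoids the shell splitting the JSON on spaces.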

I’m not sure why the config mentions ‘gg-hf-am/’ instead of ‘google/’. I will try ‘google/gemma-4-E4B-it-assistant’ and see if it works.