How to use Gemma 4 with the new MTP drafters?

Hi! Google just announced a way to speed up Gemma 4 with Multi-Token Prediction (MTP), which introduces a new set of “drafter” models. Is it possible to set this up with vLLM? vLLM has docs on MTP, but they don’t mention Gemma 4, and vLLM’s Gemma 4 guide doesn’t mention MTP. I saw SpeculativeConfig, which looks relevant, but I am not sure how to configure it.

Any advice would be much appreciated! :blush:

Yes, you can use Multi-Token Prediction (MTP) speculative decoding with Gemma 4 models in vLLM. Recent updates to vLLM added support for MTP on Gemma 4, including both dense and MoE variants. To enable it, use the --speculative-config flag with method: mtp and set num_speculative_tokens (e.g., 1) in your vllm serve command. No separate drafter model is needed—Gemma 4’s native MTP layers are used directly. Example:

vllm serve google/gemma-4-31B-it \
  --tensor-parallel-size 2 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'

This setup is confirmed in the Gemma 4 Usage Guide and GLM-4.X LLM Usage Guide, which both recommend num_speculative_tokens=1 for best throughput. For more details on the config schema, see the speculative decoding docs.

Would you like more detail on advanced configuration or performance tuning?

Hint: Mention @RunLLM in the post for followups.

Looks like the correct approach as per the linked docs is:

vllm serve google/gemma-4-31B-it \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --speculative-config '{"model": "gg-hf-am/gemma-4-31B-it-assistant", "num_speculative_tokens": 4}'

I’m not sure why the config mentions ‘gg-hf-am/’ instead of ‘google/’. I will try ‘google/gemma-4-E4B-it-assistant’ and see if it works.

It was my bad, @gaurangblaze! I pushed the docs updates for MTP.

Where it says `gg-hf-qat/`, it should say `google/`. I will fix that ASAP!

Which version of vLLM supports this feature? On 0.20.1 (Docker, locally built) I get this error:

Value error, The checkpoint you are trying to load has model type gemma4_assistant but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.

Thanks!

Submitted a fix, @gaurangblaze: "correct HuggingFace repo prefix in Gemma 4 MTP speculative decoding docs" (Pull Request #444 in vllm-project/recipes).

@Huehnerbrust, you can find the directions, including package versions, in the Gemma 4 Usage Guide (vLLM Recipes). From the error, it looks like you are missing a Transformers package update.
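The "Transformers does not recognize this architecture" error above can be checked without starting vLLM. The following is a minimal sketch, assuming the model type string `gemma4_assistant` quoted in the error message is the key Transformers registers; the helper names `installed_version` and `recognizes_model_type` are hypothetical and only for illustration.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def recognizes_model_type(model_type: str, known_types) -> bool:
    """True if `model_type` is among the model types the library has registered."""
    return model_type in known_types


if __name__ == "__main__":
    print("transformers version:", installed_version("transformers"))
    try:
        # CONFIG_MAPPING maps model_type strings to their config classes.
        from transformers import CONFIG_MAPPING
    except ImportError:
        print("transformers is not installed in this environment")
    else:
        # Assumption: the key matches the model type named in the error message.
        ok = recognizes_model_type("gemma4_assistant", CONFIG_MAPPING)
        print("gemma4_assistant recognized:", ok)
```

If the last line prints `False`, the installed Transformers predates the drafter architecture, which would match the error reported above; the usage guide lists the package versions to install.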

Hi, is this definitely released? Tried using 0.20.2 but got a response back stating MTP wasn’t recognised. Tried with binary and docker image @RunLLM

Tried 26B model as well as 31B

vllm serve google/gemma-4-26B-A4B-it \
  --tensor-parallel-size 4 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --speculative-config '{"method":"mtp","model":"google/gemma-4-26B-A4B-it-assistant","num_speculative_tokens":4}'
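The JSON passed to `--speculative-config` is easy to break through shell quoting, particularly when a command is copied from a rendered web page that may turn straight quotes and double hyphens into typographic characters. Below is a minimal sketch that builds the argument list from a Python dict and checks a command string for such characters. The helper names `build_serve_command` and `find_typographic_chars` are hypothetical, the values are copied from the command in this post, and whether vLLM accepts this particular config depends on the installed version.

```python
import json
import shlex


def build_serve_command(model, spec_config, extra_args=()):
    """Return a `vllm serve` argument list with the speculative config as compact JSON."""
    spec_json = json.dumps(spec_config, separators=(",", ":"))
    return ["vllm", "serve", model, *extra_args, "--speculative-config", spec_json]


def find_typographic_chars(command):
    """Return the sorted typographic quotes and dashes found in `command`."""
    suspicious = "\u2018\u2019\u201c\u201d\u2013\u2014"
    return sorted(set(ch for ch in command if ch in suspicious))


# Values taken from the command in this post, not verified against any vLLM release.
argv = build_serve_command(
    "google/gemma-4-26B-A4B-it",
    {
        "method": "mtp",
        "model": "google/gemma-4-26B-A4B-it-assistant",
        "num_speculative_tokens": 4,
    },
    extra_args=["--tensor-parallel-size", "4", "--max-model-len", "8192"],
)

# shlex.join quotes each argument so the line can be pasted into a POSIX shell.
print(shlex.join(argv))
```

Passing `argv` to `subprocess.run` directly would avoid shell quoting altogether; printing the joined string is only a convenience for copying the command.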

Gemma 4 MTP support is not included in vLLM 0.20.2 or earlier stable releases. The mtp method for Gemma 4 was added after 0.20.2 and is only available in vLLM nightly builds or in source builds that include PR #41745. On 0.20.2, you will see errors like “Unsupported speculative method: ‘mtp’” or initialization failures with Gemma 4 assistant models. To use MTP with Gemma 4, install the latest vLLM nightly or build from source at a commit that includes the relevant PRs.
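A quick way to rule out an environment that is too old is to compare the installed vLLM version against 0.20.2. This is a rough sketch, assuming (as stated in this thread) that 0.20.2 is the last release without the feature; the function names are hypothetical, and a version number above 0.20.2 is only a necessary condition, since a nightly or source build may still predate the relevant merges.

```python
import re
from importlib import metadata

# Assumption from this thread: 0.20.2 is the last release WITHOUT Gemma 4 MTP.
LAST_RELEASE_WITHOUT_MTP = (0, 20, 2)


def release_tuple(version):
    """Extract the leading numeric release, e.g. '0.20.3.dev7+gabc' -> (0, 20, 3)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part or 0) for part in match.groups())


def may_include_gemma4_mtp(version):
    """Heuristic: only release numbers above 0.20.2 can contain the feature."""
    return release_tuple(version) > LAST_RELEASE_WITHOUT_MTP


if __name__ == "__main__":
    try:
        installed = metadata.version("vllm")
    except metadata.PackageNotFoundError:
        print("vllm is not installed in this environment")
    else:
        print(installed, "may include Gemma 4 MTP:", may_include_gemma4_mtp(installed))
```

A `False` result means the environment cannot have the feature according to this thread; a `True` result still needs to be confirmed by actually starting the server.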

Would you like instructions for installing the nightly build or building from source?

Despite the documentation, Gemma 4 MTP support is not present in vLLM 0.20.2. The relevant PRs (#41745 and others) were merged after the 0.20.2 release branch was cut, so the feature is only available in nightly builds or main branch after those merges. This is confirmed by multiple user reports and maintainers in this issue.

Would you like step-by-step instructions for installing the nightly build?
