Hi! Google just announced a way to speed up Gemma 4 with Multi-Token Prediction, which introduces a new set of “drafter” models. Is it possible to set this up with vLLM? vLLM’s docs cover MTP but don’t mention Gemma 4, and its Gemma 4 guide doesn’t mention MTP. I saw SpeculativeConfig, which looks relevant, but I’m not sure how to configure it.
Any advice would be much appreciated! 
Yes, you can use Multi-Token Prediction (MTP) speculative decoding with Gemma 4 models in vLLM. Recent vLLM releases added MTP support for Gemma 4, covering both the dense and MoE variants. To enable it, pass --speculative-config to your vllm serve command with "method": "mtp" and a num_speculative_tokens value (e.g., 1). No separate drafter model is needed; Gemma 4’s native MTP layers are used directly. Example:
vllm serve google/gemma-4-31B-it \
--tensor-parallel-size 2 \
--speculative-config '{"method":"mtp","num_speculative_tokens":1}'
This setup is confirmed in the Gemma 4 Usage Guide and GLM-4.X LLM Usage Guide, which both recommend num_speculative_tokens=1 for best throughput. For more details on the config schema, see the speculative decoding docs.
Would you like more detail on advanced configuration or performance tuning?
Looks like the correct approach as per the linked docs is:
vllm serve google/gemma-4-31B-it \
--tensor-parallel-size 2 \
--max-model-len 8192 \
--speculative-config '{"model": "gg-hf-am/gemma-4-31B-it-assistant", "num_speculative_tokens": 4}'
I’m not sure why the config mentions ‘gg-hf-am/’ instead of ‘google/’. I will try ‘google/gemma-4-31B-it-assistant’ and see if it works.
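In case it helps anyone, the same setup should also be expressible through vLLM’s offline Python API. An untested sketch mirroring the command above (the drafter repo name is exactly the part I’m unsure about):

from vllm import LLM, SamplingParams

# Draft-model speculative decoding: the smaller assistant model proposes
# num_speculative_tokens tokens per step and the main model verifies them.
llm = LLM(
    model="google/gemma-4-31B-it",
    tensor_parallel_size=2,
    max_model_len=8192,
    speculative_config={
        "model": "gg-hf-am/gemma-4-31B-it-assistant",
        "num_speculative_tokens": 4,
    },
)

outputs = llm.generate(["The capital of France is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)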