Hi! Google just announced a way to speed up Gemma 4 with Multi-Token Prediction, which introduces a new set of “drafter” models. Is it possible to set this up with vLLM? vLLM has docs on MTP, but they don’t mention Gemma 4. On the other hand, vLLM’s Gemma 4 guide doesn’t mention MTP. I saw SpeculativeConfig, which looks relevant, but I am not sure how to configure it.
Yes, you can use Multi-Token Prediction (MTP) speculative decoding with Gemma 4 models in vLLM. Recent updates to vLLM added support for MTP on Gemma 4, including both dense and MoE variants. To enable it, use the --speculative-config flag with method: mtp and set num_speculative_tokens (e.g., 1) in your vllm serve command. No separate drafter model is needed—Gemma 4’s native MTP layers are used directly. Example:
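A minimal sketch of the configuration described above, written as a small Python helper that assembles the `vllm serve` command. The `--speculative-config` flag and the `method` and `num_speculative_tokens` keys come from the reply. The model ID is a hypothetical placeholder, and whether the command is accepted depends on the installed vLLM build.

```python
import json
import shlex

# Hypothetical placeholder: substitute the Gemma 4 checkpoint you actually serve.
MODEL_ID = "<your-gemma-4-model-id>"


def build_serve_command(model_id: str, num_speculative_tokens: int = 1) -> list[str]:
    """Return the argv for `vllm serve` with MTP speculative decoding requested."""
    # --speculative-config takes a JSON string; the keys mirror the reply above.
    speculative_config = {
        "method": "mtp",
        "num_speculative_tokens": num_speculative_tokens,
    }
    return [
        "vllm", "serve", model_id,
        "--speculative-config", json.dumps(speculative_config),
    ]


command = build_serve_command(MODEL_ID)
# Print a shell-quoted command line that can be pasted into a terminal.
print(shlex.join(command))
```

Building the JSON with `json.dumps` and quoting with `shlex.join` avoids hand-escaping the braces and quotes on the command line.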
What version of vLLM supports this feature? On 0.20.1 (Docker, locally built) I get the error:
Value error, The checkpoint you are trying to load has model type gemma4_assistant but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
Gemma 4 MTP support is not included in vLLM 0.20.2 or earlier stable releases. The mtp method for Gemma 4 was added after 0.20.2 and is only available in vLLM nightly builds or in a source build that includes PR #41745. If you use 0.20.2, you will see errors like “Unsupported speculative method: ‘mtp’” or initialization failures with Gemma 4 assistant models. To use MTP with Gemma 4, install the latest vLLM nightly or build from source at a commit made after the relevant PRs were merged.
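As a quick way to see which side of that cutoff an environment is on, here is a rough sketch that classifies the installed vLLM version. The 0.20.2 cutoff is taken from the statement above; the helper names are hypothetical, and development or locally built versions are reported as undetermined because their version strings do not say whether the relevant PRs are included.

```python
import re
from importlib import metadata

# Cutoff taken from the reply above: stable releases up to 0.20.2 lack Gemma 4 MTP.
LAST_RELEASE_WITHOUT_GEMMA4_MTP = (0, 20, 2)


def release_tuple(version: str) -> tuple[int, ...]:
    """Extract the leading numeric components, e.g. '0.20.1' -> (0, 20, 1)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def classify(version: str) -> str:
    """Return 'dev-build', 'too-old', or 'newer-release' for a version string."""
    # Nightly and local builds carry a dev or local segment; the version alone
    # cannot tell whether they contain the relevant PRs.
    if "dev" in version or "+" in version:
        return "dev-build"
    if release_tuple(version) <= LAST_RELEASE_WITHOUT_GEMMA4_MTP:
        return "too-old"
    return "newer-release"


try:
    installed = metadata.version("vllm")
except metadata.PackageNotFoundError:
    installed = None

if installed is None:
    print("vLLM is not installed in this environment")
else:
    print(f"vLLM {installed}: {classify(installed)}")
```

A result of `newer-release` only means the version is past the cutoff named above; it does not by itself confirm that Gemma 4 MTP works in that release.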
Would you like instructions for installing the nightly build or building from source?
Despite the documentation, Gemma 4 MTP support is not present in vLLM 0.20.2. The relevant PRs (#41745 and others) were merged after the 0.20.2 release branch was cut, so the feature is only available in nightly builds or on the main branch after those merges. This is confirmed by multiple user reports and by maintainers in this issue.
Would you like step-by-step instructions for installing the nightly build?