Why position 0 is not needed by MTP for speculative decode

Why is position 0 not needed by MTP for spec-decode? The code below is from DeepSeekMultiTokenPredictorLayer in vLLM.

    def forward(
        self,
        input_ids: torch.Tensor,
        positions: torch.Tensor,
        previous_hidden_states: torch.Tensor,
        inputs_embeds: Optional[torch.Tensor] = None,
        spec_step_index: int = 0,
    ) -> torch.Tensor:
        if inputs_embeds is None:
            inputs_embeds = self.embed_tokens(input_ids)
        assert inputs_embeds is not None
        # masking inputs at position 0, as not needed by MTP
        inputs_embeds[positions == 0] = 0
        inputs_embeds = self.enorm(inputs_embeds)
        previous_hidden_states = self.hnorm(previous_hidden_states)

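        # eh_proj is a linear map from 2*hidden_size back to hidden_size, fusing the two inputs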
        hidden_states = self.eh_proj(
            torch.cat([inputs_embeds, previous_hidden_states], dim=-1))

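        # run the MTP module's own decoder layer over the fused representation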
        hidden_states, residual = self.mtp_block(positions=positions,
                                                 hidden_states=hidden_states,
                                                 residual=None)
        hidden_states = residual + hidden_states
        return hidden_states

Position 0 is masked (its embedding set to zero) because, by design, the first token’s embedding is never used for prediction in the MTP module. The MTP block drafts a token by fusing the current token’s embedding with the hidden state carried over from the previous forward pass: eh_proj projects the concatenation of enorm(inputs_embeds) and hnorm(previous_hidden_states). At the first position there is no “previous” hidden state to pair the embedding with, so that slot cannot produce a meaningful draft, and its embedding is zeroed out rather than left to contribute arbitrary values. This is consistent with the DeepSeek MTP and EAGLE designs, where the first token is excluded from the draft-model input to keep attention and position handling correct during speculative decoding, as the comment in DeepSeekMultiTokenPredictorLayer and the related discussion indicate (vllm#14647).
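
Here is a minimal sketch of that masking-plus-fusion step on toy tensors (assumptions: the RMSNorm layers enorm/hnorm are omitted, and the sizes and random weights are illustrative rather than taken from vLLM):

    import torch

    torch.manual_seed(0)
    hidden, seq_len = 8, 5  # toy sizes, purely illustrative

    # stand-ins for the real inputs
    inputs_embeds = torch.randn(seq_len, hidden)           # Emb(token_i) per position
    previous_hidden_states = torch.randn(seq_len, hidden)  # hidden states from the previous pass
    positions = torch.arange(seq_len)

    # the masking step from forward(): zero the embedding at position 0
    inputs_embeds[positions == 0] = 0

    # fuse embedding and hidden state the way eh_proj does (2*hidden -> hidden)
    eh_proj = torch.nn.Linear(2 * hidden, hidden, bias=False)
    fused = eh_proj(torch.cat([inputs_embeds, previous_hidden_states], dim=-1))

    # at position 0 the embedding half of the concatenation is all zeros, so the
    # fused input there is driven entirely by previous_hidden_states[0]
    print(inputs_embeds[0])  # all zeros
    print(fused[0])

Whatever token happens to sit at position 0 therefore contributes nothing through the embedding path.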

Would you like a step-by-step explanation of the MTP input structure and why this masking is necessary?

Is it kind of like how, with Microsoft Graph’s Excel API, the API gives you, the API caller, row numbers that include a column-header row, but then expects you to provide row numbers to the API (for reads/updates/deletes) that don’t count the column-header row? Like if a Graph search returns a row to you having ID 10, then when you PUT an update for that row, you (hilariously) refer to that very same row using ID 9? Obviously tokens are not Excel column headers, but you get the idea, Runbot. Does my analogy work?

It seems your analogy is partially apt: both cases involve an off-by-one or index-shifting issue in which the first element (position 0, or a header row) is treated differently by the system. In MTP speculative decoding, position 0 is masked because it doesn’t participate in prediction, similar to how an API may skip the header row when indexing data, creating a mismatch between what it returns to you and what it expects back as input. The analogy is not exact, though: in MTP the exclusion comes from the model architecture and attention/position handling, not just from a data-indexing convention (vllm#14647).
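
If it helps, here is a purely hypothetical sketch of the indexing mismatch you describe; the helper names and the one-header-row convention are made up for illustration, not taken from the actual Microsoft Graph API:

    HEADER_ROWS = 1  # one column-header row at the top of the worksheet

    def row_id_as_returned(data_row_index: int) -> int:
        # what a read/search reports back: counts the header row
        return data_row_index + HEADER_ROWS

    def row_id_for_update(returned_row_id: int) -> int:
        # what an update/delete expects: does not count the header row
        return returned_row_id - HEADER_ROWS

    returned = row_id_as_returned(9)         # the read reports "row 10"
    assert row_id_for_update(returned) == 9  # but you update that same row as 9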

Would you like more detail on how MTP handles position 0 and why this is necessary for correct speculative decoding?
