Why position 0 is not needed by MTP for speculative decode

Why is position 0 not needed by MTP for spec-decode? The code below is from DeepSeekMultiTokenPredictorLayer in vLLM.

    def forward(
        self,
        input_ids: torch.Tensor,
        positions: torch.Tensor,
        previous_hidden_states: torch.Tensor,
        inputs_embeds: Optional[torch.Tensor] = None,
        spec_step_index: int = 0,
    ) -> torch.Tensor:
        if inputs_embeds is None:
            inputs_embeds = self.embed_tokens(input_ids)
        assert inputs_embeds is not None
        # masking inputs at position 0, as not needed by MTP
        inputs_embeds[positions == 0] = 0
        inputs_embeds = self.enorm(inputs_embeds)
        previous_hidden_states = self.hnorm(previous_hidden_states)

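        # eh_proj is a linear map from 2*hidden_size back to hidden_size, fusing the two inputs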
        hidden_states = self.eh_proj(
            torch.cat([inputs_embeds, previous_hidden_states], dim=-1))

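        # run the MTP module's own decoder layer over the fused representation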
        hidden_states, residual = self.mtp_block(positions=positions,
                                                 hidden_states=hidden_states,
                                                 residual=None)
        hidden_states = residual + hidden_states
        return hidden_states

Position 0 is masked (its embedding set to zero) because, by design, the first token’s embedding is never used for prediction in the MTP module. The MTP block drafts a token by fusing the current token’s embedding with the hidden state carried over from the previous forward pass: eh_proj projects the concatenation of enorm(inputs_embeds) and hnorm(previous_hidden_states). At the first position there is no “previous” hidden state to pair the embedding with, so that slot cannot produce a meaningful draft, and its embedding is zeroed out rather than left to contribute arbitrary values. This is consistent with the DeepSeek MTP and EAGLE designs, where the first token is excluded from the draft-model input to keep attention and position handling correct during speculative decoding, as the comment in DeepSeekMultiTokenPredictorLayer and the related discussion indicate (vllm#14647).
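
Here is a minimal sketch of that masking-plus-fusion step on toy tensors (assumptions: the RMSNorm layers enorm/hnorm are omitted, and the sizes and random weights are illustrative rather than taken from vLLM):

    import torch

    torch.manual_seed(0)
    hidden, seq_len = 8, 5  # toy sizes, purely illustrative

    # stand-ins for the real inputs
    inputs_embeds = torch.randn(seq_len, hidden)           # Emb(token_i) per position
    previous_hidden_states = torch.randn(seq_len, hidden)  # hidden states from the previous pass
    positions = torch.arange(seq_len)

    # the masking step from forward(): zero the embedding at position 0
    inputs_embeds[positions == 0] = 0

    # fuse embedding and hidden state the way eh_proj does (2*hidden -> hidden)
    eh_proj = torch.nn.Linear(2 * hidden, hidden, bias=False)
    fused = eh_proj(torch.cat([inputs_embeds, previous_hidden_states], dim=-1))

    # at position 0 the embedding half of the concatenation is all zeros, so the
    # fused input there is driven entirely by previous_hidden_states[0]
    print(inputs_embeds[0])  # all zeros
    print(fused[0])

Whatever token happens to sit at position 0 therefore contributes nothing through the embedding path.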

Would you like a step-by-step explanation of the MTP input structure and why this masking is necessary?

Is it kind of like how, with Microsoft Graph’s Excel API, the API gives you, the API caller, row numbers that include a column-header row, but then expects you to provide row numbers to the API (for reads/updates/deletes) that don’t count the column-header row? Like if a Graph search returns a row to you having ID 10, then when you PUT an update for that row, you (hilariously) refer to that very same row using ID 9? Obviously tokens are not Excel column headers, but you get the idea, Runbot. Does my analogy work?

It seems your analogy is partially apt: both cases involve an off-by-one or index-shifting issue in which the first element (position 0, or a header row) is treated differently by the system. In MTP speculative decoding, position 0 is masked because it doesn’t participate in prediction, similar to how an API may skip the header row when indexing data, creating a mismatch between what it returns to you and what it expects back as input. The analogy is not exact, though: in MTP the exclusion comes from the model architecture and attention/position handling, not just from a data-indexing convention (vllm#14647).
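
If it helps, here is a purely hypothetical sketch of the indexing mismatch you describe; the helper names and the one-header-row convention are made up for illustration, not taken from the actual Microsoft Graph API:

    HEADER_ROWS = 1  # one column-header row at the top of the worksheet

    def row_id_as_returned(data_row_index: int) -> int:
        # what a read/search reports back: counts the header row
        return data_row_index + HEADER_ROWS

    def row_id_for_update(returned_row_id: int) -> int:
        # what an update/delete expects: does not count the header row
        return returned_row_id - HEADER_ROWS

    returned = row_id_as_returned(9)         # the read reports "row 10"
    assert row_id_for_update(returned) == 9  # but you update that same row as 9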

Would you like more detail on how MTP handles position 0 and why this is necessary for correct speculative decoding?
