When will draft tokens be appended to input_ids?

If pipeline parallelism is enabled, spec_token_ids will not be appended to input_batch.token_ids_cpu in ModelRunner._update_states(), because the req_id is not found in self.input_batch.req_id_to_index. I'm curious when the draft tokens will be inserted into token_ids_cpu.

When pipeline parallelism (PP) is enabled, the scheduler sends the sampled tokens (including speculative/draft tokens) back to the model runner, because there is no direct communication between the first-stage and last-stage workers. In this case, the update to input_batch.token_ids_cpu for spec_token_ids is handled in the _update_states method, but only for requests present in self.input_batch.req_id_to_index. If a request is not found there (e.g., new or resumed requests), it is added to the persistent batch via self.input_batch.add_request(req_state), which copies both prompt and output token IDs—including any draft/speculative tokens—into token_ids_cpu at that point. Thus, for PP, draft tokens are inserted into token_ids_cpu when the request is (re-)added to the batch, not during the main update loop for running requests. See the relevant logic in gpu_model_runner.py and discussion in PR #20291.
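
To make the ordering concrete, here is a minimal, self-contained sketch of the control flow described above. It is only an illustration: ToyInputBatch and update_states are hypothetical stand-ins, not vLLM's actual InputBatch or _update_states (see gpu_model_runner.py and gpu_input_batch.py for the real code); only the names taken from the discussion (req_id_to_index, token_ids_cpu, add_request) are meant to mirror the real thing.

```python
# Illustrative sketch only; not vLLM's real InputBatch or _update_states.
import numpy as np

class ToyInputBatch:
    def __init__(self, max_reqs=8, max_len=32):
        self.req_id_to_index = {}
        self.token_ids_cpu = np.zeros((max_reqs, max_len), dtype=np.int64)
        self.num_tokens = np.zeros(max_reqs, dtype=np.int64)

    def add_request(self, req_id, all_token_ids):
        # add_request() copies the request's full token history (prompt + output,
        # including any draft tokens already recorded) into token_ids_cpu.
        idx = len(self.req_id_to_index)
        self.req_id_to_index[req_id] = idx
        self.token_ids_cpu[idx, :len(all_token_ids)] = all_token_ids
        self.num_tokens[idx] = len(all_token_ids)

def update_states(batch, spec_tokens, new_or_resumed):
    # 1) Main update loop: only requests already in the persistent batch get
    #    their speculative tokens written into token_ids_cpu here.
    for req_id, spec_ids in spec_tokens.items():
        idx = batch.req_id_to_index.get(req_id)
        if idx is None:
            continue  # new/resumed request: skipped here, handled in step 2
        start = int(batch.num_tokens[idx])
        batch.token_ids_cpu[idx, start:start + len(spec_ids)] = spec_ids
        batch.num_tokens[idx] += len(spec_ids)

    # 2) New/resumed requests are (re-)added afterwards; this is the point at
    #    which their draft tokens reach token_ids_cpu.
    for req_id, all_ids in new_or_resumed.items():
        batch.add_request(req_id, all_ids)
```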

Would you like a step-by-step breakdown of the code paths involved?

If there are 2 batches in the queue, then after batch_0 prefills and batch_1 prefills, when batch_0 is decoding, req_id_to_index only contains batch_1's requests, and the spec_token_ids will not be appended.

It seems likely that if req_id_to_index only contains batch_1 requests during batch_0’s decoding, then speculative tokens (spec_token_ids) for batch_0 will not be appended to input_batch.token_ids_cpu in _update_states. This is because the code only updates token_ids_cpu for requests present in req_id_to_index, and new/resumed requests are added to the batch after this check. As a result, if a request is not in the persistent batch at the time of decoding, its speculative tokens may not be inserted as expected. This behavior is discussed as a source of inconsistency in vLLM issue #20247.
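
It may help to picture the snapshot you describe. The sketch below is purely hypothetical (the request IDs and token values are made up); it only shows why the update loop drops batch_0's drafts when its requests are absent from req_id_to_index:

```python
# Hypothetical snapshot: batch_0's requests are not in the persistent batch
# while batch_1's are, so the spec-token update loop skips batch_0 entirely.
req_id_to_index = {"batch_1/req_0": 0, "batch_1/req_1": 1}

# Draft tokens produced for a batch_0 request during its decode step.
spec_token_ids = {"batch_0/req_0": [101, 102]}

appended, skipped = [], []
for req_id, drafts in spec_token_ids.items():
    (appended if req_id in req_id_to_index else skipped).append(req_id)

print(appended)  # [] -> nothing written to token_ids_cpu in this step
print(skipped)   # ['batch_0/req_0'] -> drafts only land later, via add_request()
```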

Would you like a detailed explanation of how this can lead to token count mismatches between the scheduler and model runner?

When will InputBatch.remove_request() be called?

InputBatch.remove_request() is called when a request is finished, aborted, or unscheduled (e.g., preempted or not scheduled in the current step). This method removes the request from the batch and must be followed by a call to condense() to compact the batch. See gpu_input_batch.py and API docs.
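
For intuition, here is a minimal toy sketch of the remove-then-condense pattern. It is not vLLM's InputBatch from gpu_input_batch.py (the real remove_request/condense track more state and have different signatures); it only mirrors the idea that remove_request() leaves a hole in the batch and condense() compacts the surviving rows:

```python
# Toy persistent batch; illustrative only, not the real gpu_input_batch.InputBatch.
import numpy as np

class ToyBatch:
    def __init__(self, max_reqs=4, max_len=8):
        self.req_ids = [None] * max_reqs
        self.req_id_to_index = {}
        self.token_ids_cpu = np.zeros((max_reqs, max_len), dtype=np.int64)
        self.num_reqs = 0

    def add_request(self, req_id, token_ids):
        idx = self.num_reqs
        self.req_ids[idx] = req_id
        self.req_id_to_index[req_id] = idx
        self.token_ids_cpu[idx, :len(token_ids)] = token_ids
        self.num_reqs += 1

    def remove_request(self, req_id):
        # Called when a request finishes, is aborted, or is unscheduled/preempted.
        idx = self.req_id_to_index.pop(req_id, None)
        if idx is not None:
            self.req_ids[idx] = None  # leaves a hole; condense() must follow

    def condense(self):
        # Compact the surviving rows so the batch is contiguous again.
        write = 0
        for read in range(self.num_reqs):
            if self.req_ids[read] is None:
                continue
            if write != read:
                self.req_ids[write] = self.req_ids[read]
                self.token_ids_cpu[write] = self.token_ids_cpu[read]
                self.req_id_to_index[self.req_ids[write]] = write
            write += 1
        for i in range(write, self.num_reqs):
            self.req_ids[i] = None
        self.num_reqs = write
```

In this toy, condense() exists so that row indices stay dense after removals, which mirrors why the real batch must be compacted before the next step.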

Would you like more detail on the specific scenarios or code paths where this occurs?
