If pipeline parallelism is enabled, spec_token_ids will not be appended to input_batch.token_ids_cpu in ModelRunner._update_states(), because the req_id is not found in self.input_batch.req_id_to_index. I'm curious when the draft tokens will be inserted into token_ids_cpu.
When pipeline parallelism (PP) is enabled, the scheduler sends the sampled tokens (including speculative/draft tokens) back to the model runner, because there is no direct communication between the first-stage and last-stage workers. In this case, the update to input_batch.token_ids_cpu for spec_token_ids is handled in the _update_states method, but only for requests present in self.input_batch.req_id_to_index. If a request is not found there (e.g., new or resumed requests), it is added to the persistent batch via self.input_batch.add_request(req_state), which copies both prompt and output token IDs, including any draft/speculative tokens, into token_ids_cpu at that point. Thus, for PP, draft tokens are inserted into token_ids_cpu when the request is (re-)added to the batch, not during the main update loop for running requests. See the relevant logic in gpu_model_runner.py and the discussion in PR #20291.
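For illustration, here is a minimal, self-contained sketch of those two paths. SimpleInputBatch and update_states are hypothetical, heavily simplified stand-ins for InputBatch and _update_states, not the actual vLLM code:

```python
# Sketch (hypothetical names, not vLLM code) of the two paths by which token
# IDs can reach token_ids_cpu under the behavior described above.

class SimpleInputBatch:
    def __init__(self):
        self.req_id_to_index = {}   # req_id -> row in token_ids_cpu
        self.token_ids_cpu = []     # one list of token IDs per row
        self.num_tokens = []        # tokens currently stored per row

    def add_request(self, req_id, prompt_token_ids, output_token_ids):
        # Path 2: (re-)adding a request copies prompt + all output tokens,
        # including any draft/speculative tokens already in output_token_ids.
        self.req_id_to_index[req_id] = len(self.token_ids_cpu)
        all_ids = list(prompt_token_ids) + list(output_token_ids)
        self.token_ids_cpu.append(all_ids)
        self.num_tokens.append(len(all_ids))


def update_states(batch, scheduled_spec_tokens):
    # Path 1: the in-place update only touches requests already present in
    # the persistent batch; unknown req_ids are skipped in this loop and are
    # expected to be handled when the request is (re-)added.
    for req_id, spec_token_ids in scheduled_spec_tokens.items():
        index = batch.req_id_to_index.get(req_id)
        if index is None:
            continue  # not in the persistent batch -> no append here
        batch.token_ids_cpu[index].extend(spec_token_ids)
        batch.num_tokens[index] += len(spec_token_ids)


if __name__ == "__main__":
    batch = SimpleInputBatch()
    # Request not yet in the batch: the update loop skips it ...
    update_states(batch, {"req-0": [101, 102]})
    assert "req-0" not in batch.req_id_to_index
    # ... and the drafts only land in token_ids_cpu once the request is added,
    # because by then they are part of its output_token_ids.
    batch.add_request("req-0", prompt_token_ids=[1, 2, 3],
                      output_token_ids=[7, 101, 102])
    print(batch.token_ids_cpu[batch.req_id_to_index["req-0"]])
```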
Would you like a step-by-step breakdown of the code paths involved?
If there are 2 batches in the queue, then after batch_0 prefills and batch_1 prefills, when batch_0 is decoding, req_id_to_index only contains batch_1 reqs, so the spec_token_ids will not be appended.
It seems likely that if req_id_to_index only contains batch_1 requests during batch_0's decoding, then speculative tokens (spec_token_ids) for batch_0 will not be appended to input_batch.token_ids_cpu in _update_states. This is because the code only updates token_ids_cpu for requests present in req_id_to_index, and new/resumed requests are added to the batch after this check. As a result, if a request is not in the persistent batch at the time of decoding, its speculative tokens may not be inserted as expected. This behavior is discussed as a source of inconsistency in vLLM issue #20247.
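As a rough, standalone sketch of that mismatch (apply_spec_tokens and the request IDs below are hypothetical, not vLLM code): drafts scheduled for a request that is missing from req_id_to_index are silently dropped, so the scheduler's count and the model runner's stored count diverge.

```python
# Hypothetical demo: the persistent batch only knows batch_1's request, so
# drafts the scheduler sent for batch_0 are never stored.

def apply_spec_tokens(req_id_to_index, token_ids_cpu, spec_tokens):
    """Append drafts only for requests currently in the persistent batch.

    Returns how many draft tokens were actually stored per request.
    """
    applied = {}
    for req_id, drafts in spec_tokens.items():
        index = req_id_to_index.get(req_id)
        if index is None:
            applied[req_id] = 0   # batch_0 request: silently dropped
            continue
        token_ids_cpu[index].extend(drafts)
        applied[req_id] = len(drafts)
    return applied


if __name__ == "__main__":
    # Persistent batch currently holds only batch_1's request.
    req_id_to_index = {"b1-req": 0}
    token_ids_cpu = [[1, 2, 3]]

    # Scheduler output also covers batch_0, which is not resident.
    scheduled = {"b0-req": [55, 56], "b1-req": [77]}
    applied = apply_spec_tokens(req_id_to_index, token_ids_cpu, scheduled)

    for req_id, drafts in scheduled.items():
        print(f"{req_id}: scheduler sent {len(drafts)} drafts, "
              f"model runner stored {applied[req_id]}")
    # b0-req: scheduler sent 2 drafts, model runner stored 0 -> mismatch
```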
Would you like a detailed explanation of how this can lead to token count mismatches between the scheduler and model runner?
When will InputBatch.remove_request() be called?
InputBatch.remove_request() is called when a request is finished, aborted, or unscheduled (e.g., preempted or not scheduled in the current step). This method removes the request from the batch and must be followed by a call to condense() to compact the batch. See gpu_input_batch.py and the API docs.
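Here is a rough sketch of that remove-then-condense contract; TinyBatch and its compaction strategy are hypothetical stand-ins, not the actual gpu_input_batch.py code:

```python
# Hypothetical sketch: remove_request() only empties a row, leaving a hole;
# condense() then moves the last live rows into the holes so the occupied
# rows stay contiguous at the front of the batch.

class TinyBatch:
    def __init__(self):
        self.req_ids = []           # row -> req_id (None marks a hole)
        self.req_id_to_index = {}

    def add_request(self, req_id):
        self.req_id_to_index[req_id] = len(self.req_ids)
        self.req_ids.append(req_id)

    def remove_request(self, req_id):
        # Called when a request finishes, is aborted, or is unscheduled
        # (e.g. preempted). The row is only marked empty here.
        index = self.req_id_to_index.pop(req_id, None)
        if index is not None:
            self.req_ids[index] = None
        return index

    def condense(self):
        # Compact: repeatedly move the last live row into the first hole.
        while None in self.req_ids:
            hole = self.req_ids.index(None)
            # Drop trailing holes, then move the last live row forward.
            while self.req_ids and self.req_ids[-1] is None:
                self.req_ids.pop()
            if hole >= len(self.req_ids):
                break
            moved = self.req_ids.pop()
            self.req_ids[hole] = moved
            self.req_id_to_index[moved] = hole


if __name__ == "__main__":
    batch = TinyBatch()
    for req_id in ("a", "b", "c"):
        batch.add_request(req_id)
    batch.remove_request("a")   # e.g. the request finished or was preempted
    batch.condense()            # must follow removals to keep the batch dense
    print(batch.req_ids, batch.req_id_to_index)   # ['c', 'b'] {'b': 1, 'c': 0}
```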
Would you like more detail on the specific scenarios or code paths where this occurs?