In the v1 scheduler, why not use “total_num_scheduled_tokens = self.max_num_scheduled_tokens - token_budget”?
I think the sum function may take a long time:
“total_num_scheduled_tokens = sum(num_scheduled_tokens.values())”
The reason for using total_num_scheduled_tokens = sum(num_scheduled_tokens.values()) instead of self.max_num_scheduled_tokens - token_budget is to count the tokens scheduled for all requests directly and accurately, regardless of how the token budget is updated or manipulated during scheduling. The sum stays correct even if the token budget is changed in non-trivial ways during scheduling, such as when requests are partially scheduled, preempted, or resumed. This approach is explicitly shown in the vLLM v1 scheduler code and is used to maintain robust constraint checks (v1/core/sched/scheduler.py).
While the sum function may have a small performance cost, it is generally negligible compared to the overall scheduling and model inference time, and it avoids subtle bugs that could arise from relying on the token budget variable, which may not always reflect the true number of scheduled tokens due to complex scheduling logic (v1/core/sched/scheduler.py). Would you like a step-by-step breakdown of how the scheduler updates these values?
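To make the comparison concrete, here is a minimal sketch of the two ways of computing the total. It is a toy model, not the vLLM scheduler: the function schedule_step and the request IDs are made up for illustration, and only the variable names are borrowed from the question.

```python
# Toy model only: variable names mirror the thread, but this is NOT vLLM code.
def schedule_step(requested, max_num_scheduled_tokens):
    """Greedily grant tokens to requests until the budget runs out."""
    token_budget = max_num_scheduled_tokens
    num_scheduled_tokens = {}
    for req_id, num_requested in requested.items():
        if token_budget == 0:
            break
        # A request may be only partially scheduled if the budget is short.
        num_new_tokens = min(num_requested, token_budget)
        num_scheduled_tokens[req_id] = num_new_tokens
        token_budget -= num_new_tokens
    return num_scheduled_tokens, token_budget


max_tokens = 8
scheduled, budget = schedule_step({"a": 3, "b": 4, "c": 5}, max_tokens)

# Option 1: count what was actually recorded; cost grows with the number
# of scheduled requests.
total_by_sum = sum(scheduled.values())

# Option 2: derive the total from the budget; constant time, but only as
# correct as every update to the budget.
total_by_budget = max_tokens - budget

print(total_by_sum, total_by_budget)
```

In this simple loop both options give the same total, because every entry written to the dictionary is matched by exactly one budget update. The sum only iterates over the requests scheduled in one step, so its cost is bounded by the batch size.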
I think the “token_budget” has taken all the scenarios into account.
It seems likely that token_budget is carefully managed, but my understanding is that the vLLM v1 scheduler still prefers to sum num_scheduled_tokens.values() for total_num_scheduled_tokens to ensure correctness. This is because, during scheduling, tokens may be allocated, deallocated, or adjusted in ways that could make token_budget out of sync with the actual scheduled tokens, especially with features like preemption, partial scheduling, or encoder-related adjustments (v1/core/sched/scheduler.py).
So, while token_budget is intended to track available tokens, summing the scheduled tokens directly is a more robust way to verify constraints and avoid subtle bugs. Would you like a detailed example of where these two values might diverge?
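As a rough illustration of how the two values could drift apart, the sketch below injects a deliberate bookkeeping mistake: a request is dropped from the step without its tokens being returned to the budget. The class StepState and its methods are hypothetical and exist only for this example; nothing here is taken from the vLLM source, and it does not claim that vLLM has such a bug.

```python
# Hypothetical bookkeeping bug, for illustration only; NOT taken from vLLM.
class StepState:
    def __init__(self, max_num_scheduled_tokens):
        self.max_num_scheduled_tokens = max_num_scheduled_tokens
        self.token_budget = max_num_scheduled_tokens
        self.num_scheduled_tokens = {}

    def grant(self, req_id, num_tokens):
        """Record the grant and charge it against the budget."""
        self.num_scheduled_tokens[req_id] = num_tokens
        self.token_budget -= num_tokens

    def preempt_without_refund(self, req_id):
        """Drop the request from this step but forget to refund the budget."""
        self.num_scheduled_tokens.pop(req_id)


state = StepState(max_num_scheduled_tokens=8)
state.grant("a", 3)
state.grant("b", 4)
state.preempt_without_refund("b")

# The dictionary reflects what is actually scheduled in this step.
total_by_sum = sum(state.num_scheduled_tokens.values())

# The budget still carries the charge for the dropped request.
total_by_budget = state.max_num_scheduled_tokens - state.token_budget

print(total_by_sum, total_by_budget)  # 3 7
```

If every code path refunds the budget correctly, the two totals always agree and the subtraction is safe. Summing the dictionary simply removes the dependence on that invariant.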