Why not use subtraction to calculate total_num_scheduled_tokens in the scheduler?

Easyer · May 17, 2025, 4:12am

In the v1 scheduler, why not use “total_num_scheduled_tokens = self.max_num_scheduled_tokens - token_budget”?
I think the sum function may take a long time:

“total_num_scheduled_tokens = sum(num_scheduled_tokens.values())”

RunLLM · May 17, 2025, 4:12am

The reason for using total_num_scheduled_tokens = sum(num_scheduled_tokens.values()) instead of self.max_num_scheduled_tokens - token_budget is to directly and accurately count the number of tokens scheduled for all requests, regardless of how the token budget is updated or manipulated during scheduling. The sum function ensures correctness even if the token budget is changed in non-trivial ways during the scheduling process, such as when requests are partially scheduled, preempted, or resumed. This approach is explicitly shown in the vLLM v1 scheduler code and is used to maintain robust constraint checks (v1/core/sched/scheduler.py).

The sum does have a cost, but it is linear in the number of requests scheduled in the current step, which is generally negligible compared to the rest of the scheduling work and to model inference time. In exchange, it avoids subtle bugs that could arise from relying on token_budget, which may not always reflect the true number of scheduled tokens because of the complex scheduling logic (v1/core/sched/scheduler.py). Would you like a step-by-step breakdown of how the scheduler updates these values?
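To make the comparison concrete, here is a minimal sketch of a toy scheduling pass. It is not vLLM code: `schedule_step`, the request sizes, and `MAX_TOKENS` are assumptions made up for illustration, and only the variable names are borrowed from the thread. When every dictionary update is paired with a budget update, the two formulas agree, and the sum can be timed directly.

```python
import timeit


def schedule_step(requests, max_num_scheduled_tokens):
    """Toy scheduling pass (hypothetical, not the vLLM implementation)."""
    token_budget = max_num_scheduled_tokens
    num_scheduled_tokens = {}
    for req_id, num_new_tokens in requests.items():
        if token_budget == 0:
            break
        # Partial scheduling: a request may only get part of what it asks for.
        n = min(num_new_tokens, token_budget)
        num_scheduled_tokens[req_id] = n
        token_budget -= n
    return num_scheduled_tokens, token_budget


MAX_TOKENS = 8192  # assumed budget for this example
requests = {f"req-{i}": 100 + i for i in range(256)}

scheduled, token_budget = schedule_step(requests, MAX_TOKENS)

total_by_sum = sum(scheduled.values())
total_by_subtraction = MAX_TOKENS - token_budget

# With consistent bookkeeping the two formulas give the same answer.
assert total_by_sum == total_by_subtraction
assert total_by_sum <= MAX_TOKENS

# Rough cost of the sum on this machine, averaged over 1000 calls.
seconds_per_call = timeit.timeit(lambda: sum(scheduled.values()), number=1000) / 1000
print(f"{len(scheduled)} scheduled requests, sum() took ~{seconds_per_call:.2e} s per call")
```

In this sketch the sum works as an independent cross-check on the budget arithmetic: it reads only the dictionary, so it stays correct even if a later code change touches token_budget in an unexpected way.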

Sources:

v1/core/sched/scheduler.py

Easyer · May 17, 2025, 6:16am

I think “token_budget” already takes all of these scenarios into account.

RunLLM · May 17, 2025, 6:16am

It seems likely that token_budget is carefully managed, but my understanding is that the vLLM v1 scheduler still prefers to sum num_scheduled_tokens.values() for total_num_scheduled_tokens to ensure correctness. This is because, during scheduling, tokens may be allocated, deallocated, or adjusted in ways that could make token_budget out of sync with the actual scheduled tokens, especially with features like preemption, partial scheduling, or encoder-related adjustments (v1/core/sched/scheduler.py).

So, while token_budget is intended to track available tokens, summing the scheduled tokens directly is a more robust way to verify constraints and avoid subtle bugs. Would you like a detailed example of where these two values might diverge?
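As an illustration of how the two values could diverge, consider a hypothetical bookkeeping bug. The `ToyScheduler` class and both `preempt_*` methods below are invented for this sketch and do not describe any actual vLLM code path; whether the real scheduler can reach such a state is not established in this thread.

```python
class ToyScheduler:
    """Hypothetical scheduler state; only the field names echo the thread."""

    def __init__(self, max_num_scheduled_tokens):
        self.max_num_scheduled_tokens = max_num_scheduled_tokens
        self.token_budget = max_num_scheduled_tokens
        self.num_scheduled_tokens = {}

    def schedule(self, req_id, num_tokens):
        n = min(num_tokens, self.token_budget)
        self.num_scheduled_tokens[req_id] = n
        self.token_budget -= n

    def preempt_correct(self, req_id):
        # Remove the request and refund its tokens to the budget.
        self.token_budget += self.num_scheduled_tokens.pop(req_id)

    def preempt_buggy(self, req_id):
        # Assumed bug: the request is removed but the budget is not refunded.
        self.num_scheduled_tokens.pop(req_id)

    def total_by_sum(self):
        return sum(self.num_scheduled_tokens.values())

    def total_by_subtraction(self):
        return self.max_num_scheduled_tokens - self.token_budget


good = ToyScheduler(100)
good.schedule("a", 40)
good.schedule("b", 30)
good.preempt_correct("b")
assert good.total_by_sum() == good.total_by_subtraction() == 40

bad = ToyScheduler(100)
bad.schedule("a", 40)
bad.schedule("b", 30)
bad.preempt_buggy("b")
# The dictionary says 40 tokens are scheduled; the budget arithmetic says 70.
assert bad.total_by_sum() == 40
assert bad.total_by_subtraction() == 70
```

If token_budget is refunded on every such path, as the question assumes, the subtraction gives the same result; the sum simply does not depend on that assumption holding.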

Sources:

v1/core/sched/scheduler.py
