vLLM V1 Scheduler: Inconsistent Request Scheduling Under a Token Budget Limit

The inconsistent scheduling you observe, where not all requests are scheduled in the first round even though their total token requirement exactly matches the token budget, is a known issue in vLLM V1's scheduler logic. It is not intended behavior. It comes from the way the scheduler iterates through the waiting queue and checks the token budget, especially when chunked prefill is not enabled. When the scheduler encounters a request that cannot be scheduled (e.g., because it fails a strict num_new_tokens > token_budget check), it may break out of the scheduling loop early rather than continue to the next requests. The result is an underused token budget and batch sizes that vary from step to step. This is confirmed by the code and by similar issues reported by users (see related discussion, see scheduler code).

Specifically, in the scheduling loop for WAITING requests, if a request's num_new_tokens exceeds the remaining token_budget and chunked prefill is not enabled, the scheduler breaks out of the loop instead of skipping that request and trying the next one. This produces the inconsistent and suboptimal scheduling you see, especially when the sum of all requests fits the budget but an individual request does not fit what remains of it at its turn, or when the order of requests in the queue varies. This is a known limitation and has been discussed as a source of confusion and inefficiency (see issue, see code).
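
To make the difference between breaking and skipping concrete, here is a minimal toy sketch in Python. It is not vLLM's scheduler code: the function schedule_step, the skip_instead_of_break flag, and the (request id, token count) tuples are hypothetical names invented for this illustration, and the model ignores running requests, KV cache limits, and chunked prefill entirely.

```python
from collections import deque

def schedule_step(waiting, token_budget, skip_instead_of_break=False):
    """Toy model of one scheduling step (hypothetical, not vLLM's API).

    waiting: deque of (request_id, num_new_tokens) tuples, in queue order.
    Returns (scheduled_request_ids, leftover_token_budget).
    """
    scheduled = []
    skipped = deque()
    while waiting and token_budget > 0:
        req_id, num_new_tokens = waiting[0]
        if num_new_tokens > token_budget:
            if not skip_instead_of_break:
                # Break-style behavior: stop at the first request that
                # does not fit, even if later requests would fit.
                break
            # Skip-style behavior: set this request aside and keep going.
            skipped.append(waiting.popleft())
            continue
        waiting.popleft()
        scheduled.append(req_id)
        token_budget -= num_new_tokens
    # Put skipped requests back at the front, preserving their order.
    waiting.extendleft(reversed(skipped))
    return scheduled, token_budget

requests = [("a", 6), ("b", 5), ("c", 4)]

# Break on the first request that does not fit: only "a" is scheduled
# and 4 tokens of budget go unused.
print(schedule_step(deque(requests), token_budget=10))
# -> (['a'], 4)

# Skip the request that does not fit: "c" fills the remaining budget.
print(schedule_step(deque(requests), token_budget=10, skip_instead_of_break=True))
# -> (['a', 'c'], 0)
```

In this toy model, the break variant leaves budget unused whenever a request that does not fit sits ahead of one that would. The skip variant fills the budget but lets later requests overtake earlier ones, so it trades strict queue order for utilization.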

Would you like a step-by-step breakdown of the relevant code paths and suggestions for possible workarounds or patches?

Sources: