V1 Chunked Prefill Scheduling Policy: how is prefill scheduled?

Hi guys, some questions about the V1 chunked prefill scheduling policy: it seems that it prioritizes decode first.

  1. After we schedule as many decodes as possible, how do we determine which prefill gets scheduled first? Is it first come, first served, or shortest prefill first?
  2. If it is first come, first served, does that mean we can have at most 2 partial prefills: one carried over from the previous iteration that was chunked, and another one from the current iteration that we just chunked to fill up the batch?
  3. If it is shortest first, how do we prevent a long prefill from waiting forever?
  1. It’s currently FCFS, but we plan to support more scheduling policies.
  2. Currently no. We will finish the prefill of the current request before starting to prefill the next one. However, if there is still token budget left after scheduling the remaining prompt tokens of the current request, we can batch the next prefill request in the same step. For example, assuming the chunk size is 10 (see the sketch after this list):

Time 1:

  • Req 1 comes with 14 prompt tokens
  • Schedule a batch with 1-10 tokens of Req 1.

Time 2:

  • Req 2 comes with 10 prompt tokens
  • Schedule a batch with 11-14 tokens of Req 1, plus 1-6 tokens of Req 2.

Time 3:

  • Schedule a batch with 1 decode token of Req 1, plus 7-10 tokens of Req 2.
  3. The current policy (FCFS) won't cause starvation.
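
For reference, here is a minimal sketch of this FCFS chunked-prefill behavior. The `Request` class and `schedule_step()` helper are made-up names for illustration only, not the actual vLLM scheduler:

```python
# Toy FCFS chunked-prefill scheduling with a per-step token budget of 10,
# reproducing the Time 1-3 example above. Decode steps schedule 1 token.
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    num_prompt_tokens: int
    num_computed_tokens: int = 0  # prompt tokens already prefilled

    @property
    def in_prefill(self) -> bool:
        return self.num_computed_tokens < self.num_prompt_tokens


def schedule_step(running: list[Request], token_budget: int) -> list[tuple[str, int]]:
    """Return (request name, num scheduled tokens) pairs for one batch, in FCFS order."""
    batch = []
    for req in running:  # FCFS: `running` is kept in arrival order
        if token_budget == 0:
            break
        if req.in_prefill:
            # Chunk the remaining prompt tokens to fit the budget.
            n = min(req.num_prompt_tokens - req.num_computed_tokens, token_budget)
        else:
            n = 1  # decode: one token per step
        req.num_computed_tokens += n  # assume the batch executes immediately
        token_budget -= n
        batch.append((req.name, n))
    return batch


running = [Request("Req1", 14)]
print("Time 1:", schedule_step(running, 10))  # [('Req1', 10)]
running.append(Request("Req2", 10))           # Req 2 arrives
print("Time 2:", schedule_step(running, 10))  # [('Req1', 4), ('Req2', 6)]
print("Time 3:", schedule_step(running, 10))  # [('Req1', 1), ('Req2', 4)]
```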

cc @comaniac Thanks for the explanation! I can't think of an example where, at a given time step, there would be more than 2 partial prefills.

In the current policy, no. Here is another example for Time 2:

  • Req 2 comes with 3 (<10) prompt tokens.
  • Req 3 comes with 10 prompt tokens.
  • Schedule a batch with 11-14 tokens of Req 1, plus 1-3 tokens of Req 2, plus 1-3 tokens of Req 3.

In this case we still have only 2 partial prefills (Req 1 and Req 3). As a result, a batch will never have more than 2 partial prefills.
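
Continuing the toy `schedule_step()` sketch from the earlier reply (made-up helpers, not vLLM code), this variant shows why only the first and the last prefill in a batch can be partial; everything in between always fits completely:

```python
running = [Request("Req1", 14)]
print("Time 1:", schedule_step(running, 10))
# [('Req1', 10)] -> Req1 is left partially prefilled
running += [Request("Req2", 3), Request("Req3", 10)]  # both arrive before Time 2
print("Time 2:", schedule_step(running, 10))
# [('Req1', 4), ('Req2', 3), ('Req3', 3)]
# Req1's chunk continues an earlier chunk and Req3's chunk is cut short by the
# budget, so they are the 2 partial prefills; Req2 is prefilled in full.
```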


Oh, actually it might be possible with multi-modal inputs. Assume a prompt with images looks as follows:

10 text tokens + 100 image tokens + 20 text tokens

In this case, after scheduling the first 10 text tokens, we have to schedule an encoder request to encode the 100 image tokens. However, if too many images are already in flight and we don't have encoder budget left for this image, we cannot continue prefilling this request. We then pause the request and schedule the text tokens of a new prefill request instead. As a result, we can have more than one partial prefill.
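
A minimal sketch of this pause-and-move-on behavior (made-up names such as `MMRequest` and `schedule_prefill()`, not the vLLM implementation): a request is paused when its next segment needs the encoder but the encoder budget is exhausted, and the next prefill request is scheduled instead.

```python
from dataclasses import dataclass


@dataclass
class MMRequest:
    name: str
    segments: list[tuple[str, int]]  # ("text" | "image", num_tokens), in prompt order
    seg_idx: int = 0                 # next segment to prefill
    seg_offset: int = 0              # tokens already prefilled within that segment

    @property
    def in_prefill(self) -> bool:
        return self.seg_idx < len(self.segments)


def schedule_prefill(running, token_budget, encoder_budget):
    batch = []
    for req in running:
        while req.in_prefill and token_budget > 0:
            kind, n = req.segments[req.seg_idx]
            if kind == "image" and req.seg_offset == 0:
                if encoder_budget < n:
                    break                # pause: no encoder budget left for this image
                encoder_budget -= n      # run the encoder for the whole image
            take = min(n - req.seg_offset, token_budget)
            token_budget -= take
            req.seg_offset += take
            if req.seg_offset == n:      # segment done, move on to the next one
                req.seg_idx, req.seg_offset = req.seg_idx + 1, 0
            batch.append((req.name, kind, take))
    return batch


r1 = MMRequest("Req1", [("text", 10), ("image", 100), ("text", 20)])
r2 = MMRequest("Req2", [("text", 40)])
# With no encoder budget left, Req1 is paused after its 10 leading text tokens and
# the scheduler moves on to Req2, which is then chunked by the token budget:
# two partial prefills at the same time.
print(schedule_prefill([r1, r2], token_budget=30, encoder_budget=0))
# [('Req1', 'text', 10), ('Req2', 'text', 20)]
```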


Makes sense, thank you!

Thanks for your examples, but I have a question here.

Considering your example here: suppose we use the step_with_batch_queue() function in /vllm/v1/engine/core.py, and there are two requests in the waiting queue, [R1 (14 tokens), R2 (10 tokens)], with a token_budget of 10. step_with_batch_queue() won't execute update_from_output() until all the requests are scheduled. Let's assume the batch_queue is large enough.

Time 1:

  • Schedule tokens 1-10 of R1 because of the token_budget (10 tokens) constraint. R1 then moves from the waiting queue to the running queue. In the meantime, add R1 to scheduled_req_ids, which tracks a request until update_from_output() is called and removes it from scheduled_req_ids. So now scheduled_req_ids is {R1}.
  • After scheduling successfully, the batch_queue becomes [R1 (10 tokens)], the running queue is [R1], and the waiting queue is [R2].

Time 2:

  • Because there is an unscheduled request, R2 (10 tokens), instead of waiting for the model_executor to produce model_output, step_with_batch_queue() decides to schedule R2 into running, and then adds R2 to the batch_queue.
  • After scheduling, the batch_queue is [R1 (10 tokens), R2 (10 tokens)], the running queue is [R1, R2], scheduled_req_ids is {R1, R2}, and the waiting queue is [] (i.e., empty).

Time 3:

  • At this time scheduled_req_ids = {R1, R2}, so step_with_batch_queue() will block and wait for the model_output.
  • Once the model_output is available, we call update_from_output() and remove R1 from scheduled_req_ids. The running queue remains [R1, R2], scheduled_req_ids becomes {R2}, and the waiting queue remains empty.

Time 4:

  • Because R1 is no longer in scheduled_req_ids after its first 10 tokens are output, instead of waiting for the model_executor to output R2, step_with_batch_queue() schedules R1 with its remaining 4 tokens, then adds R1 to scheduled_req_ids and to the batch_queue.
  • It's worth noting that at this point the running queue is [R1, R2], but the batch_queue is [R2 (10 tokens), R1 (4 tokens)]. So the internal order of the running queue and the batch_queue is not the same (i.e., R1 has higher scheduling priority than R2, but the execution order differs at this point).

Time 5:

  • R2 finishes in the batch_queue, and we prepare to schedule one decode token for R2.

So in this case, can there be multiple chunked prefills in different batches? For example, say R1 has 15 tokens, R2 has 15 tokens, and R3 has 15 tokens, they are all in the waiting queue, and the token_budget is 10. After multiple scheduling rounds in step_with_batch_queue(), would there be 3 partial prefills in different batches in the batch_queue: [R1 (10), R2 (10), R3 (10)]?

And I agree that there are at most 2 partial prefills in the same batch, as you said.

So is your example specific to the step() method in core.py, not the step_with_batch_queue() method?

Yes, you are absolutely right. I didn't cover the cases with the batch queue, because it is only used when pipeline parallelism is enabled. When PP is enabled and the batch queue is used, we may have many partial prefill requests in flight.
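
To tie this back to the R1/R2/R3 question above, here is a toy simulation of the "schedule first, only block on output when nothing new can be scheduled" behavior (simplified bookkeeping with made-up names; the real logic lives in step_with_batch_queue() in vllm/v1/engine/core.py). It ends with three partial prefills sitting in the batch queue at once:

```python
from collections import deque

TOKEN_BUDGET = 10

waiting = deque([("R1", 15), ("R2", 15), ("R3", 15)])  # (name, prompt tokens left)
running: list[list] = []            # [name, remaining prompt tokens]
scheduled_req_ids: set[str] = set()
batch_queue: deque = deque()        # batches in flight, waiting for model output (PP)


def try_schedule() -> bool:
    """Schedule one batch from requests that are not already in flight."""
    batch, budget = [], TOKEN_BUDGET
    for req in running:                         # FCFS over already-running requests
        name, remaining = req
        if budget == 0:
            break
        if name in scheduled_req_ids or remaining == 0:
            continue  # already in flight, or prefill done (decode not modeled here)
        n = min(remaining, budget)
        req[1] -= n
        budget -= n
        scheduled_req_ids.add(name)
        batch.append((name, n))
    while waiting and budget > 0:               # then admit new requests from waiting
        name, remaining = waiting.popleft()
        n = min(remaining, budget)
        running.append([name, remaining - n])
        budget -= n
        scheduled_req_ids.add(name)
        batch.append((name, n))
    if batch:
        batch_queue.append(batch)
        return True
    return False


# Keep scheduling new batches while something unscheduled remains; only after this
# loop would we block and wait for model output (update_from_output not modeled).
while try_schedule():
    pass

print(list(batch_queue))
# [[('R1', 10)], [('R2', 10)], [('R3', 10)]] -> three partial prefills in flight
```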

Thanks very much for your explanation!