Hi guys, some questions about the V1 chunked prefill scheduling policy: it seems that it prioritizes decode first.
- After we schedule as many decodes as possible, how do we determine which prefill gets scheduled first? Is it first come, first served, or does the shorter prefill come first?
- If it is first come, first served, does that mean we could have at most 2 partial prefills? 1 from the previous iteration that was chunked, and another one from the current iteration that we just chunked to fill up the batch.
- If it is shortest first, how do we prevent a long prefill from waiting forever?
cc @comaniac Thanks for the explanation! I couldn't think of an example where, at a given time step, there would be more than 2 partial prefills. Is that possible?
In the current policy, no. Here's another example at Time 2:
- Req 2 comes with 3 (<10) prompt tokens.
- Req 3 comes with 10 prompt tokens.
- Schedule a batch with tokens 11-14 of Req 1, plus tokens 1-3 of Req 2, plus tokens 1-3 of Req 3.
In this case we still have 2 partial prefills (Req 1 and Req 3). As a result, it will never have more than 2 partial prefills.
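To illustrate, here is a minimal sketch of FCFS chunked-prefill scheduling under a per-step token budget. This is not the actual vLLM scheduler code; the function name and data layout are assumptions for illustration only.

```python
# Minimal sketch (not the actual vLLM scheduler): FCFS chunked-prefill
# scheduling under a per-step token budget.

def schedule_prefills(waiting, token_budget):
    """waiting: list of (req_id, remaining_prompt_tokens) in FCFS order.
    Returns a list of (req_id, tokens_scheduled_this_step)."""
    scheduled = []
    for req_id, remaining in waiting:
        if token_budget == 0:
            break
        n = min(remaining, token_budget)   # chunk the prefill if needed
        scheduled.append((req_id, n))
        token_budget -= n
    return scheduled

# Time 2 from the example above: Req 1 has 4 prompt tokens left from its
# previous chunk, Req 2 has 3, Req 3 has 10; the budget is 10.
print(schedule_prefills([("Req 1", 4), ("Req 2", 3), ("Req 3", 10)], 10))
# -> [('Req 1', 4), ('Req 2', 3), ('Req 3', 3)]
# Only the last request touched (Req 3) gets chunked; together with the
# request carried over from the previous step (Req 1), a batch holds at
# most 2 partial prefills.
```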
Oh, actually it might be possible with multi-modality inputs. Assume a prompt with images looks as follows:
10 text tokens + 100 image tokens + 20 text tokens
In this case, after scheduling the first 10 text tokens, we have to schedule an encoder request to encode the 100 image tokens. However, if there are already too many images in flight and we don't have encoder budget to schedule this image, then we cannot continue prefilling this request. In this case, we will pause this request and schedule the text tokens of a new prefill request. As a result, we will have more than one partial prefill.
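Roughly, the pause works like this. The sketch below is a simplified illustration with hypothetical names and numbers, not the real scheduler code:

```python
# Simplified sketch (not the actual vLLM code): a multimodal prefill is
# paused when the encoder budget runs out, so another request can start
# its prefill and more than one partial prefill exists at once.
# All names and numbers are hypothetical.
from dataclasses import dataclass

@dataclass
class Req:
    rid: str
    text_done: int      # prompt tokens already prefilled
    image_at: int       # prompt position where the image tokens start
    image_tokens: int   # number of image (encoder) tokens

def schedule_step(reqs, token_budget, encoder_budget):
    partial = []
    for r in reqs:
        if token_budget == 0:
            break
        if r.text_done == r.image_at and r.image_tokens > encoder_budget:
            # The next chunk needs the encoder but there is no encoder
            # budget left: pause this request; it stays a partial prefill.
            partial.append(r.rid)
            continue
        n = min(token_budget, r.image_at - r.text_done)
        r.text_done += n
        token_budget -= n
        if r.text_done < r.image_at:
            partial.append(r.rid)
    return partial

# Req A is stuck right before its 100 image tokens and the encoder budget
# is exhausted, so Req B's text prefill is scheduled instead; both A and B
# are now partial prefills.
print(schedule_step([Req("A", 10, 10, 100), Req("B", 0, 40, 0)],
                    token_budget=10, encoder_budget=0))
# -> ['A', 'B']
```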
Thanks for your examples, but I have a question here.
Considering your example, suppose we use the `step_with_batch_queue()` function in `/vllm/v1/engine/core.py`, there are two requests in the `waiting` queue, `[R1 (14 tokens), R2 (10 tokens)]`, and the `token_budget` is 10. `step_with_batch_queue()` won't execute `update_from_output()` until all the requests are scheduled. Let's also assume the `batch_queue` is large enough.
Time 1:
- Schedule tokens 1-10 of `R1`, because of the `token_budget` (10 tokens) constraint. `R1` is then moved from the `waiting` queue to the `running` queue. In the meantime, `R1` is added to `scheduled_req_ids`, which keeps recording the request until `update_from_output` is called, which removes requests from `scheduled_req_ids`. So now `scheduled_req_ids` is `{R1}`.
- After scheduling successfully, the `batch_queue` becomes `[R1 (10 tokens)]`, the `running` queue is `[R1]`, and the `waiting` queue is `[R2]`.
Time 2:
- Because there is still an unscheduled request `R2` (10 tokens), instead of waiting for the `model_executor` to produce the `model_output`, `step_with_batch_queue()` decides to schedule `R2` into `running`, and then adds `R2` to the `batch_queue`.
- After scheduling, the `batch_queue` is `[R1 (10 tokens), R2 (10 tokens)]`, the `running` queue is also `[R1, R2]`, `scheduled_req_ids` is `{R1, R2}`, and `waiting` is `[]` (i.e. empty).
Time 3:
- At this time `scheduled_req_ids = {R1, R2}`, so `step_with_batch_queue()` blocks and waits for the `model_output`.
- Once the `model_output` is available, we call `update_from_output()` and remove `R1` from `scheduled_req_ids`. The `running` queue remains `[R1, R2]`, but `scheduled_req_ids` becomes `{R2}`, and `waiting` remains empty.
Time 4:
- Because `R1` is no longer in `scheduled_req_ids` after the output for its first 10 tokens, instead of waiting for the `model_executor` to output `R2`, `step_with_batch_queue()` schedules `R1` with its remaining 4 tokens, then adds `R1` to `scheduled_req_ids` and also to the `batch_queue`.
- It is worth noticing that at this point the `running` queue is `[R1, R2]`, but the `batch_queue` is `[R2 (10 tokens), R1 (4 tokens)]`. So we can see that the internal order of the `running` queue and the `batch_queue` is not the same (i.e. the scheduling priority of `R1` is higher than `R2`, but the execution order differs at this time point).
Time 5:
- `R2` finishes in the `batch_queue`, and we prepare to schedule one decode token for `R2`…
So in this case, can there be multiple chunked prefills in different batches? For example, say `R1` has 15 tokens, `R2` has 15 tokens, and `R3` has 15 tokens. They are all in the `waiting` queue and the `token_budget` is 10. After multiple scheduling rounds in `step_with_batch_queue()`, will there be 3 partial prefills in different batches in the `batch_queue`: `[R1 (10), R2 (10), R3 (10)]`? (A rough simulation of this scenario is sketched below.)
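For reference, here is a rough simulation of the pipelined behaviour described above. It is not the actual `step_with_batch_queue()` implementation, just a simplified loop that keeps scheduling new batches whenever some request is not yet in `scheduled_req_ids`:

```python
# Rough simulation (not the actual vLLM code) of batch-queue-style
# pipelined scheduling: as long as some request is not in
# scheduled_req_ids, we keep scheduling new batches instead of waiting
# for model output, so several partial prefills can be in flight at once.

token_budget = 10
remaining = {"R1": 15, "R2": 15, "R3": 15}   # prompt tokens left per request
scheduled_req_ids = set()
batch_queue = []

# Keep scheduling until every request is in flight (in scheduled_req_ids).
while any(r not in scheduled_req_ids for r in remaining):
    budget = token_budget
    batch = []
    for rid, left in remaining.items():
        if rid in scheduled_req_ids or budget == 0:
            continue
        n = min(left, budget)                 # chunk the prefill if needed
        batch.append((rid, n))
        remaining[rid] -= n
        scheduled_req_ids.add(rid)
        budget -= n
    batch_queue.append(batch)

print(batch_queue)
# -> [[('R1', 10)], [('R2', 10)], [('R3', 10)]]
# Three batches, each holding a partial prefill of a different request,
# are queued before any update_from_output() happens.
```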
And I agree that there are at most 2 partial prefills in the same batch, as you said.
So does your example describe the `step()` method in `core.py`, rather than the `step_with_batch_queue()` method?
Yes you are absolutely right. I didn’t cover the cases with batch queue, because it is only used when pipeline parallelism is enabled. When PP is enabled and batch queue is used, then we may have many partial prefill requests on the fly.
Thanks very much for your explanation