What is tail ITL in disaggregated prefilling?

Controlling tail ITL. Without disaggregated prefilling, vLLM may insert some prefill jobs during the decoding of one request. This results in higher tail latency. Disaggregated prefilling helps you solve this issue and control tail ITL.

Disaggregated prefilling in vLLM separates the prefill and decode phases into different instances, preventing prefill jobs from being inserted during decoding. This separation allows you to control and reduce tail inter-token latency (ITL), which is otherwise increased when prefill jobs interrupt decoding in a single-instance setup. Chunked prefill can also help, but tuning chunk size is difficult; disaggregated prefilling is a more reliable solution for tail ITL control, though it does not improve throughput. See Disaggregated Prefilling (experimental) and Optimization and Tuning.

Would you like more detail on how to configure or benchmark this feature?

Sources:

Hint: Mention @RunLLM in the post for followups.


@RunLLM What I'm asking is: in the context above, what is tail ITL?

Tail ITL (tail inter-token latency) refers to the high percentiles (e.g. p95/p99) of the inter-token latency distribution — the slowest gaps between consecutive generated tokens — not the tokens at the end of a sequence. Without disaggregated prefilling, vLLM may schedule prefill work in between decode steps of a request; the decode steps that get delayed behind a prefill show up as outliers at the tail of the ITL distribution, raising tail ITL. See the official documentation.
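As an illustration (not vLLM's actual metric code), here is a toy sketch of how tail ITL could be computed from streamed token timestamps; the timestamps and the nearest-rank percentile helper are assumptions for the example:

```python
# Hypothetical per-token arrival timestamps (seconds) for one streamed response.
# The jump from 0.06 to 0.21 models a decode step delayed by an inserted prefill.
token_times = [0.00, 0.02, 0.04, 0.06, 0.21, 0.23, 0.25]

# Inter-token latency (ITL): the gap between consecutive tokens.
itls = [b - a for a, b in zip(token_times, token_times[1:])]

def percentile(values, pct):
    """Nearest-rank percentile of a list of samples."""
    ordered = sorted(values)
    rank = round(pct / 100 * (len(ordered) - 1))
    return ordered[rank]

# "Tail ITL" is a high percentile of this distribution (the slowest gaps),
# not the latency of the final tokens in the sequence.
mean_itl = sum(itls) / len(itls)
p99_itl = percentile(itls, 99)
print(f"mean ITL = {mean_itl*1000:.1f} ms, p99 ITL = {p99_itl*1000:.1f} ms")
```

With these numbers, most gaps are 20 ms, so the mean stays low, while the single 150 ms stall dominates the p99 value — which is exactly what "tail ITL" captures.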

Would you like a more detailed explanation of the mechanism, or a diagram?


@RunLLM Why does it only affect the ITL of tail tokens? Shouldn't chunked prefill affect the ITL of all tokens?

Chunked prefill can indeed add some overhead to every decode step, but the large spikes are concentrated in the tail of the ITL distribution: a decode step is only badly delayed when a prefill job (or a large prefill chunk) happens to be scheduled in the same iteration and blocks the decode request. Most decode steps proceed normally, so the median ITL stays low, while the interrupted steps appear as outliers at p99. That is why the effect is described as raising tail ITL rather than uniformly raising every token's ITL. See the official documentation.
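A toy simulation (with assumed, illustrative timings — not measured vLLM numbers) shows why occasional prefill interruptions inflate the tail percentiles while leaving the median untouched:

```python
import random

random.seed(0)

# Assumed toy model: a decode step normally takes ~20 ms, but with 5%
# probability a prefill job is co-scheduled and the step stalls to ~200 ms.
DECODE_MS = 20.0
PREFILL_STALL_MS = 200.0
STALL_PROB = 0.05

itls = [PREFILL_STALL_MS if random.random() < STALL_PROB else DECODE_MS
        for _ in range(10_000)]

itls.sort()
p50 = itls[len(itls) // 2]
p99 = itls[int(0.99 * (len(itls) - 1))]
print(f"p50 = {p50:.0f} ms, p99 = {p99:.0f} ms")
```

Most steps are unaffected, so p50 stays at the normal decode time; the occasional stalls land entirely in the top percentiles, so p99 jumps to the stall duration. Disaggregated prefilling removes those stalls by running prefill on a separate instance.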

Would you like more detail on the scheduling mechanism, or experimental data?
