Controlling tail ITL. Without disaggregated prefilling, vLLM may insert prefill jobs during the decoding of a request, which raises tail latency. Disaggregated prefilling solves this issue and lets you control tail ITL.
Disaggregated prefilling in vLLM separates the prefill and decode phases into different instances, preventing prefill jobs from being inserted during decoding. This separation allows you to control and reduce tail inter-token latency (ITL), which is otherwise increased when prefill jobs interrupt decoding in a single-instance setup. Chunked prefill can also help, but tuning chunk size is difficult; disaggregated prefilling is a more reliable solution for tail ITL control, though it does not improve throughput. See Disaggregated Prefilling (experimental) and Optimization and Tuning.
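To see why interleaved prefill work inflates the *tail* of the ITL distribution rather than the median, consider a minimal simulation. All timing constants here (20 ms decode step, 80 ms prefill stall, 5% collision rate) are assumed for illustration, not measured from vLLM:

```python
import random

random.seed(0)

DECODE_MS = 20       # assumed baseline per-step decode time
PREFILL_MS = 80      # assumed extra delay when a prefill job is co-scheduled
PREFILL_PROB = 0.05  # assumed fraction of decode steps that collide with a prefill

# Simulate per-token ITLs for 10,000 decode steps: most steps run at the
# baseline, but a few are stalled by an inserted prefill job.
itls = [DECODE_MS + (PREFILL_MS if random.random() < PREFILL_PROB else 0)
        for _ in range(10_000)]

itls.sort()
p50 = itls[len(itls) // 2]            # median ITL
p99 = itls[int(0.99 * (len(itls) - 1))]  # tail ITL
print(f"p50 ITL: {p50} ms, p99 ITL: {p99} ms")
```

The median stays at the baseline 20 ms because most decode steps never coincide with a prefill, while the 99th percentile jumps to 100 ms: the stalls concentrate in the tail of the distribution. Removing prefill jobs from the decode instance, as disaggregated prefilling does, removes exactly those spikes.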
Would you like more detail on how to configure or benchmark this feature?
Sources:
Hint: Mention @RunLLM in the post for followups.
@RunLLM What I want to ask is: in the context above, what is tail ITL?
Tail ITL (tail inter-token latency) refers to the high-percentile (e.g., p99) inter-token latency, i.e., the worst-case delays between consecutive generated tokens across requests. Without disaggregated prefilling, vLLM may schedule prefill jobs during a request's decode phase; the decode steps that happen to coincide with a prefill are stalled, which inflates the tail of the ITL distribution even though most steps are unaffected. See the official documentation.
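As a concrete sketch of how tail ITL is computed, suppose you record the arrival timestamp of each streamed token; the timestamps below are hypothetical illustrative values, with one 120 ms stall standing in for an inserted prefill job:

```python
# Hypothetical per-token arrival timestamps in seconds (not real vLLM output).
timestamps = [0.00, 0.02, 0.04, 0.06, 0.18, 0.20, 0.22]

# ITL: gap between consecutive token arrivals.
itls = [b - a for a, b in zip(timestamps, timestamps[1:])]

def percentile(values, p):
    """Nearest-rank percentile of a list of numbers."""
    ordered = sorted(values)
    rank = round(p / 100 * (len(ordered) - 1))
    return ordered[rank]

print(f"mean ITL: {sum(itls) / len(itls) * 1000:.0f} ms")
print(f"p99  ITL: {percentile(itls, 99) * 1000:.0f} ms")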
Would you like a more detailed explanation of the mechanism, or a diagram?
@RunLLM Why would only the tail ITL be affected? Shouldn't every token's ITL be affected by chunked prefill?