Controlling tail ITL. Without disaggregated prefilling, vLLM may insert some prefill jobs during the decoding of one request. This results in higher tail latency. Disaggregated prefilling helps you solve this issue and control tail ITL.
Disaggregated prefilling in vLLM separates the prefill and decode phases into different instances, preventing prefill jobs from being inserted during decoding. This separation allows you to control and reduce tail inter-token latency (ITL), which is otherwise increased when prefill jobs interrupt decoding in a single-instance setup. Chunked prefill can also help, but tuning chunk size is difficult; disaggregated prefilling is a more reliable solution for tail ITL control, though it does not improve throughput. See Disaggregated Prefilling (experimental) and Optimization and Tuning.
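To make "tail ITL" concrete, here is a minimal sketch in plain Python. It does not use any vLLM API. The trace is hypothetical: it assumes 100 decode steps of 20 ms each, two of which are stretched to 200 ms because a prefill job was scheduled in between. The helper names `inter_token_latencies` and `percentile` are made up for illustration.

```python
import math

def inter_token_latencies(token_times):
    """Gaps (in seconds) between consecutive token arrival times."""
    return [b - a for a, b in zip(token_times, token_times[1:])]

def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical trace (assumption, not measured data): 100 decode steps of 20 ms,
# two of which are stalled to 200 ms by an interleaved prefill.
gaps = [0.020] * 100
gaps[30] = 0.200
gaps[70] = 0.200

# Turn the gaps into absolute token arrival timestamps.
token_times = [0.0]
for g in gaps:
    token_times.append(token_times[-1] + g)

itl = inter_token_latencies(token_times)
median_itl = percentile(itl, 50)  # typical gap between tokens
p99_itl = percentile(itl, 99)     # "tail" gap: the slowest ~1% of token gaps

print(f"median ITL: {median_itl * 1000:.1f} ms")
print(f"p99 ITL:    {p99_itl * 1000:.1f} ms")
```

In this made-up trace the median ITL stays at about 20 ms while the p99 ITL is about 200 ms: only the few token gaps that overlap a prefill are stalled, so the effect shows up in the high percentiles (the tail) rather than in the typical gap.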
Would you like more detail on how to configure or benchmark this feature?
What I'd like to ask is: in the context above, what is tail ITL?
@RunLLM What I'd like to ask is: in the context above, what is tail ITL?
@RunLLM Why would this only affect the ITL of the tail tokens? Shouldn't every token be affected by chunked prefill?