Controlling tail ITL. Without disaggregated prefilling, vLLM may insert prefill jobs during the decoding of a request, which raises tail latency. Disaggregated prefilling solves this issue and lets you control tail ITL.
Disaggregated prefilling in vLLM separates the prefill and decode phases into different instances, preventing prefill jobs from being inserted during decoding. This separation allows you to control and reduce tail inter-token latency (ITL), which is otherwise increased when prefill jobs interrupt decoding in a single-instance setup. Chunked prefill can also help, but tuning chunk size is difficult; disaggregated prefilling is a more reliable solution for tail ITL control, though it does not improve throughput. See Disaggregated Prefilling (experimental) and Optimization and Tuning.
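To see why interleaved prefill work inflates the *tail* of the ITL distribution rather than the median, consider a minimal simulation. All timing constants here (20 ms decode step, 80 ms prefill stall, 5% collision rate) are assumed for illustration, not measured from vLLM:

```python
import random

random.seed(0)

DECODE_MS = 20       # assumed baseline per-step decode time
PREFILL_MS = 80      # assumed extra delay when a prefill job is co-scheduled
PREFILL_PROB = 0.05  # assumed fraction of decode steps that collide with a prefill

# Simulate per-token ITLs for 10,000 decode steps: most steps run at the
# baseline, but a few are stalled by an inserted prefill job.
itls = [DECODE_MS + (PREFILL_MS if random.random() < PREFILL_PROB else 0)
        for _ in range(10_000)]

itls.sort()
p50 = itls[len(itls) // 2]            # median ITL
p99 = itls[int(0.99 * (len(itls) - 1))]  # tail ITL
print(f"p50 ITL: {p50} ms, p99 ITL: {p99} ms")
```

The median stays at the baseline 20 ms because most decode steps never coincide with a prefill, while the 99th percentile jumps to 100 ms: the stalls concentrate in the tail of the distribution. Removing prefill jobs from the decode instance, as disaggregated prefilling does, removes exactly those spikes.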
Would you like more detail on how to configure or benchmark this feature?
Sources:
Hint: Mention @RunLLM in the post for followups.
@RunLLM What I want to ask is: in the context above, what is tail ITL?
Tail ITL (tail inter-token latency) refers to the high-percentile (e.g., p99) inter-token latency, i.e., the worst-case delays between consecutive generated tokens across requests. Without disaggregated prefilling, vLLM may schedule prefill jobs during a request's decode phase; the decode steps that happen to coincide with a prefill are stalled, which inflates the tail of the ITL distribution even though most steps are unaffected. See the official documentation.
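As a concrete sketch of how tail ITL is computed, suppose you record the arrival timestamp of each streamed token; the timestamps below are hypothetical illustrative values, with one 120 ms stall standing in for an inserted prefill job:

```python
# Hypothetical per-token arrival timestamps in seconds (not real vLLM output).
timestamps = [0.00, 0.02, 0.04, 0.06, 0.18, 0.20, 0.22]

# ITL: gap between consecutive token arrivals.
itls = [b - a for a, b in zip(timestamps, timestamps[1:])]

def percentile(values, p):
    """Nearest-rank percentile of a list of numbers."""
    ordered = sorted(values)
    rank = round(p / 100 * (len(ordered) - 1))
    return ordered[rank]

print(f"mean ITL: {sum(itls) / len(itls) * 1000:.0f} ms")
print(f"p99  ITL: {percentile(itls, 99) * 1000:.0f} ms")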
Would you like a more detailed explanation of the mechanism, or a diagram?
@RunLLM Why would only the tail ITL be affected? Shouldn't every token's ITL be affected by chunked prefill?