In disagg prefill, how does the decode instance know that it is running the decode phase?

PratishthaGaur · July 25, 2025, 6:28pm

In disagg prefill, how does the decode instance know that it is running the decode phase? How does it know that the prefill KV cache exists and that it should not redo prefill?

RunLLM · July 25, 2025, 6:28pm

In disaggregated prefilling, the decode instance (the KV consumer) learns whether the prefill KV cache exists by trying to retrieve it from the prefill instance (the KV producer) through the Connector and LookupBuffer abstractions. It calls the blocking drop_select API to fetch the KV cache for a given request. If the KV cache is found, the decode instance skips prefill and proceeds directly to decoding. If the KV cache is not available (e.g., due to a transfer failure), the decode instance redoes the prefill for that request. In other words, prefill is skipped only when the corresponding KV cache has been successfully received from the prefill instance, as described in the official documentation and development notes.
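To make the hit/miss logic concrete, here is a minimal sketch in plain Python. It is not vLLM code: `ToyLookupBuffer`, `fake_prefill`, and `decode_step` are hypothetical names invented for illustration, the "KV cache" is just a list of floats, and this toy `drop_select` returns immediately instead of blocking. It only illustrates the idea that a successful lookup lets the decode side skip prefill, while a miss falls back to recomputing it.

```python
from typing import Dict, List, Optional, Tuple


class ToyLookupBuffer:
    """Hypothetical in-memory stand-in for a KV lookup buffer (not vLLM's class)."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[int, ...], List[float]] = {}

    def insert(self, tokens: List[int], kv: List[float]) -> None:
        # Producer side: publish the KV computed for this prompt.
        self._store[tuple(tokens)] = kv

    def drop_select(self, tokens: List[int]) -> Optional[List[float]]:
        # Consumer side: return the matching entry and remove it from the
        # buffer; return None on a miss. (Assumption: non-blocking toy version.)
        return self._store.pop(tuple(tokens), None)


def fake_prefill(tokens: List[int]) -> List[float]:
    # Placeholder for the real prefill computation.
    return [float(t) for t in tokens]


def decode_step(buffer: ToyLookupBuffer, tokens: List[int]) -> Tuple[str, List[float]]:
    kv = buffer.drop_select(tokens)
    if kv is not None:
        # KV was received from the prefill side: skip prefill.
        return "reused_kv", kv
    # KV missing (e.g. transfer failed): redo prefill locally.
    return "recomputed_prefill", fake_prefill(tokens)


if __name__ == "__main__":
    buf = ToyLookupBuffer()
    buf.insert([1, 2, 3], fake_prefill([1, 2, 3]))  # prefill instance publishes KV
    print(decode_step(buf, [1, 2, 3])[0])  # hit: KV is reused
    print(decode_step(buf, [1, 2, 3])[0])  # miss: entry was dropped by the first call
    print(decode_step(buf, [9, 9])[0])     # miss: never published
```

In this sketch the decode side never receives an explicit "you are decoding" signal; the presence or absence of the KV entry alone decides which path is taken.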

Would you like a step-by-step breakdown of the code flow or more detail on the APIs involved?

