In disaggregated prefill, how does the decode instance know that it is running the decode phase? How does it know that the prefill KV cache exists and that it should not redo prefill?
In disaggregated prefilling, the decode instance (the KV consumer) determines whether a prefill KV cache exists by trying to retrieve it from the prefill instance (the KV producer) through the Connector and LookupBuffer abstractions. It calls the blocking drop_select API to fetch the KV cache for a given request. If the KV cache is found, the decode instance skips prefill and proceeds directly to decoding. If it is not available (e.g., due to a transfer failure), the decode instance redoes the prefill for that request. In other words, prefill is skipped only when the corresponding KV cache has been successfully received from the prefill instance, as described in the official documentation and development notes.
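The hit-or-recompute decision described above can be sketched with a toy in-memory buffer. This is an illustration only, not vLLM's implementation: `ToyLookupBuffer`, `decode_step`, the string return values, and the `timeout` parameter (used here to simulate a failed transfer) are all hypothetical names and assumptions. Only the `insert` / `drop_select` naming follows the description above.

```python
import threading
from typing import Dict, List, Optional


class ToyLookupBuffer:
    """Hypothetical stand-in for a KV lookup buffer (not vLLM's class)."""

    def __init__(self) -> None:
        self._store: Dict[str, List[float]] = {}
        self._cond = threading.Condition()

    def insert(self, request_id: str, kv: List[float]) -> None:
        # Producer side: publish the KV cache and wake any waiting consumer.
        with self._cond:
            self._store[request_id] = kv
            self._cond.notify_all()

    def drop_select(self, request_id: str, timeout: float) -> Optional[List[float]]:
        # Consumer side: block until the entry arrives, then remove and return it.
        # The timeout is an assumption of this sketch, modelling a failed transfer.
        with self._cond:
            found = self._cond.wait_for(
                lambda: request_id in self._store, timeout=timeout
            )
            return self._store.pop(request_id) if found else None


def decode_step(
    buffer: ToyLookupBuffer,
    request_id: str,
    prompt: List[str],
    timeout: float = 0.05,
) -> str:
    """Return which path the decode side took for this request."""
    kv = buffer.drop_select(request_id, timeout)
    if kv is not None:
        # KV cache received from the producer: skip prefill, start decoding.
        return "skipped_prefill"
    # No KV cache arrived: fall back to recomputing prefill locally.
    # The list below is placeholder data standing in for a real KV cache.
    kv = [float(len(token)) for token in prompt]
    return "recomputed_prefill"
```

For example, after `buffer.insert("req-1", [0.1, 0.2])`, a call to `decode_step(buffer, "req-1", ["a", "b"])` takes the skip path. A second call for the same request falls back to recomputation, because `drop_select` removes the entry it returns.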
Would you like a step-by-step breakdown of the code flow or more detail on the APIs involved?