Should vLLM consider prefix caching when chunked prefill is enabled?

Hi, I am curious about how vLLM V1 works and am currently reading its source code. I noticed that once a request is running, the scheduler no longer considers prefix caching. I think this makes sense in most cases. However, when chunked prefill is enabled, a long prompt is split into multiple chunks, so only the first chunk leverages prefix caching during execution, which may result in suboptimal cache utilization.

You are correct: in vLLM V1 with chunked prefill enabled, only the first chunk of a long prompt benefits from prefix caching. Once the initial chunk has been scheduled, the request is treated as a running request, and the scheduler does not re-check the prefix cache for subsequent chunks, which can lead to suboptimal cache utilization. This is a known limitation that has been discussed in the community, with suggestions to make the scheduler more prefix-cache-aware during chunked prefill, but as of now this behavior is by design and has not been addressed in the mainline codebase. See the discussion in vllm/issues/7883 and the design docs for prefix caching.
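
To make the behavior concrete, here is a minimal, hypothetical sketch (not actual vLLM code; the class, field, and constant names are invented for illustration) of a scheduler that consults its prefix cache only when a request is first scheduled. Later chunks of a chunked prefill simply continue from where the previous chunk stopped and never re-check the cache:

```python
# Illustrative sketch only: the prefix cache is consulted once, when a request
# moves from WAITING to RUNNING, and never again for subsequent prefill chunks.
from dataclasses import dataclass

BLOCK_SIZE = 16   # tokens per KV-cache block (illustrative value)
CHUNK_SIZE = 64   # max prefill tokens scheduled per step (illustrative value)


@dataclass
class Request:
    prompt_tokens: list[int]
    computed_tokens: int = 0      # tokens whose KV cache already exists
    status: str = "WAITING"       # WAITING -> RUNNING


class ToyScheduler:
    def __init__(self) -> None:
        # Maps a hashable token-block key -> cached block id.
        self.prefix_cache: dict[tuple[int, ...], int] = {}

    def _cached_prefix_len(self, tokens: list[int]) -> int:
        """Count how many leading tokens are covered by cached blocks."""
        hit = 0
        for start in range(0, len(tokens) - BLOCK_SIZE + 1, BLOCK_SIZE):
            key = tuple(tokens[start:start + BLOCK_SIZE])
            if key in self.prefix_cache:
                hit += BLOCK_SIZE
            else:
                break
        return hit

    def schedule_step(self, req: Request) -> int:
        """Return how many new prompt tokens to compute in this step."""
        if req.status == "WAITING":
            # The prefix cache is checked only here, on first scheduling.
            req.computed_tokens = self._cached_prefix_len(req.prompt_tokens)
            req.status = "RUNNING"
        # For RUNNING requests (later chunks), no cache lookup happens:
        # scheduling continues from wherever the previous chunk stopped,
        # even if blocks further into the prompt were cached in the meantime.
        remaining = len(req.prompt_tokens) - req.computed_tokens
        chunk = min(remaining, CHUNK_SIZE)
        req.computed_tokens += chunk
        return chunk
```

A more prefix-cache-aware variant would repeat the lookup at the start of every `schedule_step`, at the cost of extra block hashing per step; that trade-off is roughly what the linked discussion is about.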

Would you like more detail on the technical reasons for this behavior or possible workarounds?
