Should vLLM consider prefix caching when chunked prefill is enabled?

Hi, I am curious about how vLLM V1 works and am currently reading its source code. I noticed that once a request is running, the scheduler no longer considers prefix caching. I think this makes sense in most cases. However, when chunked prefill is enabled, a long prompt is split into multiple chunks, so only the first chunk leverages prefix caching during execution, which may result in suboptimal cache utilization.

You are correct: in vLLM V1 with chunked prefill enabled, only the first chunk of a long prompt benefits from prefix caching. Once the initial chunk has been scheduled, the request is treated as a running request, and the scheduler does not re-check the prefix cache for subsequent chunks, which can lead to suboptimal cache utilization. This is a known limitation that has been discussed in the community, with suggestions to make the scheduler more prefix-cache-aware during chunked prefill, but as of now this behavior is by design and has not been addressed in the mainline codebase. See the discussion in vllm/issues/7883 and the design docs for prefix caching.
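
To make the behavior concrete, here is a minimal, hypothetical sketch (not actual vLLM code; the class, field, and constant names are invented for illustration) of a scheduler that consults its prefix cache only when a request is first scheduled. Later chunks of a chunked prefill simply continue from where the previous chunk stopped and never re-check the cache:

```python
# Illustrative sketch only: the prefix cache is consulted once, when a request
# moves from WAITING to RUNNING, and never again for subsequent prefill chunks.
from dataclasses import dataclass

BLOCK_SIZE = 16   # tokens per KV-cache block (illustrative value)
CHUNK_SIZE = 64   # max prefill tokens scheduled per step (illustrative value)


@dataclass
class Request:
    prompt_tokens: list[int]
    computed_tokens: int = 0      # tokens whose KV cache already exists
    status: str = "WAITING"       # WAITING -> RUNNING


class ToyScheduler:
    def __init__(self) -> None:
        # Maps a hashable token-block key -> cached block id.
        self.prefix_cache: dict[tuple[int, ...], int] = {}

    def _cached_prefix_len(self, tokens: list[int]) -> int:
        """Count how many leading tokens are covered by cached blocks."""
        hit = 0
        for start in range(0, len(tokens) - BLOCK_SIZE + 1, BLOCK_SIZE):
            key = tuple(tokens[start:start + BLOCK_SIZE])
            if key in self.prefix_cache:
                hit += BLOCK_SIZE
            else:
                break
        return hit

    def schedule_step(self, req: Request) -> int:
        """Return how many new prompt tokens to compute in this step."""
        if req.status == "WAITING":
            # The prefix cache is checked only here, on first scheduling.
            req.computed_tokens = self._cached_prefix_len(req.prompt_tokens)
            req.status = "RUNNING"
        # For RUNNING requests (later chunks), no cache lookup happens:
        # scheduling continues from wherever the previous chunk stopped,
        # even if blocks further into the prompt were cached in the meantime.
        remaining = len(req.prompt_tokens) - req.computed_tokens
        chunk = min(remaining, CHUNK_SIZE)
        req.computed_tokens += chunk
        return chunk
```

A more prefix-cache-aware variant would repeat the lookup at the start of every `schedule_step`, at the cost of extra block hashing per step; that trade-off is roughly what the linked discussion is about.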

Would you like more detail on the technical reasons for this behavior or possible workarounds?
