Currently, I understand that enable_prefix_caching only supports reuse when the prompt prefix is exactly the same. My use case is slightly different:
I want to reuse the KV cache from previously processed tokens, recomposing or concatenating the KV corresponding to those tokens.
My question is simply:
Does vLLM expose (or plan to expose) any interface to access or manipulate KV cache (e.g. KV blocks / handles), so that users could experiment with KV recomposition beyond exact prefix caching?
vLLM currently does not expose a public interface for direct access to or manipulation of KV cache blocks (e.g., to recompose or concatenate KV from arbitrary previous tokens). The only officially supported form of KV cache reuse is automatic prefix caching, which applies when a prompt's prefix exactly matches that of a previous request, as described in the documentation and confirmed in multiple discussions. There is no documented API for arbitrary KV block access or recomposition at this time. There are ongoing discussions and RFCs about generalized KV cache reuse and compositional context, but these have not yet become user-facing features or APIs (source, source, source).
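To see why reuse is limited to exact prefixes, it helps to look at how prefix caching identifies reusable blocks. The following is a toy illustration (not vLLM's actual code; the block size and hashing details are simplified assumptions): cached blocks are keyed by a hash that chains over all tokens from position 0, so any divergence earlier in the prompt changes every later block key, which is what rules out reusing KV from a matching span in the middle of a prompt.

```python
# Toy sketch of chained block hashing, as used conceptually by
# automatic prefix caching. NOT vLLM internals.
import hashlib

BLOCK_SIZE = 4  # tokens per KV block (illustrative; vLLM's default differs)

def block_keys(token_ids):
    """Key each full block by hashing it together with its prefix's key."""
    keys, parent = [], ""
    n_full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, n_full, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        parent = hashlib.sha256((parent + repr(block)).encode()).hexdigest()
        keys.append(parent)
    return keys

a = block_keys([1, 2, 3, 4, 5, 6, 7, 8])
b = block_keys([1, 2, 3, 4, 9, 9, 9, 9])  # same first block, then diverges
c = block_keys([0, 2, 3, 4, 5, 6, 7, 8])  # diverges at position 0

assert a[0] == b[0]  # identical prefix block -> cache hit possible
assert a[1] != b[1]  # after divergence, no block key ever matches again
assert a[0] != c[0]  # early divergence invalidates all later blocks
```

Because each key incorporates the parent key, a block holding identical tokens at a different position (or after a different prefix) hashes differently, so recomposing KV from non-prefix positions cannot work through this cache without new machinery.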
Recent RFCs propose more flexible KV cache reuse (e.g., for non-prefix-complete or compositional contexts), but these are still under development and not available as public APIs. To experiment with KV recomposition today, you would need to modify vLLM internals or track these features as they are developed (source, source).
Would you like more detail on the ongoing RFCs or internal design for generalized KV cache reuse?