How to get KV cache values from vLLM

I want to capture the KV cache from an LLM to do some additional analysis, but I don’t know how to do it.

You can access the KV cache in vLLM through the internal APIs for KV cache management, such as those in vllm.distributed.kv_transfer.kv_connector.utils and vllm.distributed.kv_transfer.kv_lookup_buffer.base. Specifically, the model_aware_kv_ops_helper.get_kv_from_cache() method retrieves KV cache entries from a layer’s cache tensor, and the KVLookupBufferBase.drop_select() method selects and removes KV cache entries from the lookup buffer for further analysis. These APIs are intended for advanced use and require you to work directly with vLLM’s internal data structures and tensors, typically from a custom extension or plugin to the vLLM codebase. As of now, there is no simple public API for extracting the KV cache for arbitrary analysis in a standard inference workflow.
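For orientation, here is a minimal sketch of the kind of reshaping get_kv_from_cache() performs, assuming the common layout in which a layer’s cache tensor stacks the key and value caches along its first dimension; the signature and layout shown are version-dependent assumptions, not a stable API.

import torch

def get_kv_from_cache(kv_cache: torch.Tensor, num_heads: int, head_size: int):
    # Assumed layout: kv_cache[0] holds keys and kv_cache[1] holds values;
    # each is flattened to (tokens, num_heads, head_size) for inspection.
    key_cache = kv_cache[0].reshape(-1, num_heads, head_size)
    value_cache = kv_cache[1].reshape(-1, num_heads, head_size)
    return key_cache, value_cache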

For more details on the available methods and their usage, see vllm.distributed.kv_transfer.kv_connector.utils and vllm.distributed.kv_transfer.kv_lookup_buffer.base. Would you like a step-by-step example or code snippet for a specific use case?

Yes, offer a step-by-step example to get the KV cache values from an LLM, like Qwen-7B-Chat-Int8.

It seems likely that to extract the KV cache from a model like Qwen-7B-Chat-Int8 in vLLM, you would need to use internal helper classes such as model_aware_kv_ops_helper and its get_kv_from_cache() method. However, there is no direct, user-facing API for this in standard vLLM inference. The process involves accessing the KV cache after a forward pass, which typically requires modifying or extending vLLM’s codebase.
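As a rough illustration only, the sketch below shows the kind of internals you would need to reach into after a generate() call. The attribute path to the GPU cache is an assumption based on older v0 engine layouts and is not a stable interface; it will differ across vLLM releases.

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen-7B-Chat-Int8", trust_remote_code=True)
llm.generate(["The capital of France is"], SamplingParams(max_tokens=8))

# ASSUMPTION: the attribute chain below follows the older v0 worker layout
# (llm_engine -> model_executor -> driver_worker -> gpu_cache) and is not
# guaranteed to exist in current releases.
worker = llm.llm_engine.model_executor.driver_worker
per_layer_kv = worker.gpu_cache[0]  # assumed: list of per-layer KV cache tensors
print(len(per_layer_kv), per_layer_kv[0].shape)

Note that because paged attention stores the cache in fixed-size blocks, the raw tensors are not ordered by sequence position; mapping entries back to tokens also requires the scheduler’s block tables.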

Would you like a detailed, step-by-step code example for how to use model_aware_kv_ops_helper.get_kv_from_cache() within a custom vLLM workflow? This will require some familiarity with vLLM internals and PyTorch.

@RunLLM offer a detailed, step-by-step code example for how to extract the KV cache within a custom vLLM workflow.

vllm.distributed.kv_transfer.kv_lookup_buffer.base seems to have been removed in the latest version.

Currently, vLLM does not provide a public, stable API for directly extracting the KV cache from a running LLM instance in a user-facing workflow. Internal modules such as vllm.distributed.kv_transfer.kv_lookup_buffer.base have been refactored and are not intended for direct user access, and there is no official documentation or example for extracting the KV cache in the latest vLLM releases. Most discussions and PRs (e.g., #5066, #30447) indicate that this capability is not exposed for general use, and attempts to obtain it require modifying vLLM internals.

If you need to persist or share the KV cache, the recommended approach is to use a connector such as SharedStorageConnector or LMCacheConnectorV1 via the --kv-transfer-config argument. This lets vLLM instances store and retrieve KV cache between one another, but it does not provide direct extraction for arbitrary analysis. For example, you can run vLLM with:

vllm serve Qwen/Qwen-7B-Chat-Int8 \
  --kv-transfer-config '{"kv_connector":"SharedStorageConnector","kv_role":"kv_both","kv_connector_extra_config": {"shared_storage_path": "local/path"}}'

This will store the KV cache in files under the given path, but the on-disk format is not documented for external analysis. There is no step-by-step code example for extracting and analyzing KV cache values with the latest vLLM public APIs.
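If you prefer the offline API, a roughly equivalent setup is sketched below, based on vLLM’s disaggregated-prefill examples; the exact KVTransferConfig constructor arguments may differ between versions, so treat the fields shown as assumptions.

from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

# Route the KV cache through the shared-storage connector; files are written
# under ./local_storage (the path name here is arbitrary).
ktc = KVTransferConfig(
    kv_connector="SharedStorageConnector",
    kv_role="kv_both",
    kv_connector_extra_config={"shared_storage_path": "local_storage"},
)

llm = LLM(model="Qwen/Qwen-7B-Chat-Int8", trust_remote_code=True,
          kv_transfer_config=ktc)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))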
