Hoping to get a high-level overview based on the vLLM 0.11.2 docs and the GitHub code: how is the KV cache being managed? Explain how it is managed from physical GPU blocks up to virtual blocks, and what layers of abstraction vLLM provides. Explain the functions used for allocating, accessing, reading, and writing KV tensors into the KV cache. I want to implement my own KV cache, so I want to get a complete understanding.
vLLM manages the KV cache using a layered abstraction. At the lowest level, physical GPU memory is pre-allocated into a pool of fixed-size blocks (KVCacheBlock), each with a unique ID, hash, and reference count. These blocks are organized in a doubly linked list for efficient O(1) allocation and eviction. The block pool is managed by a free block queue, and mappings exist from hash keys (for prefix caching) and request IDs to block IDs. This design allows vLLM to efficiently allocate, reuse, and evict blocks as needed, supporting both LRU and reference-count-based eviction policies. Logical (virtual) block tables map sequences of tokens to these physical blocks, enabling prefix sharing and cache reuse across requests. The main abstractions are: Block Pool (physical memory), Free Block Queue, Cache Blocks (hash→block ID), and Request Blocks (request ID→block IDs) (docs/design/prefix_caching.md, docs/design/v1/prefix_caching.html#data-structure).
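To make those data structures concrete, here is a minimal sketch of what this bookkeeping can look like. The class and field names are modeled on the docs (KVCacheBlock, free block queue, ref counts, block hashes) but are assumptions, not vLLM's exact implementation:

```python
# A minimal sketch of the bookkeeping structures described above. Class and field
# names are modeled on the docs (KVCacheBlock, free block queue) but are not
# vLLM's exact implementation.
from dataclasses import dataclass
from typing import Optional


@dataclass
class KVCacheBlock:
    block_id: int                                # index into the physical KV tensors
    ref_cnt: int = 0                             # number of requests using this block
    block_hash: Optional[int] = None             # prefix-cache key once the block is full
    prev_free: Optional["KVCacheBlock"] = None   # doubly linked list pointers,
    next_free: Optional["KVCacheBlock"] = None   # used only while the block is free


class FreeBlockQueue:
    """Doubly linked list of free blocks: O(1) append, pop-front (LRU), and remove."""

    def __init__(self, blocks: list["KVCacheBlock"]):
        self.head: Optional[KVCacheBlock] = None
        self.tail: Optional[KVCacheBlock] = None
        for blk in blocks:                       # every block starts out free
            self.append(blk)

    def append(self, blk: KVCacheBlock) -> None:
        blk.prev_free, blk.next_free = self.tail, None
        if self.tail:
            self.tail.next_free = blk
        else:
            self.head = blk
        self.tail = blk

    def popleft(self) -> KVCacheBlock:
        blk = self.head
        assert blk is not None, "out of free KV cache blocks"
        self.remove(blk)
        return blk

    def remove(self, blk: KVCacheBlock) -> None:
        # Unlink in O(1); used when a free block is revived by a prefix-cache hit.
        if blk.prev_free:
            blk.prev_free.next_free = blk.next_free
        else:
            self.head = blk.next_free
        if blk.next_free:
            blk.next_free.prev_free = blk.prev_free
        else:
            self.tail = blk.prev_free
        blk.prev_free = blk.next_free = None


# Block pool plus the two mappings mentioned above:
#   cached_blocks: block_hash -> KVCacheBlock   (prefix-cache lookups)
#   req_to_blocks: request_id -> [KVCacheBlock] (per-request block table)
blocks = [KVCacheBlock(block_id=i) for i in range(8192)]
free_queue = FreeBlockQueue(blocks)
cached_blocks: dict[int, KVCacheBlock] = {}
req_to_blocks: dict[str, list[KVCacheBlock]] = {}
```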
For allocation, the scheduler calls kv_cache_manager.get_computed_blocks() to find cache hits, then kv_cache_manager.allocate_slots() to allocate new blocks if needed. Accessing and updating the cache involves increasing reference counts for hits and writing new KV tensors into allocated blocks. When a request finishes or blocks are no longer needed, the reference count is decremented, and blocks are returned to the free queue if unused. The main management logic is in the KVCacheManager, which exposes these functions and coordinates between the scheduler and the block pool. For more advanced use cases (e.g., hybrid attention models, offloading, or custom policies), vLLM provides further abstraction layers like KVCacheCoordinator and pluggable connectors (docs/design/hybrid_kv_cache_manager/, api/vllm/distributed/kv_transfer/kv_connector/v1/base/). If you want to implement your own KV cache, you would need to implement the block allocation, mapping, and eviction logic, and expose compatible interfaces for allocation, access, and freeing.
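As a rough sketch of that allocate/free lifecycle from the scheduler's point of view (only the function names are taken from the docs; the return values and exact signatures below are assumptions and vary between vLLM versions):

```python
# Hedged sketch of the scheduler-side lifecycle described above.
# `kv_cache_manager` is assumed to be a KVCacheManager-like object; the exact
# return types (e.g. KVCacheBlocks) differ between vLLM versions.

def schedule_request(kv_cache_manager, request, num_new_tokens):
    # 1. Prefix-cache lookup: blocks already computed for the prompt.
    computed_blocks, num_computed_tokens = kv_cache_manager.get_computed_blocks(request)

    # 2. Allocate blocks for the tokens that still need to be computed.
    #    Assumed to return None when there is not enough free KV cache space.
    new_blocks = kv_cache_manager.allocate_slots(request, num_new_tokens, computed_blocks)
    if new_blocks is None:
        return False  # cannot schedule now; retry later or preempt another request
    return True


def finish_request(kv_cache_manager, request):
    # 3. Decrement ref counts; fully unused blocks return to the free queue but
    #    stay in the prefix cache until they are actually evicted.
    kv_cache_manager.free(request)
```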
Would you like a step-by-step breakdown of the code paths and function signatures involved?
Sources:
- docs/design/prefix_caching.md
- docs/design/v1/prefix_caching.html#data-structure
- docs/design/hybrid_kv_cache_manager/
@RunLLM yes I want the step-by-step breakdown of the code paths and function signatures involved
It seems likely that the core code paths for KV cache management in vLLM are centered around the KVCacheManager and the KVConnectorBase_V1 abstraction. Here’s a step-by-step breakdown of the main function signatures and their roles, based on the vLLM docs and code:
**Allocation and Access**
- `get_computed_blocks(request)`: Given a request, returns the sequence of blocks (cache hits) already computed for the prompt tokens (kv_cache_manager.py).
- `allocate_slots(request, num_tokens, computed_blocks)`: Allocates new blocks for tokens not already cached, updates reference counts, and returns the allocated blocks (kv_cache_manager.py).
- `KVCacheBlocks`: Data structure returned by allocation, encapsulating the block assignments (pull/17479).
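Since the goal is a custom implementation, here is an illustrative sketch of what the core of these two functions could look like on top of the KVCacheBlock/FreeBlockQueue structures sketched in the previous answer. `request.prompt_block_hashes` and `request.request_id` are assumed fields, and full-block hashing plus partial-block handling are omitted:

```python
# Illustrative continuation of the KVCacheBlock / FreeBlockQueue sketch above;
# not vLLM's actual code. Hashing of newly filled blocks is omitted.

BLOCK_SIZE = 16  # tokens per block

def get_computed_blocks(request, cached_blocks):
    """Walk the prompt block-by-block and return the longest cached prefix."""
    hits = []
    for block_hash in request.prompt_block_hashes:   # assumed precomputed per request
        block = cached_blocks.get(block_hash)
        if block is None:
            break
        hits.append(block)
    return hits, len(hits) * BLOCK_SIZE


def allocate_slots(request, num_new_tokens, computed_blocks,
                   free_queue, cached_blocks, req_to_blocks):
    """Reuse cached blocks, then take fresh blocks from the free queue."""
    num_needed = -(-num_new_tokens // BLOCK_SIZE)    # ceil division
    new_blocks = []
    for blk in computed_blocks:
        if blk.ref_cnt == 0:
            free_queue.remove(blk)                   # revive an evictable cached block
        blk.ref_cnt += 1
    for _ in range(num_needed):
        blk = free_queue.popleft()                   # may evict: drop its old hash
        if blk.block_hash is not None:
            cached_blocks.pop(blk.block_hash, None)
            blk.block_hash = None
        blk.ref_cnt = 1
        new_blocks.append(blk)
    req_to_blocks[request.request_id] = computed_blocks + new_blocks
    return new_blocks
```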
**Reading/Writing KV Tensors**
- KV tensors are written into the allocated blocks during the forward pass. The actual writing is handled internally by the model runner, but the block assignment is managed by the allocation functions above.
- For a distributed or external cache, the `KVConnectorBase_V1` interface provides:
  - `start_load_kv(forward_context, **kwargs)`: Begins loading KV cache from the connector into the paged buffer (api/vllm/distributed/kv_transfer/kv_connector/v1/base/).
  - `save_kv_layer(layer_name, kv_layer, attn_metadata, **kwargs)`: Starts saving a layer of KV cache from the paged buffer to the connector (api/vllm/distributed/kv_transfer/kv_connector/v1/base/).
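To make "written into the allocated blocks during the forward pass" concrete, here is a simplified plain-PyTorch sketch of scattering per-token key/value vectors into block-paged cache tensors via a slot mapping. In vLLM this step is performed by fused kernels inside the attention backends and the real cache layout is backend-specific, so treat the shapes and names below as illustrative:

```python
import torch

# Paged KV cache for one layer: [num_blocks, block_size, num_kv_heads, head_dim].
# Shapes are illustrative; vLLM's layout is backend-specific and written by fused kernels.
NUM_BLOCKS, BLOCK_SIZE, NUM_KV_HEADS, HEAD_DIM = 1024, 16, 8, 128
key_cache = torch.zeros(NUM_BLOCKS, BLOCK_SIZE, NUM_KV_HEADS, HEAD_DIM)
value_cache = torch.zeros_like(key_cache)


def write_kv(key, value, slot_mapping, key_cache, value_cache):
    """Scatter per-token K/V ([num_tokens, heads, head_dim]) into the paged cache.

    slot_mapping[i] = block_id * BLOCK_SIZE + offset_within_block for token i,
    derived from the block table produced by the allocation step.
    """
    block_ids = slot_mapping // BLOCK_SIZE
    offsets = slot_mapping % BLOCK_SIZE
    key_cache[block_ids, offsets] = key
    value_cache[block_ids, offsets] = value


# Example: write 3 new tokens whose next free slots are 130..132
# (i.e. block 8, offsets 2..4).
num_tokens = 3
key = torch.randn(num_tokens, NUM_KV_HEADS, HEAD_DIM)
value = torch.randn(num_tokens, NUM_KV_HEADS, HEAD_DIM)
slot_mapping = torch.tensor([130, 131, 132])
write_kv(key, value, slot_mapping, key_cache, value_cache)
```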
**Freeing and Eviction**
- When a request is finished, `free(request)` is called to decrement reference counts and return unused blocks to the free queue (kv_cache_manager.py).
- Eviction is handled by the free block queue and LRU logic within the manager.
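Continuing the same illustrative structures, the freeing side could look like this; note that freeing only makes blocks evictable, it does not erase their prefix-cache entries:

```python
# Illustrative freeing logic, continuing the earlier sketch; not vLLM's actual code.

def free(request, free_queue, req_to_blocks):
    """Drop the request's references; fully unused blocks become evictable (LRU)."""
    blocks = req_to_blocks.pop(request.request_id, [])
    # Freeing in reverse order means tail blocks sit ahead of shared prefix
    # blocks in the free queue and are therefore evicted first.
    for blk in reversed(blocks):
        blk.ref_cnt -= 1
        if blk.ref_cnt == 0:
            # Keep blk.block_hash and its cached_blocks entry: the block remains
            # usable as a prefix-cache hit until it is actually re-allocated.
            free_queue.append(blk)
```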
**Connector Abstraction**
- To implement a custom KV cache, subclass `KVConnectorBase_V1` and implement methods like `update_state_after_alloc`, `wait_for_layer_load`, `wait_for_save`, `request_finished`, and `build_connector_meta` (api/vllm/distributed/kv_transfer/kv_connector/v1/base/).
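If you go the connector route, a skeleton might look like the following. Only `start_load_kv` and `save_kv_layer` use the signatures quoted above; the other argument lists are placeholders, and your vLLM version may require additional abstract methods, so check `KVConnectorBase_V1` for the exact interface:

```python
# Skeleton of a custom KV connector. Only start_load_kv and save_kv_layer use
# the documented signatures; the other argument lists are placeholders.
from vllm.distributed.kv_transfer.kv_connector.v1.base import KVConnectorBase_V1


class MyKVConnector(KVConnectorBase_V1):
    # ---- worker side: move KV between the paged GPU buffer and your store ----
    def start_load_kv(self, forward_context, **kwargs):
        ...  # kick off (possibly async) loads of cached KV into the paged buffer

    def wait_for_layer_load(self, layer_name):          # placeholder signature
        ...  # block until this layer's KV is resident before its attention runs

    def save_kv_layer(self, layer_name, kv_layer, attn_metadata, **kwargs):
        ...  # start copying one layer's KV out of the paged buffer

    def wait_for_save(self):                             # placeholder signature
        ...  # barrier at the end of the forward pass

    # ---- scheduler side: bookkeeping around allocation and request lifetime ----
    def update_state_after_alloc(self, request, blocks, num_external_tokens):
        ...  # placeholder signature

    def build_connector_meta(self, scheduler_output):    # placeholder signature
        ...  # metadata shipped from the scheduler to the workers each step

    def request_finished(self, request, block_ids):      # placeholder signature
        ...  # e.g. report whether KV is still being saved asynchronously
```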
Would you like a code snippet or more detail on a specific function?
@RunLLM I see that the automatic prefix caching documentation mentions a block pool, while the hybrid KV cache manager docs mention a memory pool. Do "block pool" and "memory pool" mean the same thing, or are they different?
In vLLM, “block pool” and “memory pool” are often used interchangeably to refer to the pre-allocated set of fixed-size blocks that manage the physical GPU memory for KV cache. The block pool is the concrete data structure (a list of KVCacheBlock objects) that tracks allocation, while the memory pool is a more general term for the underlying GPU memory these blocks represent. Both terms describe the same resource management layer (docs/design/prefix_caching.md, docs/design/hybrid_kv_cache_manager/).
If you need to distinguish them: "block pool" is the software abstraction that manages allocation and reuse of blocks, while "memory pool" refers to the actual reserved GPU memory. In practice, the vLLM documentation and code use the two terms for the same thing.
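One way to picture the relationship (sizes and names below are illustrative, not vLLM's actual layout): the memory pool is the large tensor allocation reserved at engine startup, and the block pool is the bookkeeping layer that carves it into block-sized slices:

```python
import torch

# "Memory pool": the memory reserved for the KV cache at engine startup, viewed as
# one paged tensor per layer. The leading dim of 2 holds keys and values. Sizes are
# illustrative, and tensors are allocated on CPU here so the snippet runs anywhere;
# in vLLM this memory lives on the GPU.
NUM_LAYERS, NUM_BLOCKS, BLOCK_SIZE, NUM_KV_HEADS, HEAD_DIM = 4, 1024, 16, 8, 128
kv_memory_pool = [
    torch.empty(2, NUM_BLOCKS, BLOCK_SIZE, NUM_KV_HEADS, HEAD_DIM, dtype=torch.float16)
    for _ in range(NUM_LAYERS)
]

# "Block pool": the bookkeeping layer over that memory — one metadata record per
# physical block, plus the free queue and hash map built on top of it.
block_pool = [{"block_id": i, "ref_cnt": 0, "block_hash": None} for i in range(NUM_BLOCKS)]

# Block i in the block pool corresponds to slice kv_memory_pool[layer][:, i] in every
# layer's tensor; "freeing" a block never releases memory, it only marks the slice reusable.
```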
Would you like more detail or examples from the code?