Hoping to get a high-level overview based on the vLLM 0.11.2 docs and the GitHub code: how is the KV cache being managed? Explain how it is managed from physical GPU blocks up to virtual blocks, and what layers of abstraction vLLM provides. Explain the functions used for allocating, accessing, reading, and writing KV tensors into the KV cache. I want to implement my own KV cache, so I want to get a complete understanding.
vLLM manages the KV cache using a layered abstraction. At the lowest level, physical GPU memory is pre-allocated into a pool of fixed-size blocks (KVCacheBlock), each with a unique ID, hash, and reference count. These blocks are organized in a doubly linked list for efficient O(1) allocation and eviction. The block pool is managed by a free block queue, and mappings exist from hash keys (for prefix caching) and request IDs to block IDs. This design allows vLLM to efficiently allocate, reuse, and evict blocks as needed, supporting both LRU and reference-count-based eviction policies. Logical (virtual) block tables map sequences of tokens to these physical blocks, enabling prefix sharing and cache reuse across requests. The main abstractions are: Block Pool (physical memory), Free Block Queue, Cache Blocks (hash→block ID), and Request Blocks (request ID→block IDs) (docs/design/prefix_caching.md, docs/design/v1/prefix_caching.html#data-structure).
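To make those data structures concrete, here is a minimal sketch of what this bookkeeping can look like. The class and field names are modeled on the docs (KVCacheBlock, free block queue, ref counts, block hashes) but are assumptions, not vLLM's exact implementation:

```python
# A minimal sketch of the bookkeeping structures described above. Class and field
# names are modeled on the docs (KVCacheBlock, free block queue) but are not
# vLLM's exact implementation.
from dataclasses import dataclass
from typing import Optional


@dataclass
class KVCacheBlock:
    block_id: int                                # index into the physical KV tensors
    ref_cnt: int = 0                             # number of requests using this block
    block_hash: Optional[int] = None             # prefix-cache key once the block is full
    prev_free: Optional["KVCacheBlock"] = None   # doubly linked list pointers,
    next_free: Optional["KVCacheBlock"] = None   # used only while the block is free


class FreeBlockQueue:
    """Doubly linked list of free blocks: O(1) append, pop-front (LRU), and remove."""

    def __init__(self, blocks: list["KVCacheBlock"]):
        self.head: Optional[KVCacheBlock] = None
        self.tail: Optional[KVCacheBlock] = None
        for blk in blocks:                       # every block starts out free
            self.append(blk)

    def append(self, blk: KVCacheBlock) -> None:
        blk.prev_free, blk.next_free = self.tail, None
        if self.tail:
            self.tail.next_free = blk
        else:
            self.head = blk
        self.tail = blk

    def popleft(self) -> KVCacheBlock:
        blk = self.head
        assert blk is not None, "out of free KV cache blocks"
        self.remove(blk)
        return blk

    def remove(self, blk: KVCacheBlock) -> None:
        # Unlink in O(1); used when a free block is revived by a prefix-cache hit.
        if blk.prev_free:
            blk.prev_free.next_free = blk.next_free
        else:
            self.head = blk.next_free
        if blk.next_free:
            blk.next_free.prev_free = blk.prev_free
        else:
            self.tail = blk.prev_free
        blk.prev_free = blk.next_free = None


# Block pool plus the two mappings mentioned above:
#   cached_blocks: block_hash -> KVCacheBlock   (prefix-cache lookups)
#   req_to_blocks: request_id -> [KVCacheBlock] (per-request block table)
blocks = [KVCacheBlock(block_id=i) for i in range(8192)]
free_queue = FreeBlockQueue(blocks)
cached_blocks: dict[int, KVCacheBlock] = {}
req_to_blocks: dict[str, list[KVCacheBlock]] = {}
```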
For allocation, the scheduler calls kv_cache_manager.get_computed_blocks() to find cache hits, then kv_cache_manager.allocate_slots() to allocate new blocks if needed. Accessing and updating the cache involves increasing reference counts for hits and writing new KV tensors into allocated blocks. When a request finishes or blocks are no longer needed, the reference count is decremented, and blocks are returned to the free queue if unused. The main management logic is in the KVCacheManager, which exposes these functions and coordinates between the scheduler and the block pool. For more advanced use cases (e.g., hybrid attention models, offloading, or custom policies), vLLM provides further abstraction layers like KVCacheCoordinator and pluggable connectors (docs/design/hybrid_kv_cache_manager/, api/vllm/distributed/kv_transfer/kv_connector/v1/base/). If you want to implement your own KV cache, you would need to implement the block allocation, mapping, and eviction logic, and expose compatible interfaces for allocation, access, and freeing.
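As a rough sketch of that allocate/free lifecycle from the scheduler's point of view (only the function names are taken from the docs; the return values and exact signatures below are assumptions and vary between vLLM versions):

```python
# Hedged sketch of the scheduler-side lifecycle described above.
# `kv_cache_manager` is assumed to be a KVCacheManager-like object; the exact
# return types (e.g. KVCacheBlocks) differ between vLLM versions.

def schedule_request(kv_cache_manager, request, num_new_tokens):
    # 1. Prefix-cache lookup: blocks already computed for the prompt.
    computed_blocks, num_computed_tokens = kv_cache_manager.get_computed_blocks(request)

    # 2. Allocate blocks for the tokens that still need to be computed.
    #    Assumed to return None when there is not enough free KV cache space.
    new_blocks = kv_cache_manager.allocate_slots(request, num_new_tokens, computed_blocks)
    if new_blocks is None:
        return False  # cannot schedule now; retry later or preempt another request
    return True


def finish_request(kv_cache_manager, request):
    # 3. Decrement ref counts; fully unused blocks return to the free queue but
    #    stay in the prefix cache until they are actually evicted.
    kv_cache_manager.free(request)
```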
Would you like a step-by-step breakdown of the code paths and function signatures involved?
Sources:
- docs/design/prefix_caching.md
- docs/design/v1/prefix_caching.html#data-structure
- docs/design/hybrid_kv_cache_manager/
@RunLLM yes I want the step-by-step breakdown of the code paths and function signatures involved
It seems likely that the core code paths for KV cache management in vLLM are centered around the KVCacheManager and the KVConnectorBase_V1 abstraction. Here’s a step-by-step breakdown of the main function signatures and their roles, based on the vLLM docs and code:
**Allocation and Access**
- `get_computed_blocks(request)`: Given a request, returns the sequence of blocks (cache hits) already computed for the prompt tokens (kv_cache_manager.py).
- `allocate_slots(request, num_tokens, computed_blocks)`: Allocates new blocks for tokens not already cached, updates reference counts, and returns the allocated blocks (kv_cache_manager.py).
- `KVCacheBlocks`: Data structure returned by allocation, encapsulating the block assignments (pull/17479).
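Since the goal is a custom implementation, here is an illustrative sketch of what the core of these two functions could look like on top of the KVCacheBlock/FreeBlockQueue structures sketched in the previous answer. `request.prompt_block_hashes` and `request.request_id` are assumed fields, and full-block hashing plus partial-block handling are omitted:

```python
# Illustrative continuation of the KVCacheBlock / FreeBlockQueue sketch above;
# not vLLM's actual code. Hashing of newly filled blocks is omitted.

BLOCK_SIZE = 16  # tokens per block

def get_computed_blocks(request, cached_blocks):
    """Walk the prompt block-by-block and return the longest cached prefix."""
    hits = []
    for block_hash in request.prompt_block_hashes:   # assumed precomputed per request
        block = cached_blocks.get(block_hash)
        if block is None:
            break
        hits.append(block)
    return hits, len(hits) * BLOCK_SIZE


def allocate_slots(request, num_new_tokens, computed_blocks,
                   free_queue, cached_blocks, req_to_blocks):
    """Reuse cached blocks, then take fresh blocks from the free queue."""
    num_needed = -(-num_new_tokens // BLOCK_SIZE)    # ceil division
    new_blocks = []
    for blk in computed_blocks:
        if blk.ref_cnt == 0:
            free_queue.remove(blk)                   # revive an evictable cached block
        blk.ref_cnt += 1
    for _ in range(num_needed):
        blk = free_queue.popleft()                   # may evict: drop its old hash
        if blk.block_hash is not None:
            cached_blocks.pop(blk.block_hash, None)
            blk.block_hash = None
        blk.ref_cnt = 1
        new_blocks.append(blk)
    req_to_blocks[request.request_id] = computed_blocks + new_blocks
    return new_blocks
```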
**Reading/Writing KV Tensors**
- KV tensors are written into the allocated blocks during the forward pass. The actual writing is handled internally by the model runner, but the block assignment is managed by the allocation functions above.
- For a distributed or external cache, the `KVConnectorBase_V1` interface provides:
  - `start_load_kv(forward_context, **kwargs)`: Begins loading KV cache from the connector into the paged buffer (api/vllm/distributed/kv_transfer/kv_connector/v1/base/).
  - `save_kv_layer(layer_name, kv_layer, attn_metadata, **kwargs)`: Starts saving a layer of KV cache from the paged buffer to the connector (api/vllm/distributed/kv_transfer/kv_connector/v1/base/).
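To make "written into the allocated blocks during the forward pass" concrete, here is a simplified plain-PyTorch sketch of scattering per-token key/value vectors into block-paged cache tensors via a slot mapping. In vLLM this step is performed by fused kernels inside the attention backends and the real cache layout is backend-specific, so treat the shapes and names below as illustrative:

```python
import torch

# Paged KV cache for one layer: [num_blocks, block_size, num_kv_heads, head_dim].
# Shapes are illustrative; vLLM's layout is backend-specific and written by fused kernels.
NUM_BLOCKS, BLOCK_SIZE, NUM_KV_HEADS, HEAD_DIM = 1024, 16, 8, 128
key_cache = torch.zeros(NUM_BLOCKS, BLOCK_SIZE, NUM_KV_HEADS, HEAD_DIM)
value_cache = torch.zeros_like(key_cache)


def write_kv(key, value, slot_mapping, key_cache, value_cache):
    """Scatter per-token K/V ([num_tokens, heads, head_dim]) into the paged cache.

    slot_mapping[i] = block_id * BLOCK_SIZE + offset_within_block for token i,
    derived from the block table produced by the allocation step.
    """
    block_ids = slot_mapping // BLOCK_SIZE
    offsets = slot_mapping % BLOCK_SIZE
    key_cache[block_ids, offsets] = key
    value_cache[block_ids, offsets] = value


# Example: write 3 new tokens whose next free slots are 130..132
# (i.e. block 8, offsets 2..4).
num_tokens = 3
key = torch.randn(num_tokens, NUM_KV_HEADS, HEAD_DIM)
value = torch.randn(num_tokens, NUM_KV_HEADS, HEAD_DIM)
slot_mapping = torch.tensor([130, 131, 132])
write_kv(key, value, slot_mapping, key_cache, value_cache)
```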
**Freeing and Eviction**
- When a request is finished, `free(request)` is called to decrement reference counts and return unused blocks to the free queue (kv_cache_manager.py).
- Eviction is handled by the free block queue and LRU logic within the manager.
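Continuing the same illustrative structures, the freeing side could look like this; note that freeing only makes blocks evictable, it does not erase their prefix-cache entries:

```python
# Illustrative freeing logic, continuing the earlier sketch; not vLLM's actual code.

def free(request, free_queue, req_to_blocks):
    """Drop the request's references; fully unused blocks become evictable (LRU)."""
    blocks = req_to_blocks.pop(request.request_id, [])
    # Freeing in reverse order means tail blocks sit ahead of shared prefix
    # blocks in the free queue and are therefore evicted first.
    for blk in reversed(blocks):
        blk.ref_cnt -= 1
        if blk.ref_cnt == 0:
            # Keep blk.block_hash and its cached_blocks entry: the block remains
            # usable as a prefix-cache hit until it is actually re-allocated.
            free_queue.append(blk)
```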
**Connector Abstraction**
- To implement a custom KV cache, subclass `KVConnectorBase_V1` and implement methods like `update_state_after_alloc`, `wait_for_layer_load`, `wait_for_save`, `request_finished`, and `build_connector_meta` (api/vllm/distributed/kv_transfer/kv_connector/v1/base/).
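If you go the connector route, a skeleton might look like the following. Only `start_load_kv` and `save_kv_layer` use the signatures quoted above; the other argument lists are placeholders, and your vLLM version may require additional abstract methods, so check `KVConnectorBase_V1` for the exact interface:

```python
# Skeleton of a custom KV connector. Only start_load_kv and save_kv_layer use
# the documented signatures; the other argument lists are placeholders.
from vllm.distributed.kv_transfer.kv_connector.v1.base import KVConnectorBase_V1


class MyKVConnector(KVConnectorBase_V1):
    # ---- worker side: move KV between the paged GPU buffer and your store ----
    def start_load_kv(self, forward_context, **kwargs):
        ...  # kick off (possibly async) loads of cached KV into the paged buffer

    def wait_for_layer_load(self, layer_name):          # placeholder signature
        ...  # block until this layer's KV is resident before its attention runs

    def save_kv_layer(self, layer_name, kv_layer, attn_metadata, **kwargs):
        ...  # start copying one layer's KV out of the paged buffer

    def wait_for_save(self):                             # placeholder signature
        ...  # barrier at the end of the forward pass

    # ---- scheduler side: bookkeeping around allocation and request lifetime ----
    def update_state_after_alloc(self, request, blocks, num_external_tokens):
        ...  # placeholder signature

    def build_connector_meta(self, scheduler_output):    # placeholder signature
        ...  # metadata shipped from the scheduler to the workers each step

    def request_finished(self, request, block_ids):      # placeholder signature
        ...  # e.g. report whether KV is still being saved asynchronously
```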
Would you like a code snippet or more detail on a specific function?
@RunLLM I see that the automatic prefix caching documentation mentions a block pool, while the hybrid KV cache manager docs mention a memory pool. Do "block pool" and "memory pool" mean the same thing, or are they different?
In vLLM, “block pool” and “memory pool” are often used interchangeably to refer to the pre-allocated set of fixed-size blocks that manage the physical GPU memory for KV cache. The block pool is the concrete data structure (a list of KVCacheBlock objects) that tracks allocation, while the memory pool is a more general term for the underlying GPU memory these blocks represent. Both terms describe the same resource management layer (docs/design/prefix_caching.md, docs/design/hybrid_kv_cache_manager/).
If you need to distinguish them: "block pool" is the software abstraction that manages allocation and reuse of blocks, while "memory pool" refers to the actual reserved GPU memory. In practice, the vLLM documentation and code use the two terms for the same thing.
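One way to picture the relationship (sizes and names below are illustrative, not vLLM's actual layout): the memory pool is the large tensor allocation reserved at engine startup, and the block pool is the bookkeeping layer that carves it into block-sized slices:

```python
import torch

# "Memory pool": the memory reserved for the KV cache at engine startup, viewed as
# one paged tensor per layer. The leading dim of 2 holds keys and values. Sizes are
# illustrative, and tensors are allocated on CPU here so the snippet runs anywhere;
# in vLLM this memory lives on the GPU.
NUM_LAYERS, NUM_BLOCKS, BLOCK_SIZE, NUM_KV_HEADS, HEAD_DIM = 4, 1024, 16, 8, 128
kv_memory_pool = [
    torch.empty(2, NUM_BLOCKS, BLOCK_SIZE, NUM_KV_HEADS, HEAD_DIM, dtype=torch.float16)
    for _ in range(NUM_LAYERS)
]

# "Block pool": the bookkeeping layer over that memory — one metadata record per
# physical block, plus the free queue and hash map built on top of it.
block_pool = [{"block_id": i, "ref_cnt": 0, "block_hash": None} for i in range(NUM_BLOCKS)]

# Block i in the block pool corresponds to slice kv_memory_pool[layer][:, i] in every
# layer's tensor; "freeing" a block never releases memory, it only marks the slice reusable.
```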
Would you like more detail or examples from the code?