I noticed that the client includes the entire dialogue history with each request in a long conversation. Does vLLM manage the KV cache itself?
How does it identify the different conversations during parallel processing?
vLLM does not automatically maintain or identify conversation history; the client is responsible for including the full dialogue history in each request. vLLM uses Automatic Prefix Caching (APC) to cache and reuse the KV cache for shared prefixes across requests, but it does not track conversations or user sessions itself.

Each request is treated independently, and KV cache reuse is based solely on matching token prefixes, not on conversation IDs or user metadata. Different conversations are therefore never explicitly identified by vLLM during parallel processing; cache hits occur only when the token sequences match exactly in the prefix portion of the prompt, regardless of which conversation or user sent the request.

For multi-turn chat, the client must concatenate the full history into the prompt for each turn, and vLLM will cache and reuse any shared prefix portions where possible, based on token matching rather than conversation context or user ID.
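To make the prefix-matching idea concrete, here is a toy client-side sketch. The whitespace `tokenize` function is a stand-in for the model's real tokenizer, and the prompt format is illustrative; the point is only that turn N's prompt is an exact token prefix of turn N+1's, which is what lets APC reuse the cached KV blocks:

```python
def tokenize(text):
    # Hypothetical whitespace tokenizer standing in for the model's real one.
    return text.split()

def build_prompt(history):
    # The client concatenates the full dialogue history on every turn.
    return "\n".join(f"{role}: {msg}" for role, msg in history)

def shared_prefix_len(a, b):
    # APC reuses KV cache only for the exactly-matching token prefix.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

history = [("user", "Hello"), ("assistant", "Hi! How can I help?")]
turn1 = build_prompt(history)

history.append(("user", "Tell me about vLLM"))
turn2 = build_prompt(history)

t1, t2 = tokenize(turn1), tokenize(turn2)
reused = shared_prefix_len(t1, t2)
# Every token of turn 1 is a prefix of turn 2's prompt, so its KV cache
# can be reused; only the newly appended turn needs prefill.
assert reused == len(t1)
```

Note that any change earlier in the history (e.g. an edited system prompt) breaks the exact prefix match from that point on, and everything after it must be recomputed.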
For more details, see this discussion and the documentation on Automatic Prefix Caching.
Would you like more detail on how APC works or how to optimize cache reuse in multi-turn chat?
Sources:
- https://github.com/vllm-project/vllm/issues/2766
- https://docs.vllm.ai/en/latest/features/automatic_prefix_caching/
Hint: Mention @RunLLM in the post for followups.
@RunLLM how many prefixes will APC cache? Is there a length limit?
APC in vLLM caches as many prefixes as can fit in the available KV cache GPU memory; there is no fixed limit on the number of prefixes, but the total is constrained by the size of the KV cache buffer. When the cache is full, vLLM evicts the least recently used (LRU) blocks to make space for new prefixes. There is no explicit length limit for a single prefix, but very long prefixes will consume more cache blocks, reducing the total number of unique prefixes that can be cached simultaneously. See design docs and APC docs.
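As a rough mental model (a toy sketch, not vLLM's actual implementation), the cache works at block granularity: a prefix of N tokens occupies ceil(N / block_size) blocks, and LRU eviction frees whole prefixes when capacity is exceeded. The block size of 16 tokens assumed below is vLLM's default:

```python
from collections import OrderedDict
from math import ceil

BLOCK_SIZE = 16  # vLLM's default KV-cache block size, in tokens

def blocks_needed(num_tokens, block_size=BLOCK_SIZE):
    # A prefix occupies ceil(tokens / block_size) KV-cache blocks.
    return ceil(num_tokens / block_size)

class LruBlockCache:
    """Toy model of block-granular prefix caching with LRU eviction."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.cache = OrderedDict()  # prefix key -> blocks it occupies

    def put(self, key, num_tokens):
        need = blocks_needed(num_tokens)
        if key in self.cache:
            self.cache.move_to_end(key)  # mark as recently used
            return
        # Evict least recently used prefixes until the new one fits.
        while self.cache and sum(self.cache.values()) + need > self.capacity:
            self.cache.popitem(last=False)
        if need <= self.capacity:
            self.cache[key] = need

cache = LruBlockCache(capacity_blocks=8)
cache.put("convA", 40)  # ceil(40/16) = 3 blocks
cache.put("convB", 70)  # ceil(70/16) = 5 blocks; total 8, still fits
cache.put("convC", 20)  # needs 2 blocks -> evicts convA, the LRU entry
assert "convA" not in cache.cache
```

This illustrates the trade-off from the answer above: a single very long prefix consumes many blocks, so fewer distinct prefixes fit in the same cache budget.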
Would you like more detail on how cache eviction or block sizing works?