I noticed that the client includes the entire dialogue history with each request in a long conversation. Does vLLM manage the KV cache itself?
How does it identify the different conversations during parallel processing?
vLLM does not automatically maintain or identify conversation history; the client is responsible for including the full dialogue history in each request. vLLM uses Automatic Prefix Caching (APC) to cache and reuse the KV cache for shared prefixes across requests, but it does not track conversations or user sessions itself.

Each request is treated independently, and KV cache reuse is based solely on matching token prefixes, not on conversation IDs or user metadata. Different conversations are therefore never explicitly identified by vLLM during parallel processing; cache hits occur only when the token sequences match exactly in the prefix portion of the prompt, regardless of which conversation or user sent the request.

For multi-turn chat, the client must concatenate the full history into the prompt for each turn, and vLLM will cache and reuse any shared prefix portions where possible, based on token matching rather than conversation context or user ID.
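To make the prefix-matching idea concrete, here is a toy client-side sketch. The whitespace `tokenize` function is a stand-in for the model's real tokenizer, and the prompt format is illustrative; the point is only that turn N's prompt is an exact token prefix of turn N+1's, which is what lets APC reuse the cached KV blocks:

```python
def tokenize(text):
    # Hypothetical whitespace tokenizer standing in for the model's real one.
    return text.split()

def build_prompt(history):
    # The client concatenates the full dialogue history on every turn.
    return "\n".join(f"{role}: {msg}" for role, msg in history)

def shared_prefix_len(a, b):
    # APC reuses KV cache only for the exactly-matching token prefix.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

history = [("user", "Hello"), ("assistant", "Hi! How can I help?")]
turn1 = build_prompt(history)

history.append(("user", "Tell me about vLLM"))
turn2 = build_prompt(history)

t1, t2 = tokenize(turn1), tokenize(turn2)
reused = shared_prefix_len(t1, t2)
# Every token of turn 1 is a prefix of turn 2's prompt, so its KV cache
# can be reused; only the newly appended turn needs prefill.
assert reused == len(t1)
```

Note that any change earlier in the history (e.g. an edited system prompt) breaks the exact prefix match from that point on, and everything after it must be recomputed.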
For more details, see this discussion and the documentation on Automatic Prefix Caching.
Would you like more detail on how APC works or how to optimize cache reuse in multi-turn chat?
Sources:
- https://github.com/vllm-project/vllm/issues/2766
- https://docs.vllm.ai/en/latest/features/automatic_prefix_caching/
Hint: Mention @RunLLM in the post for followups.
@RunLLM how many prefixes will APC cache? Is there a length limit?
APC in vLLM caches as many prefixes as can fit in the available KV cache GPU memory; there is no fixed limit on the number of prefixes, but the total is constrained by the size of the KV cache buffer. When the cache is full, vLLM evicts the least recently used (LRU) blocks to make space for new prefixes. There is no explicit length limit for a single prefix, but very long prefixes will consume more cache blocks, reducing the total number of unique prefixes that can be cached simultaneously. See design docs and APC docs.
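As a rough mental model (a toy sketch, not vLLM's actual implementation), the cache works at block granularity: a prefix of N tokens occupies ceil(N / block_size) blocks, and LRU eviction frees whole prefixes when capacity is exceeded. The block size of 16 tokens assumed below is vLLM's default:

```python
from collections import OrderedDict
from math import ceil

BLOCK_SIZE = 16  # vLLM's default KV-cache block size, in tokens

def blocks_needed(num_tokens, block_size=BLOCK_SIZE):
    # A prefix occupies ceil(tokens / block_size) KV-cache blocks.
    return ceil(num_tokens / block_size)

class LruBlockCache:
    """Toy model of block-granular prefix caching with LRU eviction."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.cache = OrderedDict()  # prefix key -> blocks it occupies

    def put(self, key, num_tokens):
        need = blocks_needed(num_tokens)
        if key in self.cache:
            self.cache.move_to_end(key)  # mark as recently used
            return
        # Evict least recently used prefixes until the new one fits.
        while self.cache and sum(self.cache.values()) + need > self.capacity:
            self.cache.popitem(last=False)
        if need <= self.capacity:
            self.cache[key] = need

cache = LruBlockCache(capacity_blocks=8)
cache.put("convA", 40)  # ceil(40/16) = 3 blocks
cache.put("convB", 70)  # ceil(70/16) = 5 blocks; total 8, still fits
cache.put("convC", 20)  # needs 2 blocks -> evicts convA, the LRU entry
assert "convA" not in cache.cache
```

This illustrates the trade-off from the answer above: a single very long prefix consumes many blocks, so fewer distinct prefixes fit in the same cache budget.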
Would you like more detail on how cache eviction or block sizing works?