We encountered an issue where a request began repeating its output endlessly, and other requests sharing the same prefix (usually the same system prompt) then also repeated in the same stable way. The cause of the first request's infinite repetition is unknown. Slightly modifying the system prompt of a request immediately eliminated the repetition. We suspect this is related to the prefix-cache feature, but after consulting the relevant documentation, we found only an option to turn it off, and no option to adjust the invalidation time.
vLLM’s prefix cache (automatic prefix caching) reuses cached key-value (KV) blocks for identical token prefixes, such as a shared system prompt, to speed up inference. If a request with a certain prefix produces infinite repetition or corrupted output, subsequent requests with the same prefix may reuse the problematic cached blocks, repeating the error. Modifying the system prompt changes the prefix hash, so the cache is not reused and the issue disappears.

There is currently no built-in option to set a cache invalidation time (TTL). Cache eviction is managed automatically by an LRU (least recently used) policy, and blocks are evicted only when the KV cache is full and new requests require memory. As of now, you can only disable prefix caching entirely with the --no-enable-prefix-caching flag or reset the cache via the API; you cannot set a custom expiration time for cached prefixes. See details in prefix caching design, cache eviction, and API for cache reset.
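To illustrate why requests with the same system prompt hit the same cached blocks, here is a simplified sketch of chained block hashing in the spirit of the prefix-caching design. The block size, hash function, and `block_hashes` helper are illustrative assumptions, not vLLM's actual implementation:

```python
import hashlib
from typing import List

BLOCK_SIZE = 4  # illustrative; vLLM's real block size is configurable and larger


def block_hashes(tokens: List[int]) -> List[str]:
    """Hash each full token block, chaining in the previous block's hash,
    so a block's identity depends on the entire prefix before it."""
    hashes: List[str] = []
    prev = ""
    for i in range(0, len(tokens) - len(tokens) % BLOCK_SIZE, BLOCK_SIZE):
        block = tokens[i:i + BLOCK_SIZE]
        digest = hashlib.sha256((prev + repr(block)).encode()).hexdigest()[:12]
        hashes.append(digest)
        prev = digest
    return hashes


# Two requests sharing the same system-prompt prefix produce identical
# leading block hashes, so the second request reuses the first one's
# cached KV blocks -- including any blocks tied to the corrupted state.
system_prompt = [101, 102, 103, 104, 105, 106, 107, 108]
req_a = system_prompt + [1, 2, 3, 4]
req_b = system_prompt + [5, 6, 7, 8]
assert block_hashes(req_a)[:2] == block_hashes(req_b)[:2]

# Changing even one token of the system prompt changes every block hash
# from that point on, which is why editing the prompt clears the issue.
modified = [999] + system_prompt[1:] + [1, 2, 3, 4]
assert block_hashes(modified)[0] != block_hashes(req_a)[0]
```

If your vLLM version exposes the cache-reset endpoint mentioned above, it can typically be invoked against a running OpenAI-compatible server with something like `curl -X POST http://localhost:8000/reset_prefix_cache` (check your version's API docs, as availability varies by release).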
Would you like more detail on how to programmatically reset the prefix cache or on the LRU eviction mechanism?
@RunLLM How do I programmatically reset the prefix cache, and how does the LRU eviction mechanism work?