Will using Prefix Cache cause LLM to generate repetitive responses?

In my current use case, to improve the hit rate of the Prefix Cache, I recently adjusted the distribution strategy of model nodes—routing all requests from the same conversation to the same service node. Subsequently, users have reported a noticeable increase in repetitive responses, and I suspect there is a connection between these two changes.

Yes, a connection is likely. Routing all requests from the same conversation to the same node does increase the prefix-cache hit rate, but it also concentrates cache reuse on that node. If prefix caching is enabled and your workload involves multimodal inputs (e.g., images) or high concurrency, vLLM's prefix cache may not always distinguish requests whose text is identical but whose non-text content differs. The cache can then reuse KV blocks computed for a different request, producing repeated or garbled responses.

Disabling prefix caching (`--no-enable-prefix-caching`) has resolved similar issues for multiple users in the vLLM community, especially with multimodal or highly concurrent workloads: the cache key may not fully account for every input difference, so a cache hit on a non-identical request can surface these artifacts. See the discussion and user reports in vLLM Issue #20261.
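To make the failure mode concrete, here is a minimal toy sketch (not vLLM's actual implementation; the class and field names are invented for illustration) of why a cache keyed only on token IDs can serve stale state: two multimodal requests with identical text but different images map to the same key, so the second silently reuses the first one's cached result.

```python
# Toy illustration only -- NOT vLLM's real code. It models a prefix cache
# whose key is built solely from token IDs, ignoring non-text inputs.

class ToyPrefixCache:
    def __init__(self):
        self._blocks = {}  # key: tuple of token IDs -> cached "KV" state

    def lookup_or_compute(self, token_ids, image_digest):
        key = tuple(token_ids)  # note: image_digest is NOT part of the key
        if key in self._blocks:
            return self._blocks[key], True  # cache hit: stale state reused
        state = f"kv({key}, img={image_digest})"
        self._blocks[key] = state
        return state, False

cache = ToyPrefixCache()
state_a, hit_a = cache.lookup_or_compute([1, 2, 3], image_digest="cat.png")
state_b, hit_b = cache.lookup_or_compute([1, 2, 3], image_digest="dog.png")

# The second request gets a cache hit and reuses KV state computed
# for a *different* image:
assert hit_a is False
assert hit_b is True
assert state_b == state_a
```

A correct multimodal cache key would have to incorporate a digest of the non-text content as well; the sketch shows what happens when it does not.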

Would you like more detail on how prefix caching interacts with your workload or on mitigation strategies?


Note that in vLLM v1 prefix caching is currently enabled by default, so it can affect output quality even if you never turned it on explicitly.
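If you want to rule the cache out as the cause, you can disable it with the flag mentioned above when starting the server. This is a sketch of such an invocation; the model name and port are placeholders, not values from your deployment:

```shell
# Start vLLM with prefix caching disabled (v1 enables it by default).
vllm serve <your-model-name> \
    --port 8000 \
    --no-enable-prefix-caching
```

If repetition disappears with the flag set, that strongly suggests the per-conversation routing change exposed a cache-reuse issue rather than a model regression.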