Why is the prefix cache hit rate constantly increasing?

flypanda · March 26, 2025, 2:33pm

I send a 207-token prompt and generate 30 output tokens, with block_size=32. In theory, 192 of the 207 prompt tokens (6 full blocks) hit the cache, and the remaining 15 are not enough to fill a block. So the prefix hit rate should be a fixed 92.75%.
In practice, when testing the /v1/chat/completions endpoint serially, the reported "prefix cache hit rate: GPU" keeps increasing as the number of requests grows. Why is this?
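For reference, the per-request arithmetic above can be sketched as follows. This is only an illustration of the reasoning in this post, assuming that only full blocks can be reused; `expected_hit_tokens` is a hypothetical helper name, not a vLLM function.

```python
def expected_hit_tokens(prompt_tokens: int, block_size: int) -> int:
    """Tokens covered by full blocks; a trailing partial block cannot be reused."""
    full_blocks = prompt_tokens // block_size
    return full_blocks * block_size


prompt_tokens = 207
block_size = 32

hit_tokens = expected_hit_tokens(prompt_tokens, block_size)
per_request_rate = hit_tokens / prompt_tokens

print(f"hit tokens: {hit_tokens}")            # 6 full blocks * 32 = 192
print(f"hit rate:   {per_request_rate:.2%}")  # 192 / 207
```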

comaniac · March 26, 2025, 5:31pm

Can you share more information such as your engine config, and the logs?

flypanda · March 27, 2025, 2:04am

I figured out why.
The metric records the hit rate across all requests so far, not the fraction of blocks that each individual request hits.
For example, if the same request is sent multiple times, the "prefix cache hit rate: GPU" values are:
req 1: 0
req 2: 1/2 = 50%
req 3: 2/3 = 66.67%
req 4: 3/4 = 75%
req 5: 4/5 = 80%
…
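A minimal sketch of this running average, assuming the metric is simply accumulated hits divided by accumulated queries, and that the first request misses entirely while every repeat hits the same amount. The function and parameter names are hypothetical and only serve the illustration.

```python
def cumulative_hit_rates(num_requests: int,
                         queried_per_request: int,
                         hit_per_repeat: int) -> list[float]:
    """Running hit rate after each request when the same request is repeated."""
    rates = []
    hits = 0
    queries = 0
    for i in range(num_requests):
        queries += queried_per_request
        if i > 0:  # the first request finds nothing in the cache
            hits += hit_per_repeat
        rates.append(hits / queries)
    return rates


# Simplified view from the list above: each repeat counts as one full hit.
simple = cumulative_hit_rates(5, queried_per_request=1, hit_per_repeat=1)

# Same idea with the numbers from the question: 192 of 207 tokens hit per repeat.
tokens = cumulative_hit_rates(5, queried_per_request=207, hit_per_repeat=192)

print([f"{r:.2%}" for r in simple])
print([f"{r:.2%}" for r in tokens])
```

Under these assumptions the value keeps rising with every request but never exceeds the per-request rate of 192/207, which it approaches as the cold first request is averaged out.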

comaniac · March 27, 2025, 5:51am

That’s right. The log is emitted on a time interval, so it shows the metrics from the overall system perspective rather than for individual requests. In v0, the hit rate is computed over all requests processed since the engine launched. In v1, the hit rate is computed over the most recent 1k requests.
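To illustrate the difference a bounded window makes, here is a rough sketch of a hit rate computed over only the most recent N requests. It is not vLLM's actual implementation; the `WindowedHitRate` class and its methods are hypothetical names, and the window is shrunk to 3 requests to keep the example short.

```python
from collections import deque


class WindowedHitRate:
    """Hit rate over the most recent `max_requests` requests (illustrative only)."""

    def __init__(self, max_requests: int = 1000) -> None:
        # Each entry is (queried, hit) for one request; old entries fall off.
        self.window: deque[tuple[int, int]] = deque(maxlen=max_requests)

    def observe(self, queried: int, hit: int) -> None:
        self.window.append((queried, hit))

    @property
    def hit_rate(self) -> float:
        total_queried = sum(q for q, _ in self.window)
        total_hit = sum(h for _, h in self.window)
        return total_hit / total_queried if total_queried else 0.0


stats = WindowedHitRate(max_requests=3)
stats.observe(queried=207, hit=0)        # cold first request
rate_after_cold = stats.hit_rate
for _ in range(3):
    stats.observe(queried=207, hit=192)  # repeats hit 6 full blocks
rate_after_warm = stats.hit_rate

print(f"{rate_after_cold:.2%} -> {rate_after_warm:.2%}")
```

With a window like this, once the cold request has dropped out, the reported value settles at the per-request rate instead of creeping toward it forever, as it would with an all-time average.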
