Why is the prefix cache hit rate constantly increasing?

I send a 207-token prompt and generate 30 output tokens, with block_size=32. In theory, 192 of the 207 prompt tokens (6 full blocks) should hit the cache, and the remaining 15 are not enough to form a block. So the prefix cache hit rate should be a fixed 192/207 = 92.75%.
In practice, when I send requests serially to the /v1/chat/completions endpoint, the reported "Prefix cache hit rate: GPU" keeps increasing as the number of requests grows. Why is this?

Can you share more information such as your engine config, and the logs?

I figured out why.
The metric records the aggregate hit rate across all requests, not the fraction of blocks that each individual request hits.
For example, if the same request is sent multiple times (counting each repeat as a full hit for simplicity), the GPU prefix cache hit rate is:
req 1: 0
req 2: 1/2 = 50%
req 3: 2/3 = 66.67%
req 4: 3/4 = 75%
req 5: 4/5 = 80%
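
To make the arithmetic concrete with the numbers from the question, here is a rough Python sketch of a cumulative hit rate. It is a simplified model, not vLLM's actual accounting: it assumes the counter tracks tokens, that the first request is a complete miss, and that every repeat hits all full blocks of the prompt. The function names `cached_tokens` and `cumulative_hit_rates` are made up for illustration.

```python
def cached_tokens(prompt_len: int, block_size: int) -> int:
    """Tokens covered by full blocks; a trailing partial block cannot be reused."""
    return (prompt_len // block_size) * block_size


def cumulative_hit_rates(num_requests: int, prompt_len: int = 207,
                         block_size: int = 32) -> list[float]:
    """Aggregate hit rate after each of `num_requests` identical requests.

    Assumption: request 1 is a full miss, every later request hits all
    full blocks of the prompt.
    """
    queried = 0  # total prompt tokens looked up so far
    hits = 0     # total prompt tokens served from the cache so far
    rates = []
    for i in range(num_requests):
        queried += prompt_len
        if i > 0:
            hits += cached_tokens(prompt_len, block_size)
        rates.append(hits / queried)
    return rates


if __name__ == "__main__":
    for n, rate in enumerate(cumulative_hit_rates(5), start=1):
        print(f"req {n}: {rate:.2%}")
```

Under these assumptions the aggregate rate starts at 0, rises with every repeat, and approaches the per-request value of 192/207 without ever reaching it, which matches the "keeps increasing" behavior in the logs.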


That's right. The log is emitted on a time interval, so it reports the metric from the perspective of the whole system rather than individual requests. In v0, the hit rate covers all requests processed since the engine launched. In v1, the hit rate is based on the most recent 1k requests.
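
As a rough illustration of the difference, a windowed hit rate can be sketched in Python with a bounded deque. This is not vLLM's implementation; the class `WindowedHitRate` and its method names are hypothetical, and the token-based counting is an assumption.

```python
from collections import deque


class WindowedHitRate:
    """Hit rate over the most recent `max_requests` requests (hypothetical helper)."""

    def __init__(self, max_requests: int = 1000):
        # deque(maxlen=...) silently drops the oldest entry once it is full.
        self.window = deque(maxlen=max_requests)

    def observe(self, queried: int, hits: int) -> None:
        """Record one request: tokens looked up and tokens served from cache."""
        self.window.append((queried, hits))

    @property
    def hit_rate(self) -> float:
        total_queried = sum(q for q, _ in self.window)
        total_hits = sum(h for _, h in self.window)
        return total_hits / total_queried if total_queried else 0.0


if __name__ == "__main__":
    tracker = WindowedHitRate(max_requests=3)
    tracker.observe(207, 0)        # first request: cold cache
    for _ in range(3):
        tracker.observe(207, 192)  # repeats: 6 full blocks hit
    print(f"{tracker.hit_rate:.2%}")
```

With a window like this, the initial cold-cache miss eventually drops out, so the reported rate can settle at the per-request value. With a lifetime counter as described for v0, the miss stays in the denominator forever and the rate only approaches that value.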
