Judging prefix-cache effectiveness from GPU KV cache usage and Prefix cache hit rate

INFO 08-12 13:28:03 [loggers.py:118] Engine 000: Avg prompt throughput: 194.4 tokens/s, Avg generation throughput: 53.5 tokens/s, Running: 5 reqs, Waiting: 1 reqs, GPU KV cache usage: 99.1%, Prefix cache hit rate: 55.1%
INFO 08-12 13:28:13 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 38.7 tokens/s, Running: 4 reqs, Waiting: 2 reqs, GPU KV cache usage: 92.8%, Prefix cache hit rate: 53.6%
INFO 08-12 13:28:23 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 58.0 tokens/s, Running: 4 reqs, Waiting: 2 reqs, GPU KV cache usage: 93.6%, Prefix cache hit rate: 55.4%
INFO 08-12 13:28:33 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 58.4 tokens/s, Running: 4 reqs, Waiting: 2 reqs, GPU KV cache usage: 94.4%, Prefix cache hit rate: 56.9%
INFO 08-12 13:28:43 [loggers.py:118] Engine 000: Avg prompt throughput: 619.0 tokens/s, Avg generation throughput: 46.0 tokens/s, Running: 4 reqs, Waiting: 3 reqs, GPU KV cache usage: 99.0%, Prefix cache hit rate: 57.5%
INFO 08-12 13:28:53 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 57.6 tokens/s, Running: 4 reqs, Waiting: 3 reqs, GPU KV cache usage: 99.8%, Prefix cache hit rate: 56.8%
INFO 08-12 13:29:03 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 47.4 tokens/s, Running: 3 reqs, Waiting: 4 reqs, GPU KV cache usage: 92.8%, Prefix cache hit rate: 58.3%
INFO 08-12 13:29:13 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 43.5 tokens/s, Running: 3 reqs, Waiting: 4 reqs, GPU KV cache usage: 93.4%, Prefix cache hit rate: 60.2%
INFO 08-12 13:29:23 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 43.5 tokens/s, Running: 3 reqs, Waiting: 4 reqs, GPU KV cache usage: 94.0%, Prefix cache hit rate: 61.5%
INFO 08-12 13:29:33 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 43.5 tokens/s, Running: 3 reqs, Waiting: 4 reqs, GPU KV cache usage: 94.6%, Prefix cache hit rate: 62.3%
INFO 08-12 13:29:43 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 43.2 tokens/s, Running: 3 reqs, Waiting: 4 reqs, GPU KV cache usage: 95.2%, Prefix cache hit rate: 62.7%
INFO 08-12 13:29:53 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 43.5 tokens/s, Running: 3 reqs, Waiting: 4 reqs, GPU KV cache usage: 95.8%, Prefix cache hit rate: 62.8%
INFO 08-12 13:30:03 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 43.5 tokens/s, Running: 3 reqs, Waiting: 4 reqs, GPU KV cache usage: 96.4%, Prefix cache hit rate: 62.6%
INFO 08-12 13:30:13 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 25.4 tokens/s, Running: 3 reqs, Waiting: 4 reqs, GPU KV cache usage: 97.7%, Prefix cache hit rate: 62.4%
INFO 08-12 13:30:23 [loggers.py:118] Engine 000: Avg prompt throughput: 227.6 tokens/s, Avg generation throughput: 32.6 tokens/s, Running: 6 reqs, Waiting: 0 reqs, GPU KV cache usage: 94.4%, Prefix cache hit rate: 62.4%
INFO 08-12 13:30:33 [loggers.py:118] Engine 000: Avg prompt throughput: 475.8 tokens/s, Avg generation throughput: 47.9 tokens/s, Running: 4 reqs, Waiting: 1 reqs, GPU KV cache usage: 96.9%, Prefix cache hit rate: 61.9%
INFO 08-12 13:30:43 [loggers.py:118] Engine 000: Avg prompt throughput: 414.7 tokens/s, Avg generation throughput: 31.4 tokens/s, Running: 4 reqs, Waiting: 1 reqs, GPU KV cache usage: 96.3%, Prefix cache hit rate: 61.5%
INFO 08-12 13:30:53 [loggers.py:118] Engine 000: Avg prompt throughput: 292.0 tokens/s, Avg generation throughput: 50.7 tokens/s, Running: 4 reqs, Waiting: 2 reqs, GPU KV cache usage: 97.8%, Prefix cache hit rate: 61.4%
INFO 08-12 13:31:03 [loggers.py:118] Engine 000: Avg prompt throughput: 630.2 tokens/s, Avg generation throughput: 28.2 tokens/s, Running: 5 reqs, Waiting: 0 reqs, GPU KV cache usage: 25.1%, Prefix cache hit rate: 61.2%
INFO 08-12 13:31:13 [loggers.py:118] Engine 000: Avg prompt throughput: 686.7 tokens/s, Avg generation throughput: 20.5 tokens/s, Running: 4 reqs, Waiting: 0 reqs, GPU KV cache usage: 22.5%, Prefix cache hit rate: 61.2%
INFO 08-12 13:31:23 [loggers.py:118] Engine 000: Avg prompt throughput: 212.3 tokens/s, Avg generation throughput: 44.4 tokens/s, Running: 5 reqs, Waiting: 0 reqs, GPU KV cache usage: 24.6%, Prefix cache hit rate: 61.1%
INFO 08-12 13:31:33 [loggers.py:118] Engine 000: Avg prompt throughput: 1061.9 tokens/s, Avg generation throughput: 25.9 tokens/s, Running: 4 reqs, Waiting: 0 reqs, GPU KV cache usage: 29.7%, Prefix cache hit rate: 61.1%

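Part of the log is shown above. I have a few questions:

1. What do GPU KV cache usage and Prefix cache hit rate each mean, and how do I use these two values to judge how well the cache is working?

2. GPU KV cache usage is 97.8% at 08-12 13:30:53 and 25.1% at 08-12 13:31:03. Why does it suddenly drop so much?

3. What trend in these values indicates that a bottleneck has been reached? Does "bottleneck" mean "usage grows while hit rate does not"?

4. What does the bottleneck depend on: the number of concurrent requests, or the input/output content?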

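1. GPU KV cache usage is the share of KV cache blocks on the GPU that are currently in use (used blocks / total blocks). Prefix cache hit rate is the share of the prompt prefix that was found in the cache and could be reused directly instead of being recomputed. The two metrics answer different questions: the hit rate tells you how well cache reuse is working, while usage tells you how much memory pressure the cache is under. High usage on its own does not mean caching is effective. A high hit rate means prefixes are being reused and prefill work is being saved; high usage with a low hit rate means the cache is full but poorly reused; low usage means there is little pressure on the cache and room to hold more reusable prefixes. (See the official documentation and the code implementation.)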
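To track both metrics over time rather than reading them by eye, the stats lines can be parsed. The following is a minimal sketch that assumes the log format stays exactly as in the excerpt above; `STATS_RE` and `parse_stats` are names made up for this example and are not part of vLLM.

```python
import re

# Matches the tail of the periodic stats line as it appears in the log above.
# The exact wording is taken from the excerpt and may differ in other versions.
STATS_RE = re.compile(
    r"Running: (?P<running>\d+) reqs, "
    r"Waiting: (?P<waiting>\d+) reqs, "
    r"GPU KV cache usage: (?P<usage>[\d.]+)%, "
    r"Prefix cache hit rate: (?P<hit>[\d.]+)%"
)


def parse_stats(line):
    """Return the scheduler and cache fields of one stats line, or None."""
    m = STATS_RE.search(line)
    if m is None:
        return None
    return {
        "running": int(m["running"]),
        "waiting": int(m["waiting"]),
        "kv_usage_pct": float(m["usage"]),
        "prefix_hit_pct": float(m["hit"]),
    }


# First line of the log excerpt, split across string literals for readability.
line = (
    "INFO 08-12 13:28:03 [loggers.py:118] Engine 000: "
    "Avg prompt throughput: 194.4 tokens/s, "
    "Avg generation throughput: 53.5 tokens/s, "
    "Running: 5 reqs, Waiting: 1 reqs, "
    "GPU KV cache usage: 99.1%, Prefix cache hit rate: 55.1%"
)
stats = parse_stats(line)
print(stats)
```

Feeding every line of the log through `parse_stats` gives a series that can be plotted or checked against thresholds.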

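2. A sudden drop in GPU KV cache usage from 97.8% to 25.1% usually means that a large number of requests finished or were cleaned up and their cache blocks were released. The log is consistent with this: over the same interval, Waiting falls from 2 to 0, so the queued requests could be admitted once space was freed. This is normal cache reclamation and simply shows that a batch of requests reached the end of its lifetime. (See the related logging implementation.)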

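3. When usage stays high for a long time and the hit rate stops rising, the cache has reached its bottleneck (that is, "usage grows, hit rate does not"): there is not enough space to hold more reusable prefixes, so the hit rate can hardly improve. The Waiting count is a useful companion signal. In the log above, usage stays between roughly 93% and 100% from 13:28:03 to 13:30:13 while 1 to 4 requests are queued, which suggests that requests are waiting for KV cache space. Once the bottleneck is reached, adding concurrency or input length causes frequent cache eviction and hurts performance.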
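As a rough illustration of that check, the sketch below counts the log intervals in which the cache is nearly full and requests are queued at the same time. The 90% threshold is an assumption chosen for this example, not a vLLM default, and `saturated_intervals` is a hypothetical helper name.

```python
def saturated_intervals(samples, usage_threshold=90.0):
    """Count samples where KV cache usage is high AND requests are waiting.

    samples: iterable of (waiting_reqs, kv_usage_pct) tuples.
    usage_threshold: assumed cutoff for "nearly full"; tune for your setup.
    """
    return sum(
        1
        for waiting, usage in samples
        if usage >= usage_threshold and waiting > 0
    )


# (Waiting, GPU KV cache usage %) copied from the log excerpt above.
samples = [
    (4, 97.7),  # 13:30:13
    (0, 94.4),  # 13:30:23
    (1, 96.9),  # 13:30:33
    (1, 96.3),  # 13:30:43
    (2, 97.8),  # 13:30:53
    (0, 25.1),  # 13:31:03
    (0, 22.5),  # 13:31:13
]

count = saturated_intervals(samples)
print(f"{count} of {len(samples)} intervals look saturated")
```

A run of consecutive saturated intervals is a stronger signal than a single one, since usage can briefly touch the threshold during a prefill burst.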

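4. The bottleneck depends on both concurrency and input/output content. Higher concurrency and longer inputs or outputs consume more cache blocks and bring the bottleneck closer, while requests that repeat the same prefixes are what raise the hit rate. When the bottleneck is reached, consider enlarging the cache (for example by raising gpu_memory_utilization) or reshaping the request distribution so that more of the cache is reused. (See the official metrics documentation.)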
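To get a feel for how concurrency and sequence length translate into cache pressure, here is a back-of-the-envelope sketch. It assumes a standard multi-head or grouped-query attention model, where each token stores one key and one value vector per layer per KV head. The model dimensions and the 10 GiB budget are hypothetical, since the thread does not say which model or GPU is in use.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """KV cache bytes needed for one token: a K and a V vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def tokens_that_fit(kv_budget_bytes, bytes_per_token):
    """Number of tokens whose KV entries fit in the given memory budget."""
    return kv_budget_bytes // bytes_per_token


# Hypothetical model: 32 layers, 8 KV heads, head_dim 128, fp16 (2 bytes).
per_token = kv_bytes_per_token(
    num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_elem=2
)

# Hypothetical budget: 10 GiB of GPU memory left over for the KV cache.
capacity = tokens_that_fit(10 * 1024**3, per_token)

print(f"{per_token} bytes per token, room for {capacity} tokens")
```

Dividing the token capacity by the typical prompt-plus-output length gives a rough upper bound on how many requests can run concurrently before the cache fills.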

Would you like a more detailed explanation of the underlying principles, or a walkthrough of the code?
