INFO 08-12 13:28:03 [loggers.py:118] Engine 000: Avg prompt throughput: 194.4 tokens/s, Avg generation throughput: 53.5 tokens/s, Running: 5 reqs, Waiting: 1 reqs, GPU KV cache usage: 99.1%, Prefix cache hit rate: 55.1%
INFO 08-12 13:28:13 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 38.7 tokens/s, Running: 4 reqs, Waiting: 2 reqs, GPU KV cache usage: 92.8%, Prefix cache hit rate: 53.6%
INFO 08-12 13:28:23 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 58.0 tokens/s, Running: 4 reqs, Waiting: 2 reqs, GPU KV cache usage: 93.6%, Prefix cache hit rate: 55.4%
INFO 08-12 13:28:33 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 58.4 tokens/s, Running: 4 reqs, Waiting: 2 reqs, GPU KV cache usage: 94.4%, Prefix cache hit rate: 56.9%
INFO 08-12 13:28:43 [loggers.py:118] Engine 000: Avg prompt throughput: 619.0 tokens/s, Avg generation throughput: 46.0 tokens/s, Running: 4 reqs, Waiting: 3 reqs, GPU KV cache usage: 99.0%, Prefix cache hit rate: 57.5%
INFO 08-12 13:28:53 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 57.6 tokens/s, Running: 4 reqs, Waiting: 3 reqs, GPU KV cache usage: 99.8%, Prefix cache hit rate: 56.8%
INFO 08-12 13:29:03 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 47.4 tokens/s, Running: 3 reqs, Waiting: 4 reqs, GPU KV cache usage: 92.8%, Prefix cache hit rate: 58.3%
INFO 08-12 13:29:13 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 43.5 tokens/s, Running: 3 reqs, Waiting: 4 reqs, GPU KV cache usage: 93.4%, Prefix cache hit rate: 60.2%
INFO 08-12 13:29:23 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 43.5 tokens/s, Running: 3 reqs, Waiting: 4 reqs, GPU KV cache usage: 94.0%, Prefix cache hit rate: 61.5%
INFO 08-12 13:29:33 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 43.5 tokens/s, Running: 3 reqs, Waiting: 4 reqs, GPU KV cache usage: 94.6%, Prefix cache hit rate: 62.3%
INFO 08-12 13:29:43 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 43.2 tokens/s, Running: 3 reqs, Waiting: 4 reqs, GPU KV cache usage: 95.2%, Prefix cache hit rate: 62.7%
INFO 08-12 13:29:53 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 43.5 tokens/s, Running: 3 reqs, Waiting: 4 reqs, GPU KV cache usage: 95.8%, Prefix cache hit rate: 62.8%
INFO 08-12 13:30:03 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 43.5 tokens/s, Running: 3 reqs, Waiting: 4 reqs, GPU KV cache usage: 96.4%, Prefix cache hit rate: 62.6%
INFO 08-12 13:30:13 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 25.4 tokens/s, Running: 3 reqs, Waiting: 4 reqs, GPU KV cache usage: 97.7%, Prefix cache hit rate: 62.4%
INFO 08-12 13:30:23 [loggers.py:118] Engine 000: Avg prompt throughput: 227.6 tokens/s, Avg generation throughput: 32.6 tokens/s, Running: 6 reqs, Waiting: 0 reqs, GPU KV cache usage: 94.4%, Prefix cache hit rate: 62.4%
INFO 08-12 13:30:33 [loggers.py:118] Engine 000: Avg prompt throughput: 475.8 tokens/s, Avg generation throughput: 47.9 tokens/s, Running: 4 reqs, Waiting: 1 reqs, GPU KV cache usage: 96.9%, Prefix cache hit rate: 61.9%
INFO 08-12 13:30:43 [loggers.py:118] Engine 000: Avg prompt throughput: 414.7 tokens/s, Avg generation throughput: 31.4 tokens/s, Running: 4 reqs, Waiting: 1 reqs, GPU KV cache usage: 96.3%, Prefix cache hit rate: 61.5%
INFO 08-12 13:30:53 [loggers.py:118] Engine 000: Avg prompt throughput: 292.0 tokens/s, Avg generation throughput: 50.7 tokens/s, Running: 4 reqs, Waiting: 2 reqs, GPU KV cache usage: 97.8%, Prefix cache hit rate: 61.4%
INFO 08-12 13:31:03 [loggers.py:118] Engine 000: Avg prompt throughput: 630.2 tokens/s, Avg generation throughput: 28.2 tokens/s, Running: 5 reqs, Waiting: 0 reqs, GPU KV cache usage: 25.1%, Prefix cache hit rate: 61.2%
INFO 08-12 13:31:13 [loggers.py:118] Engine 000: Avg prompt throughput: 686.7 tokens/s, Avg generation throughput: 20.5 tokens/s, Running: 4 reqs, Waiting: 0 reqs, GPU KV cache usage: 22.5%, Prefix cache hit rate: 61.2%
INFO 08-12 13:31:23 [loggers.py:118] Engine 000: Avg prompt throughput: 212.3 tokens/s, Avg generation throughput: 44.4 tokens/s, Running: 5 reqs, Waiting: 0 reqs, GPU KV cache usage: 24.6%, Prefix cache hit rate: 61.1%
INFO 08-12 13:31:33 [loggers.py:118] Engine 000: Avg prompt throughput: 1061.9 tokens/s, Avg generation throughput: 25.9 tokens/s, Running: 4 reqs, Waiting: 0 reqs, GPU KV cache usage: 29.7%, Prefix cache hit rate: 61.1%
Part of the log is shown above. I have a few questions:
1. What do GPU KV cache usage and Prefix cache hit rate each mean, and how can these two values be used to judge how effective the cache is?
2. At 08-12 13:30:53 the GPU KV cache usage is 97.8%, but at 08-12 13:31:03 it is 25.1%. Why does it suddenly drop so much?
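3. What trend in these values indicates that a bottleneck has been reached? Does "bottleneck" mean that usage keeps rising while the hit rate stops rising?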
4. What does the bottleneck depend on: the number of concurrent requests, or the input/output content?
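To look at the trends these questions are about (for example the sharp change in question 2), the stats lines above can be turned into structured per-interval records. This is a minimal Python sketch that assumes every line has exactly the field order and wording shown in this log. `EngineStats`, `parse_line`, and `kv_usage_jumps` are hypothetical helper names, and the 20-percentage-point threshold is an arbitrary illustrative choice, not a value taken from the engine.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Matches one "Engine NNN: ..." stats line as it appears in the log above.
# Assumption: field order and wording are exactly as shown there.
STATS_RE = re.compile(
    r"INFO (?P<ts>\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[loggers\.py:\d+\] "
    r"Engine (?P<engine>\d+): "
    r"Avg prompt throughput: (?P<prompt_tps>[\d.]+) tokens/s, "
    r"Avg generation throughput: (?P<gen_tps>[\d.]+) tokens/s, "
    r"Running: (?P<running>\d+) reqs, "
    r"Waiting: (?P<waiting>\d+) reqs, "
    r"GPU KV cache usage: (?P<kv_usage>[\d.]+)%, "
    r"Prefix cache hit rate: (?P<hit_rate>[\d.]+)%"
)


@dataclass
class EngineStats:
    """One logging interval (hypothetical helper type)."""
    ts: str
    prompt_tps: float  # Avg prompt throughput, tokens/s
    gen_tps: float     # Avg generation throughput, tokens/s
    running: int       # requests currently running
    waiting: int       # requests waiting in the queue
    kv_usage: float    # GPU KV cache usage, percent
    hit_rate: float    # prefix cache hit rate, percent


def parse_line(line: str) -> Optional[EngineStats]:
    """Parse one stats line; return None if it does not match the format."""
    m = STATS_RE.search(line)
    if m is None:
        return None
    return EngineStats(
        ts=m["ts"],
        prompt_tps=float(m["prompt_tps"]),
        gen_tps=float(m["gen_tps"]),
        running=int(m["running"]),
        waiting=int(m["waiting"]),
        kv_usage=float(m["kv_usage"]),
        hit_rate=float(m["hit_rate"]),
    )


def kv_usage_jumps(
    stats: List[EngineStats], threshold: float = 20.0
) -> List[Tuple[str, str, float]]:
    """Return (previous ts, current ts, change) for consecutive intervals whose
    KV cache usage changed by more than `threshold` percentage points."""
    jumps = []
    for prev, cur in zip(stats, stats[1:]):
        delta = cur.kv_usage - prev.kv_usage
        if abs(delta) > threshold:
            jumps.append((prev.ts, cur.ts, round(delta, 1)))
    return jumps


if __name__ == "__main__":
    # Three consecutive lines copied from the log above.
    sample = [
        "INFO 08-12 13:30:43 [loggers.py:118] Engine 000: Avg prompt throughput: 414.7 tokens/s, Avg generation throughput: 31.4 tokens/s, Running: 4 reqs, Waiting: 1 reqs, GPU KV cache usage: 96.3%, Prefix cache hit rate: 61.5%",
        "INFO 08-12 13:30:53 [loggers.py:118] Engine 000: Avg prompt throughput: 292.0 tokens/s, Avg generation throughput: 50.7 tokens/s, Running: 4 reqs, Waiting: 2 reqs, GPU KV cache usage: 97.8%, Prefix cache hit rate: 61.4%",
        "INFO 08-12 13:31:03 [loggers.py:118] Engine 000: Avg prompt throughput: 630.2 tokens/s, Avg generation throughput: 28.2 tokens/s, Running: 5 reqs, Waiting: 0 reqs, GPU KV cache usage: 25.1%, Prefix cache hit rate: 61.2%",
    ]
    parsed = [s for s in (parse_line(line) for line in sample) if s is not None]
    for s in parsed:
        print(s.ts, s.running, s.waiting, s.kv_usage, s.hit_rate)
    print(kv_usage_jumps(parsed))
```

On these three lines, `kv_usage_jumps` returns `[('08-12 13:30:53', '08-12 13:31:03', -72.7)]`, the drop asked about in question 2. The sketch only locates such changes in the log; it does not explain their cause.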