How the Prefix cache hit rate works and how it is computed

Sending the same message to the inference service several times in a row (logs below), the Prefix cache hit rate comes out as 0, 1/2, 2/3, 3/4, 4/5, a very regular pattern. Please explain how the Prefix cache hit rate works and how it is computed.

INFO 07-18 09:42:17 [chat_utils.py:397] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
INFO 07-18 09:42:17 [logger.py:39] Received request chatcmpl-6575a190105c45a0a33ff6c33a4e6623: prompt: '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\n给我写一篇100字的春游作文<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.05, temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=59960, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO:     10.45.230.17:65109 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 07-18 09:42:17 [async_llm.py:252] Added request chatcmpl-6575a190105c45a0a33ff6c33a4e6623.
INFO 07-18 09:42:27 [loggers.py:111] Engine 000: Avg prompt throughput: 4.0 tokens/s, Avg generation throughput: 14.8 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 07-18 09:42:37 [loggers.py:111] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 07-18 09:42:43 [logger.py:39] Received request chatcmpl-ecdfeaeff8a3453ba779437f5f335819: prompt: '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\n给我写一篇100字的春游作文<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.05, temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=59960, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO:     10.45.230.17:65197 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 07-18 09:42:43 [async_llm.py:252] Added request chatcmpl-ecdfeaeff8a3453ba779437f5f335819.
INFO 07-18 09:42:47 [loggers.py:111] Engine 000: Avg prompt throughput: 4.0 tokens/s, Avg generation throughput: 11.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 50.0%
INFO 07-18 09:42:53 [logger.py:39] Received request chatcmpl-8b8c1968f1a84f32843ab185c3e533a4: prompt: '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\n给我写一篇100字的春游作文<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.05, temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=59960, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO:     10.45.230.17:65228 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 07-18 09:42:53 [async_llm.py:252] Added request chatcmpl-8b8c1968f1a84f32843ab185c3e533a4.
INFO 07-18 09:42:57 [loggers.py:111] Engine 000: Avg prompt throughput: 4.0 tokens/s, Avg generation throughput: 14.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 66.7%
INFO 07-18 09:43:04 [logger.py:39] Received request chatcmpl-787f000290a44ae18b97c8e9b680ff66: prompt: '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\n给我写一篇100字的春游作文<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.05, temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=59960, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO:     10.45.230.17:65261 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 07-18 09:43:04 [async_llm.py:252] Added request chatcmpl-787f000290a44ae18b97c8e9b680ff66.
INFO 07-18 09:43:07 [loggers.py:111] Engine 000: Avg prompt throughput: 4.0 tokens/s, Avg generation throughput: 7.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 75.0%
INFO 07-18 09:43:11 [logger.py:39] Received request chatcmpl-7921f330a5934e0481daee2fc10602d1: prompt: '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\n给我写一篇100字的春游作文<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.05, temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=59960, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO:     10.45.230.17:65261 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 07-18 09:43:11 [async_llm.py:252] Added request chatcmpl-7921f330a5934e0481daee2fc10602d1.
INFO 07-18 09:43:17 [loggers.py:111] Engine 000: Avg prompt throughput: 4.0 tokens/s, Avg generation throughput: 17.8 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 80.0%
INFO 07-18 09:43:20 [logger.py:39] Received request chatcmpl-7225cc83a10440a0ad735c4ef0a2f33a: prompt: '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\n给我写一篇100字的春游作文<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.05, temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=59960, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO:     10.45.230.17:65261 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 07-18 09:43:20 [async_llm.py:252] Added request chatcmpl-7225cc83a10440a0ad735c4ef0a2f33a.
INFO 07-18 09:43:27 [loggers.py:111] Engine 000: Avg prompt throughput: 4.0 tokens/s, Avg generation throughput: 11.8 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 83.3%
INFO 07-18 09:43:37 [loggers.py:111] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 83.3%
INFO 07-18 09:43:54 [logger.py:39] Received request chatcmpl-4a10fdc5f9824177af871c1384e832f9: prompt: '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\n给我写一篇100字的春游作文<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.05, temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=59960, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO:     10.45.230.17:65421 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 07-18 09:43:54 [async_llm.py:252] Added request chatcmpl-4a10fdc5f9824177af871c1384e832f9.
INFO 07-18 09:43:57 [loggers.py:111] Engine 000: Avg prompt throughput: 4.0 tokens/s, Avg generation throughput: 7.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 85.7%
INFO 07-18 09:44:07 [loggers.py:111] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 6.2 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 85.7%
INFO 07-18 09:44:17 [loggers.py:111] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 85.7%

Prefix cache hit rate measures what fraction of the prompt tokens queried against the KV cache could be reused directly instead of being recomputed. The key point is that the logged value is cumulative since engine start: hit rate = (total cached tokens hit) / (total tokens queried), summed over all requests so far. With the same message sent repeatedly, the first request hits nothing (0); the second hits its entire cacheable prefix, bringing the running total to 1/2; the third brings it to 2/3; and in general the k-th identical request brings it to (k-1)/k, which is exactly the 0%, 50.0%, 66.7%, 75.0%, 80.0%, 83.3%, 85.7% pattern in the log. Under the hood, vLLM's prefix cache stores KV data in fixed-size blocks; only full blocks are hashed and cached, and later requests with a matching prefix reuse those blocks directly, skipping that part of prefill and speeding up inference. For details, see the official documentation on Automatic Prefix Caching and the Prefix Cache metrics.
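The arithmetic behind this cumulative counter can be sketched in a few lines of Python (a toy model for illustration, not vLLM's actual implementation; the 40-token cacheable prompt length is an assumption):

```python
# Toy model of a cumulative prefix-cache hit-rate counter.
# Illustration only -- not vLLM's actual implementation.
class PrefixCacheStats:
    def __init__(self) -> None:
        self.hits = 0     # cumulative tokens served from cache
        self.queries = 0  # cumulative tokens looked up

    def record(self, hit_tokens: int, query_tokens: int) -> None:
        self.hits += hit_tokens
        self.queries += query_tokens

    @property
    def hit_rate(self) -> float:
        return self.hits / self.queries if self.queries else 0.0

stats = PrefixCacheStats()
n = 40  # assumed cacheable prompt length in tokens (same prompt every time)
rates = []
for k in range(7):
    # the first request misses everything; repeats hit the whole cached prefix
    stats.record(hit_tokens=0 if k == 0 else n, query_tokens=n)
    rates.append(round(stats.hit_rate, 3))
print(rates)  # [0.0, 0.5, 0.667, 0.75, 0.8, 0.833, 0.857]
```

The printed sequence reproduces the values in the log above: after k identical requests the cumulative rate is (k-1)/k.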

Let me know if you would like a more detailed walkthrough of the mechanism, the source-code flow, or a diagram of the cache structure.

If the message is completely identical across requests, why does the prefix cache hit rate differ each time? If "only full blocks are cached", then even if later requests do not reach 100%, shouldn't the rate at least stay constant? Instead it keeps climbing.

When the same message is sent repeatedly, the per-request behavior is in fact constant after the first request: every repeat hits the same set of cached full blocks, while the trailing partial block (which is never hashed or cached) is recomputed each time. What climbs is the logged metric, because it is a cumulative average over all requests since engine start. The first request contributes only misses; each later identical request contributes (essentially) all hits, so the running average rises as 0, 1/2, 2/3, 3/4, ..., i.e. (k-1)/k after k requests, approaching but never reaching 100% (see Automatic Prefix Caching in the vLLM docs).

So the gradual rise is an artifact of cumulative averaging rather than of blocks being filled in incrementally; it does not mean each request reuses more of the cache than the last. As long as the request content is fully identical, the sequence is deterministic and reproducible, with no random fluctuation; the rate only drops or resets when cached blocks are evicted, the cache is cleared, or the engine restarts and the counters reset (see Automatic Prefix Caching in the vLLM docs). I can go into the source code or provide a flow diagram if helpful.
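The distinction can be made concrete with a short sketch contrasting the per-request hit fraction against the cumulative rate the engine logs (the block size of 16 matches vLLM's default; the assumption that the prompt spans exactly two full blocks is illustrative):

```python
# Per-request hit fraction vs. the cumulative rate the engine logs.
# Only full blocks are hashed, so the prompt's trailing partial block
# never produces a hit; here we count only the lookup-able tokens.
BLOCK = 16                        # assumed KV-cache block size
FULL_BLOCKS = 2                   # assumed: prompt covers 2 full blocks (+ remainder)
QUERY = FULL_BLOCKS * BLOCK       # tokens queryable against the cache per request

hits = queries = 0
per_request, cumulative = [], []
for k in range(1, 6):
    hit = 0 if k == 1 else QUERY  # identical prompt: every full block hits after warm-up
    hits, queries = hits + hit, queries + QUERY
    per_request.append(hit / QUERY)
    cumulative.append(round(hits / queries, 3))

print(per_request)  # [0.0, 1.0, 1.0, 1.0, 1.0]   -- constant after the first request
print(cumulative)   # [0.0, 0.5, 0.667, 0.75, 0.8] -- the rising sequence in the log
```

The per-request fraction stabilizes immediately after the first request; only the running average climbs.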
