How the Prefix cache hit rate works and how it is computed

Sending the same message to the inference service several times in a row (logs below), the Prefix cache hit rate comes out as 0, 1/2, 2/3, 3/4, 4/5, a very regular pattern. Please explain how the Prefix cache hit rate works and how it is computed.

INFO 07-18 09:42:17 [chat_utils.py:397] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
INFO 07-18 09:42:17 [logger.py:39] Received request chatcmpl-6575a190105c45a0a33ff6c33a4e6623: prompt: '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\n给我写一篇100字的春游作文<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.05, temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=59960, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO:     10.45.230.17:65109 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 07-18 09:42:17 [async_llm.py:252] Added request chatcmpl-6575a190105c45a0a33ff6c33a4e6623.
INFO 07-18 09:42:27 [loggers.py:111] Engine 000: Avg prompt throughput: 4.0 tokens/s, Avg generation throughput: 14.8 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 07-18 09:42:37 [loggers.py:111] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 07-18 09:42:43 [logger.py:39] Received request chatcmpl-ecdfeaeff8a3453ba779437f5f335819: prompt: '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\n给我写一篇100字的春游作文<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.05, temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=59960, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO:     10.45.230.17:65197 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 07-18 09:42:43 [async_llm.py:252] Added request chatcmpl-ecdfeaeff8a3453ba779437f5f335819.
INFO 07-18 09:42:47 [loggers.py:111] Engine 000: Avg prompt throughput: 4.0 tokens/s, Avg generation throughput: 11.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 50.0%
INFO 07-18 09:42:53 [logger.py:39] Received request chatcmpl-8b8c1968f1a84f32843ab185c3e533a4: prompt: '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\n给我写一篇100字的春游作文<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.05, temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=59960, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO:     10.45.230.17:65228 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 07-18 09:42:53 [async_llm.py:252] Added request chatcmpl-8b8c1968f1a84f32843ab185c3e533a4.
INFO 07-18 09:42:57 [loggers.py:111] Engine 000: Avg prompt throughput: 4.0 tokens/s, Avg generation throughput: 14.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 66.7%
INFO 07-18 09:43:04 [logger.py:39] Received request chatcmpl-787f000290a44ae18b97c8e9b680ff66: prompt: '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\n给我写一篇100字的春游作文<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.05, temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=59960, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO:     10.45.230.17:65261 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 07-18 09:43:04 [async_llm.py:252] Added request chatcmpl-787f000290a44ae18b97c8e9b680ff66.
INFO 07-18 09:43:07 [loggers.py:111] Engine 000: Avg prompt throughput: 4.0 tokens/s, Avg generation throughput: 7.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 75.0%
INFO 07-18 09:43:11 [logger.py:39] Received request chatcmpl-7921f330a5934e0481daee2fc10602d1: prompt: '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\n给我写一篇100字的春游作文<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.05, temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=59960, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO:     10.45.230.17:65261 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 07-18 09:43:11 [async_llm.py:252] Added request chatcmpl-7921f330a5934e0481daee2fc10602d1.
INFO 07-18 09:43:17 [loggers.py:111] Engine 000: Avg prompt throughput: 4.0 tokens/s, Avg generation throughput: 17.8 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 80.0%
INFO 07-18 09:43:20 [logger.py:39] Received request chatcmpl-7225cc83a10440a0ad735c4ef0a2f33a: prompt: '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\n给我写一篇100字的春游作文<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.05, temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=59960, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO:     10.45.230.17:65261 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 07-18 09:43:20 [async_llm.py:252] Added request chatcmpl-7225cc83a10440a0ad735c4ef0a2f33a.
INFO 07-18 09:43:27 [loggers.py:111] Engine 000: Avg prompt throughput: 4.0 tokens/s, Avg generation throughput: 11.8 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 83.3%
INFO 07-18 09:43:37 [loggers.py:111] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 83.3%
INFO 07-18 09:43:54 [logger.py:39] Received request chatcmpl-4a10fdc5f9824177af871c1384e832f9: prompt: '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\n给我写一篇100字的春游作文<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.05, temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=59960, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO:     10.45.230.17:65421 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 07-18 09:43:54 [async_llm.py:252] Added request chatcmpl-4a10fdc5f9824177af871c1384e832f9.
INFO 07-18 09:43:57 [loggers.py:111] Engine 000: Avg prompt throughput: 4.0 tokens/s, Avg generation throughput: 7.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 85.7%
INFO 07-18 09:44:07 [loggers.py:111] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 6.2 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 85.7%
INFO 07-18 09:44:17 [loggers.py:111] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 85.7%

Prefix cache hit rate measures what fraction of the prompt tokens queried against the KV cache could be reused directly instead of being recomputed. The key point is that the logged value is cumulative since engine start: hit rate = (total cached tokens hit) / (total tokens queried), summed over all requests so far. With the same message sent repeatedly, the first request hits nothing (0); the second hits its entire cacheable prefix, bringing the running total to 1/2; the third brings it to 2/3; and in general the k-th identical request brings it to (k-1)/k, which is exactly the 0%, 50.0%, 66.7%, 75.0%, 80.0%, 83.3%, 85.7% pattern in the log. Under the hood, vLLM's prefix cache stores KV data in fixed-size blocks; only full blocks are hashed and cached, and later requests with a matching prefix reuse those blocks directly, skipping that part of prefill and speeding up inference. For details, see the official documentation on Automatic Prefix Caching and the Prefix Cache metrics.
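The arithmetic behind this cumulative counter can be sketched in a few lines of Python (a toy model for illustration, not vLLM's actual implementation; the 40-token cacheable prompt length is an assumption):

```python
# Toy model of a cumulative prefix-cache hit-rate counter.
# Illustration only -- not vLLM's actual implementation.
class PrefixCacheStats:
    def __init__(self) -> None:
        self.hits = 0     # cumulative tokens served from cache
        self.queries = 0  # cumulative tokens looked up

    def record(self, hit_tokens: int, query_tokens: int) -> None:
        self.hits += hit_tokens
        self.queries += query_tokens

    @property
    def hit_rate(self) -> float:
        return self.hits / self.queries if self.queries else 0.0

stats = PrefixCacheStats()
n = 40  # assumed cacheable prompt length in tokens (same prompt every time)
rates = []
for k in range(7):
    # the first request misses everything; repeats hit the whole cached prefix
    stats.record(hit_tokens=0 if k == 0 else n, query_tokens=n)
    rates.append(round(stats.hit_rate, 3))
print(rates)  # [0.0, 0.5, 0.667, 0.75, 0.8, 0.833, 0.857]
```

The printed sequence reproduces the values in the log above: after k identical requests the cumulative rate is (k-1)/k.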

Let me know if you would like a more detailed walkthrough of the mechanism, the source-code flow, or a diagram of the cache structure.

If the message is completely identical across requests, why does the prefix cache hit rate differ each time? If "only full blocks are cached", then even if later requests do not reach 100%, shouldn't the rate at least stay constant? Instead it keeps climbing.

When the same message is sent repeatedly, the per-request behavior is in fact constant after the first request: every repeat hits the same set of cached full blocks, while the trailing partial block (which is never hashed or cached) is recomputed each time. What climbs is the logged metric, because it is a cumulative average over all requests since engine start. The first request contributes only misses; each later identical request contributes (essentially) all hits, so the running average rises as 0, 1/2, 2/3, 3/4, ..., i.e. (k-1)/k after k requests, approaching but never reaching 100% (see Automatic Prefix Caching in the vLLM docs).

So the gradual rise is an artifact of cumulative averaging rather than of blocks being filled in incrementally; it does not mean each request reuses more of the cache than the last. As long as the request content is fully identical, the sequence is deterministic and reproducible, with no random fluctuation; the rate only drops or resets when cached blocks are evicted, the cache is cleared, or the engine restarts and the counters reset (see Automatic Prefix Caching in the vLLM docs). I can go into the source code or provide a flow diagram if helpful.
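The distinction can be made concrete with a short sketch contrasting the per-request hit fraction against the cumulative rate the engine logs (the block size of 16 matches vLLM's default; the assumption that the prompt spans exactly two full blocks is illustrative):

```python
# Per-request hit fraction vs. the cumulative rate the engine logs.
# Only full blocks are hashed, so the prompt's trailing partial block
# never produces a hit; here we count only the lookup-able tokens.
BLOCK = 16                        # assumed KV-cache block size
FULL_BLOCKS = 2                   # assumed: prompt covers 2 full blocks (+ remainder)
QUERY = FULL_BLOCKS * BLOCK       # tokens queryable against the cache per request

hits = queries = 0
per_request, cumulative = [], []
for k in range(1, 6):
    hit = 0 if k == 1 else QUERY  # identical prompt: every full block hits after warm-up
    hits, queries = hits + hit, queries + QUERY
    per_request.append(hit / QUERY)
    cumulative.append(round(hits / queries, 3))

print(per_request)  # [0.0, 1.0, 1.0, 1.0, 1.0]   -- constant after the first request
print(cumulative)   # [0.0, 0.5, 0.667, 0.75, 0.8] -- the rising sequence in the log
```

The per-request fraction stabilizes immediately after the first request; only the running average climbs.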
