When I send the same message to the inference service several times in a row (logs below), the reported Prefix cache hit rate comes out as 0, 1/2, 2/3, 3/4, 4/5 — a very regular pattern. Could someone explain how the Prefix cache hit rate works and how this number is calculated?
INFO 07-18 09:42:17 [chat_utils.py:397] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
INFO 07-18 09:42:17 [logger.py:39] Received request chatcmpl-6575a190105c45a0a33ff6c33a4e6623: prompt: '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\n给我写一篇100字的春游作文<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.05, temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=59960, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO: 10.45.230.17:65109 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 07-18 09:42:17 [async_llm.py:252] Added request chatcmpl-6575a190105c45a0a33ff6c33a4e6623.
INFO 07-18 09:42:27 [loggers.py:111] Engine 000: Avg prompt throughput: 4.0 tokens/s, Avg generation throughput: 14.8 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 07-18 09:42:37 [loggers.py:111] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 07-18 09:42:43 [logger.py:39] Received request chatcmpl-ecdfeaeff8a3453ba779437f5f335819: prompt: '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\n给我写一篇100字的春游作文<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.05, temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=59960, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO: 10.45.230.17:65197 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 07-18 09:42:43 [async_llm.py:252] Added request chatcmpl-ecdfeaeff8a3453ba779437f5f335819.
INFO 07-18 09:42:47 [loggers.py:111] Engine 000: Avg prompt throughput: 4.0 tokens/s, Avg generation throughput: 11.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 50.0%
INFO 07-18 09:42:53 [logger.py:39] Received request chatcmpl-8b8c1968f1a84f32843ab185c3e533a4: prompt: '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\n给我写一篇100字的春游作文<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.05, temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=59960, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO: 10.45.230.17:65228 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 07-18 09:42:53 [async_llm.py:252] Added request chatcmpl-8b8c1968f1a84f32843ab185c3e533a4.
INFO 07-18 09:42:57 [loggers.py:111] Engine 000: Avg prompt throughput: 4.0 tokens/s, Avg generation throughput: 14.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 66.7%
INFO 07-18 09:43:04 [logger.py:39] Received request chatcmpl-787f000290a44ae18b97c8e9b680ff66: prompt: '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\n给我写一篇100字的春游作文<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.05, temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=59960, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO: 10.45.230.17:65261 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 07-18 09:43:04 [async_llm.py:252] Added request chatcmpl-787f000290a44ae18b97c8e9b680ff66.
INFO 07-18 09:43:07 [loggers.py:111] Engine 000: Avg prompt throughput: 4.0 tokens/s, Avg generation throughput: 7.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 75.0%
INFO 07-18 09:43:11 [logger.py:39] Received request chatcmpl-7921f330a5934e0481daee2fc10602d1: prompt: '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\n给我写一篇100字的春游作文<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.05, temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=59960, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO: 10.45.230.17:65261 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 07-18 09:43:11 [async_llm.py:252] Added request chatcmpl-7921f330a5934e0481daee2fc10602d1.
INFO 07-18 09:43:17 [loggers.py:111] Engine 000: Avg prompt throughput: 4.0 tokens/s, Avg generation throughput: 17.8 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 80.0%
INFO 07-18 09:43:20 [logger.py:39] Received request chatcmpl-7225cc83a10440a0ad735c4ef0a2f33a: prompt: '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\n给我写一篇100字的春游作文<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.05, temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=59960, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO: 10.45.230.17:65261 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 07-18 09:43:20 [async_llm.py:252] Added request chatcmpl-7225cc83a10440a0ad735c4ef0a2f33a.
INFO 07-18 09:43:27 [loggers.py:111] Engine 000: Avg prompt throughput: 4.0 tokens/s, Avg generation throughput: 11.8 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 83.3%
INFO 07-18 09:43:37 [loggers.py:111] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 83.3%
INFO 07-18 09:43:54 [logger.py:39] Received request chatcmpl-4a10fdc5f9824177af871c1384e832f9: prompt: '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>user\n给我写一篇100字的春游作文<|im_end|>\n<|im_start|>assistant\n', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.05, temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=59960, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, lora_request: None, prompt_adapter_request: None.
INFO: 10.45.230.17:65421 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 07-18 09:43:54 [async_llm.py:252] Added request chatcmpl-4a10fdc5f9824177af871c1384e832f9.
INFO 07-18 09:43:57 [loggers.py:111] Engine 000: Avg prompt throughput: 4.0 tokens/s, Avg generation throughput: 7.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 85.7%
INFO 07-18 09:44:07 [loggers.py:111] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 6.2 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 85.7%
INFO 07-18 09:44:17 [loggers.py:111] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 85.7%
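The 0, 1/2, 2/3, 3/4, 4/5, 5/6, 6/7 sequence in the logs is exactly what a *cumulative* hits/queries counter produces when the first request misses every block and each repeat request re-hits all of the blocks it looks up. Here is a minimal sketch of that hypothesis — the class, counter names, and per-block bookkeeping are illustrative assumptions, not vLLM's actual internals; only the block size (16 tokens by default) and the block-granularity lookup match vLLM's documented behavior:

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (vLLM's default)

class PrefixCacheStats:
    """Cumulative hit-rate counter: hits / queries over ALL requests so far."""

    def __init__(self):
        self.queries = 0  # total full blocks looked up across all requests
        self.hits = 0     # total blocks that were already cached

    def record_request(self, prompt_len, cache):
        # Prefix caching works at block granularity: only full blocks
        # are cached and looked up; the trailing partial block is not.
        n_blocks = prompt_len // BLOCK_SIZE
        self.queries += n_blocks
        for i in range(n_blocks):
            if i in cache:
                self.hits += 1   # block already cached by an earlier request
            else:
                cache.add(i)     # first request populates the cache

    @property
    def hit_rate(self):
        return self.hits / self.queries if self.queries else 0.0

# Replay 7 identical ~40-token prompts (2 full blocks each):
stats, cache = PrefixCacheStats(), set()
for n in range(1, 8):
    stats.record_request(prompt_len=40, cache=cache)
    print(f"after request {n}: hit rate = {stats.hit_rate:.1%}")
# → 0.0%, 50.0%, 66.7%, 75.0%, 80.0%, 83.3%, 85.7%
```

Under this model the rate after n identical requests is (n−1)·B / (n·B) = (n−1)/n, independent of the number of blocks B per prompt, which matches the logged 0.0% → 50.0% → 66.7% → 75.0% → 80.0% → 83.3% → 85.7% progression.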