Benchmark initial single prompt test

When using the benchmark to test the model server with the request count set to 1, does it actually send 2 requests?

Starting initial single prompt test run…
Initial test run completed. Starting main benchmark run…

Yet the model server log shows two "Received request" entries.

Yes. When benchmarking with vLLM's benchmark_serving.py or `vllm bench serve`, even if you set the request count to 1 (e.g. `--num-prompts 1`), the script first sends a "warmup" request (the "initial single prompt test run") to verify that the service is reachable, and only then starts the main benchmark run. That is why the server log shows two "Received request" entries: one for the warmup and one for the actual benchmark request. According to issues discussing the vllm/benchmarks/serve.py source, this is expected behavior.
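For reference, the control flow is roughly the following. This is a minimal sketch with a stubbed `send_request`, not the real benchmark_serving.py implementation; the `received` list stands in for the server-side "Received request" log.

```python
import asyncio

received = []  # stands in for the server-side "Received request" log entries

async def send_request(prompt: str) -> str:
    """Stub for one POST to /v1/completions; the real client streams tokens."""
    received.append(prompt)
    await asyncio.sleep(0)  # yield control, as a real HTTP call would
    return f"completion for {prompt!r}"

async def benchmark(num_prompts: int) -> int:
    # 1) Warmup: "Starting initial single prompt test run..."
    await send_request("warmup prompt")
    # 2) Main run: "Initial test run completed. Starting main benchmark run..."
    await asyncio.gather(*(send_request(f"prompt {i}") for i in range(num_prompts)))
    return len(received)  # total requests the server saw

# With --num-prompts 1 the server still sees 2 requests in total.
total = asyncio.run(benchmark(num_prompts=1))
print(total)  # 2
```

So the request count you specify only controls the main run; the warmup request is always added on top of it.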

Need more details?

I then ran the benchmark with 10 requests at concurrency 1; the results are as follows:

root@77168b88cf64:/vllm-workspace# python3 /vllm-workspace/benchmarks/benchmark_serving.py \
    --backend openai \
    --model /data-new/models/Qwen2.5-72B \
    --served-model-name /data-new/models/Qwen2.5-72B \
    --endpoint /v1/completions \
    --port 8000 \
    --dataset_name random \
    --random-input-len 3500 \
    --random-output-len 1000 \
    --random-range-ratio 0.1 \
    --num-prompts 10 \
    --max-concurrency 1
INFO 10-11 16:28:40 [__init__.py:244] Automatically detected platform cuda.
Namespace(backend='openai', base_url=None, host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset_name='random', dataset_path=None, max_concurrency=1, model='/data-new/models/Qwen2.5-72B', tokenizer=None, use_beam_search=False, num_prompts=10, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, custom_output_len=256, custom_skip_chat_template=False, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=3500, random_output_len=1000, random_range_ratio=0.1, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name='/data-new/models/Qwen2.5-72B', lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|██████████████████████████████████████| 10/10 [10:49<00:00, 64.99s/it]
============ Serving Benchmark Result ============
Successful requests:                     10        
Benchmark duration (s):                  649.91    
Total input tokens:                      35350     
Total generated tokens:                  10113     
Request throughput (req/s):              0.02      
Output token throughput (tok/s):         15.56     
Total Token throughput (tok/s):          69.95     
---------------Time to First Token----------------
Mean TTFT (ms):                          4612.96   
Median TTFT (ms):                        6267.06   
P99 TTFT (ms):                           6994.26   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          59.76     
Median TPOT (ms):                        59.75     
P99 TPOT (ms):                           59.85     
---------------Inter-token Latency----------------
Mean ITL (ms):                           59.76     
Median ITL (ms):                         59.68     
P99 ITL (ms):                            61.28     
==================================================
root@77168b88cf64:/vllm-workspace# 
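As a sanity check, the headline throughput figures follow directly from the totals in the report above:

```python
# Totals copied from the "Serving Benchmark Result" table above
duration_s = 649.91
total_input_tokens = 35350
total_generated_tokens = 10113
num_requests = 10

req_throughput = num_requests / duration_s                               # req/s
output_tok_throughput = total_generated_tokens / duration_s              # tok/s
total_tok_throughput = (total_input_tokens + total_generated_tokens) / duration_s

print(round(req_throughput, 2))         # 0.02
print(round(output_tok_throughput, 2))  # 15.56
print(round(total_tok_throughput, 2))   # 69.95
```

Note that these averages are computed over the 10 main-run requests only; the warmup request is excluded from the report.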

I added logging on the model-server side to record each request's time to first token (TTFT) and total time. The results for the 11 requests are as follows:

WARNING 10-11 16:28:51 [serving_completion.py:417] Request cmpl-759323e8dcfa468cbf7b32cab4ba38b1 API TTFT: 7056.949ms (input_tokens: 3834)
WARNING 10-11 16:29:50 [serving_completion.py:448] Request cmpl-759323e8dcfa468cbf7b32cab4ba38b1 API Total Time: 66064.814ms (input_tokens: 3834)
WARNING 10-11 16:29:50 [serving_completion.py:417] Request cmpl-daa8bbbd1c69402ea1cd8df269691abf API TTFT: 119.560ms (input_tokens: 3834)
WARNING 10-11 16:30:49 [serving_completion.py:448] Request cmpl-daa8bbbd1c69402ea1cd8df269691abf API Total Time: 59090.572ms (input_tokens: 3834)
WARNING 10-11 16:30:56 [serving_completion.py:417] Request cmpl-54a9bcf377404b24a97866ea62180f3c API TTFT: 6829.576ms (input_tokens: 3709)
WARNING 10-11 16:31:58 [serving_completion.py:448] Request cmpl-54a9bcf377404b24a97866ea62180f3c API Total Time: 68875.758ms (input_tokens: 3709)
WARNING 10-11 16:32:05 [serving_completion.py:417] Request cmpl-cf0c275dfc464727931478248f447e96 API TTFT: 6958.034ms (input_tokens: 3779)
WARNING 10-11 16:33:02 [serving_completion.py:448] Request cmpl-cf0c275dfc464727931478248f447e96 API Total Time: 64188.592ms (input_tokens: 3779)
WARNING 10-11 16:33:08 [serving_completion.py:417] Request cmpl-8992d9cad99841eeade6d4e3152ebb95 API TTFT: 6164.267ms (input_tokens: 3342)
WARNING 10-11 16:34:14 [serving_completion.py:448] Request cmpl-8992d9cad99841eeade6d4e3152ebb95 API Total Time: 71388.889ms (input_tokens: 3342)
WARNING 10-11 16:34:20 [serving_completion.py:417] Request cmpl-5348defda0194fa5a87e3527910fb33b API TTFT: 6467.356ms (input_tokens: 3509)
WARNING 10-11 16:35:16 [serving_completion.py:448] Request cmpl-5348defda0194fa5a87e3527910fb33b API Total Time: 62501.807ms (input_tokens: 3509)
WARNING 10-11 16:35:22 [serving_completion.py:417] Request cmpl-62b72b14e9454c91aa888b6b01cd8060 API TTFT: 5809.234ms (input_tokens: 3159)
WARNING 10-11 16:36:21 [serving_completion.py:448] Request cmpl-62b72b14e9454c91aa888b6b01cd8060 API Total Time: 64662.743ms (input_tokens: 3159)
WARNING 10-11 16:36:27 [serving_completion.py:417] Request cmpl-e65eb9836e7c4b31abc1e5fe9c17ad0c API TTFT: 6286.287ms (input_tokens: 3427)
WARNING 10-11 16:37:31 [serving_completion.py:448] Request cmpl-e65eb9836e7c4b31abc1e5fe9c17ad0c API Total Time: 70402.079ms (input_tokens: 3427)
WARNING 10-11 16:37:38 [serving_completion.py:417] Request cmpl-450b48b7a2964df5b031dee30371ea71 API TTFT: 6925.364ms (input_tokens: 3749)
WARNING 10-11 16:38:37 [serving_completion.py:448] Request cmpl-450b48b7a2964df5b031dee30371ea71 API Total Time: 65959.251ms (input_tokens: 3749)
WARNING 10-11 16:38:37 [serving_completion.py:417] Request cmpl-4e76e908d8484a05bdfbf06d23d3f4bb API TTFT: 83.519ms (input_tokens: 3220)
WARNING 10-11 16:39:36 [serving_completion.py:448] Request cmpl-4e76e908d8484a05bdfbf06d23d3f4bb API Total Time: 58660.928ms (input_tokens: 3220)
WARNING 10-11 16:39:36 [serving_completion.py:417] Request cmpl-d297424f9c0b42a1b6ae7c17754887cd API TTFT: 95.742ms (input_tokens: 3622)
WARNING 10-11 16:40:40 [serving_completion.py:448] Request cmpl-d297424f9c0b42a1b6ae7c17754887cd API Total Time: 63779.735ms (input_tokens: 3622)
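The server-side instrumentation that produced the log lines above can be approximated with a small wrapper around the token stream. This is a hypothetical helper for illustration, not the actual patch to serving_completion.py:

```python
import time
import logging

logging.basicConfig(level=logging.WARNING, format="%(levelname)s %(message)s")
log = logging.getLogger("serving_completion")

def timed_stream(request_id, input_tokens, token_iter):
    """Yield tokens unchanged, logging TTFT at the first token and total time at the end."""
    start = time.perf_counter()
    first = True
    for tok in token_iter:
        if first:
            ttft_ms = (time.perf_counter() - start) * 1000
            log.warning("Request %s API TTFT: %.3fms (input_tokens: %d)",
                        request_id, ttft_ms, input_tokens)
            first = False
        yield tok
    total_ms = (time.perf_counter() - start) * 1000
    log.warning("Request %s API Total Time: %.3fms (input_tokens: %d)",
                request_id, total_ms, input_tokens)

# Usage with a fake token generator standing in for the model's output stream:
tokens = list(timed_stream("cmpl-demo", 3500, iter(["Hello", " world"])))
```

Because the wrapper only starts its clock when the stream is created, a TTFT measured this way covers queueing plus prefill on the server, but not client-side or network time.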

Of the 11 requests, the 1st is the warmup request; requests 2, 10, and 11 have very small TTFTs, while requests 3-9 have much larger TTFTs.

My questions are: 1) Why is there such a large variance? 2) Since the warmup request makes request 2's TTFT very small, can request 2's result be taken as representative of real performance?
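Pulling the TTFT values out of the server log above makes the split concrete (values copied verbatim from the log, in ms):

```python
from statistics import mean, median

# TTFT values for requests 1-11, taken from the server log above (ms)
ttft_ms = [7056.949, 119.560, 6829.576, 6958.034, 6164.267, 6467.356,
           5809.234, 6286.287, 6925.364, 83.519, 95.742]

fast = [t for t in ttft_ms if t < 200]   # requests 2, 10, 11
slow = [t for t in ttft_ms if t >= 200]  # requests 1 and 3-9

print(len(fast), len(slow))       # 3 8
print(round(mean(slow), 1))       # 6562.1 -- the ~6-7 s group
print(round(median(ttft_ms), 1))  # 6286.3 -- median dominated by the slow group
```

The two groups differ by roughly 60x, so any mean over all requests mixes two very different regimes.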