When benchmarking the model service with the request count set to 1, are 2 requests actually issued?
Starting initial single prompt test run…
Initial test run completed. Starting main benchmark run…
Yet the model server's log contains two "Received request" entries.
Yes. When benchmarking with vLLM's benchmark_serving.py (or vllm bench serve), even if you specify a request count of 1 (e.g. --num-prompts 1), the script first sends a "warm-up" request (the "initial single prompt test run") to verify the service is reachable, and only then issues the main benchmark request. That is why the server log shows two "Received request" entries: one for the warm-up and one for the actual benchmark request. According to the vllm/benchmarks/serve.py source and related issue discussions, this is expected behavior.
Need more details?
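The warm-up-then-benchmark flow can be illustrated with a minimal sketch (this is an illustration of the pattern, not vLLM's actual implementation; the prompts and function names are made up):

```python
# Minimal sketch of the warm-up-then-benchmark pattern: one probe
# request runs first, so the server always sees num_prompts + 1
# requests in total.
import asyncio


async def send_request(server_log: list, prompt: str) -> None:
    # Stand-in for an HTTP POST to /v1/completions.
    server_log.append(f"Received request: {prompt}")


async def run_benchmark(num_prompts: int) -> list:
    server_log = []
    # "Starting initial single prompt test run..."
    await send_request(server_log, "warm-up prompt")
    # "Initial test run completed. Starting main benchmark run..."
    for i in range(num_prompts):
        await send_request(server_log, f"benchmark prompt {i}")
    return server_log


log = asyncio.run(run_benchmark(num_prompts=1))
print(len(log))  # 2: the warm-up plus the single benchmark request
```

With --num-prompts 1, only the main request is counted in the reported metrics, but the server-side log records both arrivals.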
Benchmarking with 10 requests at a concurrency of 1 produced the following results:
root@77168b88cf64:/vllm-workspace# python3 /vllm-workspace/benchmarks/benchmark_serving.py --backend openai --model /data-new/models/Qwen2.5-72B --served-model-name /data-new/models/Qwen2.5-72B --endpoint /v1/completions --port 8000 --dataset_name random --random-input-len 3500 --random-output-len 1000 --random-range-ratio 0.1 --num-prompts 10 --max-concurrency 1
INFO 10-11 16:28:40 [__init__.py:244] Automatically detected platform cuda.
Namespace(backend='openai', base_url=None, host='127.0.0.1', port=8000, endpoint='/v1/completions', dataset_name='random', dataset_path=None, max_concurrency=1, model='/data-new/models/Qwen2.5-72B', tokenizer=None, use_beam_search=False, num_prompts=10, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, custom_output_len=256, custom_skip_chat_template=False, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=3500, random_output_len=1000, random_range_ratio=0.1, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name='/data-new/models/Qwen2.5-72B', lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|██████████████████████████████████████| 10/10 [10:49<00:00, 64.99s/it]
============ Serving Benchmark Result ============
Successful requests: 10
Benchmark duration (s): 649.91
Total input tokens: 35350
Total generated tokens: 10113
Request throughput (req/s): 0.02
Output token throughput (tok/s): 15.56
Total Token throughput (tok/s): 69.95
---------------Time to First Token----------------
Mean TTFT (ms): 4612.96
Median TTFT (ms): 6267.06
P99 TTFT (ms): 6994.26
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 59.76
Median TPOT (ms): 59.75
P99 TPOT (ms): 59.85
---------------Inter-token Latency----------------
Mean ITL (ms): 59.76
Median ITL (ms): 59.68
P99 ITL (ms): 61.28
==================================================
root@77168b88cf64:/vllm-workspace#
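The summary numbers above are internally consistent: each throughput figure follows directly from the token counts and the benchmark duration. A quick check:

```python
# Sanity check on the benchmark summary: recompute the reported
# throughputs from the token counts and duration printed above.
duration_s = 649.91
num_requests = 10
input_tokens = 35350
output_tokens = 10113

req_throughput = num_requests / duration_s
output_tps = output_tokens / duration_s
total_tps = (input_tokens + output_tokens) / duration_s

print(f"{req_throughput:.2f}")  # 0.02 req/s, matches the report
print(f"{output_tps:.2f}")      # 15.56 tok/s
print(f"{total_tps:.2f}")       # 69.95 tok/s
```

Note also that with --max-concurrency 1 the requests run strictly one after another, so the benchmark duration is essentially the sum of the per-request latencies.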
Logging was added on the model-server side to record each request's time to first token and total latency. The results for the 11 requests are:
WARNING 10-11 16:28:51 [serving_completion.py:417] Request cmpl-759323e8dcfa468cbf7b32cab4ba38b1 API TTFT: 7056.949ms (input_tokens: 3834)
WARNING 10-11 16:29:50 [serving_completion.py:448] Request cmpl-759323e8dcfa468cbf7b32cab4ba38b1 API Total Time: 66064.814ms (input_tokens: 3834)
WARNING 10-11 16:29:50 [serving_completion.py:417] Request cmpl-daa8bbbd1c69402ea1cd8df269691abf API TTFT: 119.560ms (input_tokens: 3834)
WARNING 10-11 16:30:49 [serving_completion.py:448] Request cmpl-daa8bbbd1c69402ea1cd8df269691abf API Total Time: 59090.572ms (input_tokens: 3834)
WARNING 10-11 16:30:56 [serving_completion.py:417] Request cmpl-54a9bcf377404b24a97866ea62180f3c API TTFT: 6829.576ms (input_tokens: 3709)
WARNING 10-11 16:31:58 [serving_completion.py:448] Request cmpl-54a9bcf377404b24a97866ea62180f3c API Total Time: 68875.758ms (input_tokens: 3709)
WARNING 10-11 16:32:05 [serving_completion.py:417] Request cmpl-cf0c275dfc464727931478248f447e96 API TTFT: 6958.034ms (input_tokens: 3779)
WARNING 10-11 16:33:02 [serving_completion.py:448] Request cmpl-cf0c275dfc464727931478248f447e96 API Total Time: 64188.592ms (input_tokens: 3779)
WARNING 10-11 16:33:08 [serving_completion.py:417] Request cmpl-8992d9cad99841eeade6d4e3152ebb95 API TTFT: 6164.267ms (input_tokens: 3342)
WARNING 10-11 16:34:14 [serving_completion.py:448] Request cmpl-8992d9cad99841eeade6d4e3152ebb95 API Total Time: 71388.889ms (input_tokens: 3342)
WARNING 10-11 16:34:20 [serving_completion.py:417] Request cmpl-5348defda0194fa5a87e3527910fb33b API TTFT: 6467.356ms (input_tokens: 3509)
WARNING 10-11 16:35:16 [serving_completion.py:448] Request cmpl-5348defda0194fa5a87e3527910fb33b API Total Time: 62501.807ms (input_tokens: 3509)
WARNING 10-11 16:35:22 [serving_completion.py:417] Request cmpl-62b72b14e9454c91aa888b6b01cd8060 API TTFT: 5809.234ms (input_tokens: 3159)
WARNING 10-11 16:36:21 [serving_completion.py:448] Request cmpl-62b72b14e9454c91aa888b6b01cd8060 API Total Time: 64662.743ms (input_tokens: 3159)
WARNING 10-11 16:36:27 [serving_completion.py:417] Request cmpl-e65eb9836e7c4b31abc1e5fe9c17ad0c API TTFT: 6286.287ms (input_tokens: 3427)
WARNING 10-11 16:37:31 [serving_completion.py:448] Request cmpl-e65eb9836e7c4b31abc1e5fe9c17ad0c API Total Time: 70402.079ms (input_tokens: 3427)
WARNING 10-11 16:37:38 [serving_completion.py:417] Request cmpl-450b48b7a2964df5b031dee30371ea71 API TTFT: 6925.364ms (input_tokens: 3749)
WARNING 10-11 16:38:37 [serving_completion.py:448] Request cmpl-450b48b7a2964df5b031dee30371ea71 API Total Time: 65959.251ms (input_tokens: 3749)
WARNING 10-11 16:38:37 [serving_completion.py:417] Request cmpl-4e76e908d8484a05bdfbf06d23d3f4bb API TTFT: 83.519ms (input_tokens: 3220)
WARNING 10-11 16:39:36 [serving_completion.py:448] Request cmpl-4e76e908d8484a05bdfbf06d23d3f4bb API Total Time: 58660.928ms (input_tokens: 3220)
WARNING 10-11 16:39:36 [serving_completion.py:417] Request cmpl-d297424f9c0b42a1b6ae7c17754887cd API TTFT: 95.742ms (input_tokens: 3622)
WARNING 10-11 16:40:40 [serving_completion.py:448] Request cmpl-d297424f9c0b42a1b6ae7c17754887cd API Total Time: 63779.735ms (input_tokens: 3622)
Of the 11 requests, the first is the warm-up request. Requests 2, 10, and 11 have very small TTFTs, while requests 3 through 9 have much larger TTFTs.
The questions are: 1) Why is the variance so large? 2) The warm-up request makes request 2's TTFT very small, so can request 2's result reflect real-world behavior?
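To quantify the split between the small and large TTFTs, the server log can be parsed with a quick sketch (the "API TTFT" log format here is the custom addition described above, not stock vLLM output; the sample lines are abbreviated):

```python
# Parse server-side TTFT log lines and split them into a "warm"
# (sub-second) and "cold" (multi-second) group to expose the
# bimodal distribution seen in the log above.
import re

log_lines = [
    "... Request cmpl-759 API TTFT: 7056.949ms (input_tokens: 3834)",
    "... Request cmpl-daa API TTFT: 119.560ms (input_tokens: 3834)",
    "... Request cmpl-54a API TTFT: 6829.576ms (input_tokens: 3709)",
]

ttfts = [float(m.group(1))
         for line in log_lines
         if (m := re.search(r"TTFT: ([\d.]+)ms", line))]

warm = [t for t in ttfts if t < 1000]   # likely overlapped with prior work
cold = [t for t in ttfts if t >= 1000]  # full prefill paid up front
print(len(warm), len(cold))  # 1 2
```

Splitting the 11 real requests this way makes the pattern explicit: 3 warm TTFTs (requests 2, 10, 11) versus 8 cold ones, which is the asymmetry the two questions above are about.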