Results of the first method: all 10 runs are stable, with TTFT around 100 ms.
root@2f77277da063:/vllm-workspace# for i in {1..10}; do
> python3 /vllm-workspace/benchmarks/benchmark_serving.py --backend openai-chat --model /data/models/Qwen2.5-72B --served-model-name Qwen2.5-72B --endpoint /v1/chat/completions --port 8080 --dataset_name random --random-input-len 7000 --random-output-len 3000 --random-range-ratio 0.1 --num-prompts 1 --max-concurrency 1
> done
INFO 06-12 01:08:18 [__init__.py:239] Automatically detected platform cuda.
Namespace(backend='openai-chat', base_url=None, host='127.0.0.1', port=8080, endpoint='/v1/chat/completions', dataset_name='random', dataset_path=None, max_concurrency=1, model='/data/models/Qwen2.5-72B', tokenizer=None, use_beam_search=False, num_prompts=1, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=7000, random_output_len=3000, random_range_ratio=0.1, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name='Qwen2.5-72B', lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00, 3.48s/it]
============ Serving Benchmark Result ============
Successful requests: 1
Benchmark duration (s): 3.48
Total input tokens: 6984
Total generated tokens: 76
Request throughput (req/s): 0.29
Output token throughput (tok/s): 21.81
Total Token throughput (tok/s): 2026.39
---------------Time to First Token----------------
Mean TTFT (ms): 101.93
Median TTFT (ms): 101.93
P99 TTFT (ms): 101.93
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 45.07
Median TPOT (ms): 45.07
P99 TPOT (ms): 45.07
---------------Inter-token Latency----------------
Mean ITL (ms): 44.47
Median ITL (ms): 45.01
P99 ITL (ms): 46.96
==================================================
INFO 06-12 01:08:37 [__init__.py:239] Automatically detected platform cuda.
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00, 3.49s/it]
============ Serving Benchmark Result ============
Successful requests: 1
Benchmark duration (s): 3.49
Total input tokens: 6984
Total generated tokens: 76
Request throughput (req/s): 0.29
Output token throughput (tok/s): 21.77
Total Token throughput (tok/s): 2022.51
---------------Time to First Token----------------
Mean TTFT (ms): 106.18
Median TTFT (ms): 106.18
P99 TTFT (ms): 106.18
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 45.10
Median TPOT (ms): 45.10
P99 TPOT (ms): 45.10
---------------Inter-token Latency----------------
Mean ITL (ms): 44.51
Median ITL (ms): 44.95
P99 ITL (ms): 47.90
==================================================
INFO 06-12 01:08:50 [__init__.py:239] Automatically detected platform cuda.
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00, 3.48s/it]
============ Serving Benchmark Result ============
Successful requests: 1
Benchmark duration (s): 3.48
Total input tokens: 6984
Total generated tokens: 76
Request throughput (req/s): 0.29
Output token throughput (tok/s): 21.82
Total Token throughput (tok/s): 2026.85
---------------Time to First Token----------------
Mean TTFT (ms): 105.31
Median TTFT (ms): 105.31
P99 TTFT (ms): 105.31
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 45.02
Median TPOT (ms): 45.02
P99 TPOT (ms): 45.02
---------------Inter-token Latency----------------
Mean ITL (ms): 44.42
Median ITL (ms): 44.98
P99 ITL (ms): 45.55
==================================================
INFO 06-12 01:09:04 [__init__.py:239] Automatically detected platform cuda.
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00, 3.49s/it]
============ Serving Benchmark Result ============
Successful requests: 1
Benchmark duration (s): 3.49
Total input tokens: 6984
Total generated tokens: 76
Request throughput (req/s): 0.29
Output token throughput (tok/s): 21.78
Total Token throughput (tok/s): 2023.20
---------------Time to First Token----------------
Mean TTFT (ms): 103.25
Median TTFT (ms): 103.25
P99 TTFT (ms): 103.25
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 45.13
Median TPOT (ms): 45.13
P99 TPOT (ms): 45.13
---------------Inter-token Latency----------------
Mean ITL (ms): 44.54
Median ITL (ms): 45.06
P99 ITL (ms): 47.85
==================================================
INFO 06-12 01:09:18 [__init__.py:239] Automatically detected platform cuda.
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00, 3.49s/it]
============ Serving Benchmark Result ============
Successful requests: 1
Benchmark duration (s): 3.49
Total input tokens: 6984
Total generated tokens: 76
Request throughput (req/s): 0.29
Output token throughput (tok/s): 21.79
Total Token throughput (tok/s): 2024.62
---------------Time to First Token----------------
Mean TTFT (ms): 100.20
Median TTFT (ms): 100.20
P99 TTFT (ms): 100.20
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 45.13
Median TPOT (ms): 45.13
P99 TPOT (ms): 45.13
---------------Inter-token Latency----------------
Mean ITL (ms): 44.54
Median ITL (ms): 45.06
P99 ITL (ms): 46.50
==================================================
INFO 06-12 01:09:32 [__init__.py:239] Automatically detected platform cuda.
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00, 3.53s/it]
============ Serving Benchmark Result ============
Successful requests: 1
Benchmark duration (s): 3.54
Total input tokens: 6984
Total generated tokens: 76
Request throughput (req/s): 0.28
Output token throughput (tok/s): 21.50
Total Token throughput (tok/s): 1997.10
---------------Time to First Token----------------
Mean TTFT (ms): 112.50
Median TTFT (ms): 112.50
P99 TTFT (ms): 112.50
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 45.61
Median TPOT (ms): 45.61
P99 TPOT (ms): 45.61
---------------Inter-token Latency----------------
Mean ITL (ms): 45.01
Median ITL (ms): 45.05
P99 ITL (ms): 55.07
==================================================
INFO 06-12 01:09:46 [__init__.py:239] Automatically detected platform cuda.
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00, 3.49s/it]
============ Serving Benchmark Result ============
Successful requests: 1
Benchmark duration (s): 3.49
Total input tokens: 6984
Total generated tokens: 76
Request throughput (req/s): 0.29
Output token throughput (tok/s): 21.79
Total Token throughput (tok/s): 2024.17
---------------Time to First Token----------------
Mean TTFT (ms): 95.27
Median TTFT (ms): 95.27
P99 TTFT (ms): 95.27
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 45.21
Median TPOT (ms): 45.21
P99 TPOT (ms): 45.21
---------------Inter-token Latency----------------
Mean ITL (ms): 44.61
Median ITL (ms): 45.06
P99 ITL (ms): 48.05
==================================================
INFO 06-12 01:10:00 [__init__.py:239] Automatically detected platform cuda.
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00, 3.50s/it]
============ Serving Benchmark Result ============
Successful requests: 1
Benchmark duration (s): 3.50
Total input tokens: 6984
Total generated tokens: 76
Request throughput (req/s): 0.29
Output token throughput (tok/s): 21.70
Total Token throughput (tok/s): 2016.01
---------------Time to First Token----------------
Mean TTFT (ms): 108.49
Median TTFT (ms): 108.49
P99 TTFT (ms): 108.49
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 45.22
Median TPOT (ms): 45.22
P99 TPOT (ms): 45.22
---------------Inter-token Latency----------------
Mean ITL (ms): 44.63
Median ITL (ms): 45.11
P99 ITL (ms): 47.00
==================================================
INFO 06-12 01:10:14 [__init__.py:239] Automatically detected platform cuda.
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00, 3.49s/it]
============ Serving Benchmark Result ============
Successful requests: 1
Benchmark duration (s): 3.49
Total input tokens: 6984
Total generated tokens: 76
Request throughput (req/s): 0.29
Output token throughput (tok/s): 21.78
Total Token throughput (tok/s): 2023.67
---------------Time to First Token----------------
Mean TTFT (ms): 103.82
Median TTFT (ms): 103.82
P99 TTFT (ms): 103.82
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 45.11
Median TPOT (ms): 45.11
P99 TPOT (ms): 45.11
---------------Inter-token Latency----------------
Mean ITL (ms): 44.51
Median ITL (ms): 45.05
P99 ITL (ms): 45.78
==================================================
INFO 06-12 01:10:27 [__init__.py:239] Automatically detected platform cuda.
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00, 3.49s/it]
============ Serving Benchmark Result ============
Successful requests: 1
Benchmark duration (s): 3.49
Total input tokens: 6984
Total generated tokens: 76
Request throughput (req/s): 0.29
Output token throughput (tok/s): 21.75
Total Token throughput (tok/s): 2020.42
---------------Time to First Token----------------
Mean TTFT (ms): 99.14
Median TTFT (ms): 99.14
P99 TTFT (ms): 99.14
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 45.24
Median TPOT (ms): 45.24
P99 TPOT (ms): 45.24
---------------Inter-token Latency----------------
Mean ITL (ms): 44.65
Median ITL (ms): 45.08
P99 ITL (ms): 47.79
==================================================
root@2f77277da063:/vllm-workspace#
Results of the second method: TTFT has climbed to 5000+ ms.
root@2f77277da063:/vllm-workspace# python3 /vllm-workspace/benchmarks/benchmark_serving.py --backend openai-chat --model /data/models/Qwen2.5-72B --served-model-name Qwen2.5-72B --endpoint /v1/chat/completions --port 8080 --dataset_name random --random-input-len 7000 --random-output-len 3000 --random-range-ratio 0.1 --num-prompts 10 --max-concurrency 1
INFO 06-12 01:14:01 [__init__.py:239] Automatically detected platform cuda.
Namespace(backend='openai-chat', base_url=None, host='127.0.0.1', port=8080, endpoint='/v1/chat/completions', dataset_name='random', dataset_path=None, max_concurrency=1, model='/data/models/Qwen2.5-72B', tokenizer=None, use_beam_search=False, num_prompts=10, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=7000, random_output_len=3000, random_range_ratio=0.1, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name='Qwen2.5-72B', lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [01:33<00:00, 9.36s/it]
============ Serving Benchmark Result ============
Successful requests: 10
Benchmark duration (s): 93.60
Total input tokens: 71443
Total generated tokens: 931
Request throughput (req/s): 0.11
Output token throughput (tok/s): 9.95
Total Token throughput (tok/s): 773.21
---------------Time to First Token----------------
Mean TTFT (ms): 5188.91
Median TTFT (ms): 5623.70
P99 TTFT (ms): 6942.57
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 45.27
Median TPOT (ms): 45.23
P99 TPOT (ms): 45.53
---------------Inter-token Latency----------------
Mean ITL (ms): 44.79
Median ITL (ms): 45.14
P99 ITL (ms): 48.30
==================================================
root@2f77277da063:/vllm-workspace#
With concurrency set to 1, are multiple requests executed sequentially, one at a time (i.e., is the next request issued only after the previous one finishes)?
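A minimal sketch of the common pattern behind a `--max-concurrency 1` client: all requests are created up front (request rate = inf), but each must acquire an `asyncio.Semaphore(1)` before it actually runs, so they execute strictly one after another. This is an illustrative assumption about how such a limiter typically works, not a verbatim excerpt of `benchmark_serving.py`:

```python
# Sketch (assumed pattern): an asyncio semaphore with limit 1
# serializes concurrently-submitted requests.
import asyncio

order = []  # records (request_id, event) to show serialization

async def fake_request(req_id: int, sem: asyncio.Semaphore):
    # Every request task exists from the start, but it can only
    # proceed once it acquires the semaphore.
    async with sem:
        order.append((req_id, "start"))
        await asyncio.sleep(0.01)  # stand-in for one full inference
        order.append((req_id, "end"))

async def main():
    sem = asyncio.Semaphore(1)  # analogous to --max-concurrency 1
    await asyncio.gather(*(fake_request(i, sem) for i in range(3)))

asyncio.run(main())

# Each request's "start" appears only after the previous "end":
# [(0, 'start'), (0, 'end'), (1, 'start'), (1, 'end'), (2, 'start'), (2, 'end')]
print(order)
```

If the benchmark timestamps a request anywhere before the semaphore is released by its predecessor, the measured TTFT absorbs queueing/scheduling delay on top of the server-side prefill time, which could explain why the second method reports multi-second TTFT while per-run TTFT stays near 100 ms. Whether that is what `benchmark_serving.py` does depends on where exactly it records each request's start time.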