The following two approaches produce very different results.

Approach 1

for i in {1..10}; do
python3 /vllm-workspace/benchmarks/benchmark_serving.py     --backend openai-chat     --model /data/models/Qwen2.5-72B     --served-model-name Qwen2.5-72B     --endpoint /v1/chat/completions     --port 8080     --dataset_name random     --random-input-len 7000     --random-output-len 3000     --random-range-ratio 0.1     --num-prompts 1     --max-concurrency 1
done

Approach 2

python3 /vllm-workspace/benchmarks/benchmark_serving.py     --backend openai-chat     --model /data/models/Qwen2.5-72B     --served-model-name Qwen2.5-72B     --endpoint /v1/chat/completions     --port 8080     --dataset_name random     --random-input-len 7000     --random-output-len 3000     --random-range-ratio 0.1     --num-prompts 10    --max-concurrency 1

What should I pay attention to when using num-prompts, and how should the value be set?

The num-prompts parameter determines how many prompts a benchmark run sends. Looping a single-prompt run (for i in {1..10} … --num-prompts 1) and submitting one batch (--num-prompts 10) both issue 10 requests, but the latter hands all of them to the benchmark at once, which can let vLLM batch and schedule them more effectively and improve throughput and resource utilization. According to the official docs, increasing num-prompts generally improves measured performance until the hardware limit is reached.

It is recommended to set num-prompts to a fairly large value (e.g. 1000) so vLLM can exercise its batching and throughput optimizations, while making sure GPU memory limits are not exceeded. The practical maximum depends on the model size and available memory and is best found by repeated testing. Let me know if you need more detailed tuning advice.
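For example, a single larger run that also saves its results for later comparison could look like the sketch below (--save-result and --result-filename are flags that appear in the script's argument list in your logs; the num-prompts and max-concurrency values here are purely illustrative):

python3 /vllm-workspace/benchmarks/benchmark_serving.py \
    --backend openai-chat \
    --model /data/models/Qwen2.5-72B \
    --served-model-name Qwen2.5-72B \
    --endpoint /v1/chat/completions \
    --port 8080 \
    --dataset_name random \
    --random-input-len 7000 \
    --random-output-len 3000 \
    --random-range-ratio 0.1 \
    --num-prompts 200 \
    --max-concurrency 8 \
    --save-result \
    --result-filename qwen72b_in7000_out3000_c8.json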

Sources:

Results for approach 1: all 10 runs are very stable, with TTFT around 100 ms.

root@2f77277da063:/vllm-workspace# for i in {1..10}; do
> python3 /vllm-workspace/benchmarks/benchmark_serving.py     --backend openai-chat     --model /data/models/Qwen2.5-72B     --served-model-name Qwen2.5-72B     --endpoint /v1/chat/completions     --port 8080     --dataset_name random     --random-input-len 7000     --random-output-len 3000     --random-range-ratio 0.1     --num-prompts 1     --max-concurrency 1
> done
INFO 06-12 01:08:18 [__init__.py:239] Automatically detected platform cuda.
Namespace(backend='openai-chat', base_url=None, host='127.0.0.1', port=8080, endpoint='/v1/chat/completions', dataset_name='random', dataset_path=None, max_concurrency=1, model='/data/models/Qwen2.5-72B', tokenizer=None, use_beam_search=False, num_prompts=1, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=7000, random_output_len=3000, random_range_ratio=0.1, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name='Qwen2.5-72B', lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.48s/it]
============ Serving Benchmark Result ============
Successful requests:                     1         
Benchmark duration (s):                  3.48      
Total input tokens:                      6984      
Total generated tokens:                  76        
Request throughput (req/s):              0.29      
Output token throughput (tok/s):         21.81     
Total Token throughput (tok/s):          2026.39   
---------------Time to First Token----------------
Mean TTFT (ms):                          101.93    
Median TTFT (ms):                        101.93    
P99 TTFT (ms):                           101.93    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          45.07     
Median TPOT (ms):                        45.07     
P99 TPOT (ms):                           45.07     
---------------Inter-token Latency----------------
Mean ITL (ms):                           44.47     
Median ITL (ms):                         45.01     
P99 ITL (ms):                            46.96     
==================================================
INFO 06-12 01:08:37 [__init__.py:239] Automatically detected platform cuda.
Namespace(backend='openai-chat', base_url=None, host='127.0.0.1', port=8080, endpoint='/v1/chat/completions', dataset_name='random', dataset_path=None, max_concurrency=1, model='/data/models/Qwen2.5-72B', tokenizer=None, use_beam_search=False, num_prompts=1, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=7000, random_output_len=3000, random_range_ratio=0.1, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name='Qwen2.5-72B', lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.49s/it]
============ Serving Benchmark Result ============
Successful requests:                     1         
Benchmark duration (s):                  3.49      
Total input tokens:                      6984      
Total generated tokens:                  76        
Request throughput (req/s):              0.29      
Output token throughput (tok/s):         21.77     
Total Token throughput (tok/s):          2022.51   
---------------Time to First Token----------------
Mean TTFT (ms):                          106.18    
Median TTFT (ms):                        106.18    
P99 TTFT (ms):                           106.18    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          45.10     
Median TPOT (ms):                        45.10     
P99 TPOT (ms):                           45.10     
---------------Inter-token Latency----------------
Mean ITL (ms):                           44.51     
Median ITL (ms):                         44.95     
P99 ITL (ms):                            47.90     
==================================================
INFO 06-12 01:08:50 [__init__.py:239] Automatically detected platform cuda.
Namespace(backend='openai-chat', base_url=None, host='127.0.0.1', port=8080, endpoint='/v1/chat/completions', dataset_name='random', dataset_path=None, max_concurrency=1, model='/data/models/Qwen2.5-72B', tokenizer=None, use_beam_search=False, num_prompts=1, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=7000, random_output_len=3000, random_range_ratio=0.1, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name='Qwen2.5-72B', lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.48s/it]
============ Serving Benchmark Result ============
Successful requests:                     1         
Benchmark duration (s):                  3.48      
Total input tokens:                      6984      
Total generated tokens:                  76        
Request throughput (req/s):              0.29      
Output token throughput (tok/s):         21.82     
Total Token throughput (tok/s):          2026.85   
---------------Time to First Token----------------
Mean TTFT (ms):                          105.31    
Median TTFT (ms):                        105.31    
P99 TTFT (ms):                           105.31    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          45.02     
Median TPOT (ms):                        45.02     
P99 TPOT (ms):                           45.02     
---------------Inter-token Latency----------------
Mean ITL (ms):                           44.42     
Median ITL (ms):                         44.98     
P99 ITL (ms):                            45.55     
==================================================
INFO 06-12 01:09:04 [__init__.py:239] Automatically detected platform cuda.
Namespace(backend='openai-chat', base_url=None, host='127.0.0.1', port=8080, endpoint='/v1/chat/completions', dataset_name='random', dataset_path=None, max_concurrency=1, model='/data/models/Qwen2.5-72B', tokenizer=None, use_beam_search=False, num_prompts=1, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=7000, random_output_len=3000, random_range_ratio=0.1, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name='Qwen2.5-72B', lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.49s/it]
============ Serving Benchmark Result ============
Successful requests:                     1         
Benchmark duration (s):                  3.49      
Total input tokens:                      6984      
Total generated tokens:                  76        
Request throughput (req/s):              0.29      
Output token throughput (tok/s):         21.78     
Total Token throughput (tok/s):          2023.20   
---------------Time to First Token----------------
Mean TTFT (ms):                          103.25    
Median TTFT (ms):                        103.25    
P99 TTFT (ms):                           103.25    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          45.13     
Median TPOT (ms):                        45.13     
P99 TPOT (ms):                           45.13     
---------------Inter-token Latency----------------
Mean ITL (ms):                           44.54     
Median ITL (ms):                         45.06     
P99 ITL (ms):                            47.85     
==================================================
INFO 06-12 01:09:18 [__init__.py:239] Automatically detected platform cuda.
Namespace(backend='openai-chat', base_url=None, host='127.0.0.1', port=8080, endpoint='/v1/chat/completions', dataset_name='random', dataset_path=None, max_concurrency=1, model='/data/models/Qwen2.5-72B', tokenizer=None, use_beam_search=False, num_prompts=1, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=7000, random_output_len=3000, random_range_ratio=0.1, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name='Qwen2.5-72B', lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.49s/it]
============ Serving Benchmark Result ============
Successful requests:                     1         
Benchmark duration (s):                  3.49      
Total input tokens:                      6984      
Total generated tokens:                  76        
Request throughput (req/s):              0.29      
Output token throughput (tok/s):         21.79     
Total Token throughput (tok/s):          2024.62   
---------------Time to First Token----------------
Mean TTFT (ms):                          100.20    
Median TTFT (ms):                        100.20    
P99 TTFT (ms):                           100.20    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          45.13     
Median TPOT (ms):                        45.13     
P99 TPOT (ms):                           45.13     
---------------Inter-token Latency----------------
Mean ITL (ms):                           44.54     
Median ITL (ms):                         45.06     
P99 ITL (ms):                            46.50     
==================================================
INFO 06-12 01:09:32 [__init__.py:239] Automatically detected platform cuda.
Namespace(backend='openai-chat', base_url=None, host='127.0.0.1', port=8080, endpoint='/v1/chat/completions', dataset_name='random', dataset_path=None, max_concurrency=1, model='/data/models/Qwen2.5-72B', tokenizer=None, use_beam_search=False, num_prompts=1, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=7000, random_output_len=3000, random_range_ratio=0.1, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name='Qwen2.5-72B', lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.53s/it]
============ Serving Benchmark Result ============
Successful requests:                     1         
Benchmark duration (s):                  3.54      
Total input tokens:                      6984      
Total generated tokens:                  76        
Request throughput (req/s):              0.28      
Output token throughput (tok/s):         21.50     
Total Token throughput (tok/s):          1997.10   
---------------Time to First Token----------------
Mean TTFT (ms):                          112.50    
Median TTFT (ms):                        112.50    
P99 TTFT (ms):                           112.50    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          45.61     
Median TPOT (ms):                        45.61     
P99 TPOT (ms):                           45.61     
---------------Inter-token Latency----------------
Mean ITL (ms):                           45.01     
Median ITL (ms):                         45.05     
P99 ITL (ms):                            55.07     
==================================================
INFO 06-12 01:09:46 [__init__.py:239] Automatically detected platform cuda.
Namespace(backend='openai-chat', base_url=None, host='127.0.0.1', port=8080, endpoint='/v1/chat/completions', dataset_name='random', dataset_path=None, max_concurrency=1, model='/data/models/Qwen2.5-72B', tokenizer=None, use_beam_search=False, num_prompts=1, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=7000, random_output_len=3000, random_range_ratio=0.1, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name='Qwen2.5-72B', lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.49s/it]
============ Serving Benchmark Result ============
Successful requests:                     1         
Benchmark duration (s):                  3.49      
Total input tokens:                      6984      
Total generated tokens:                  76        
Request throughput (req/s):              0.29      
Output token throughput (tok/s):         21.79     
Total Token throughput (tok/s):          2024.17   
---------------Time to First Token----------------
Mean TTFT (ms):                          95.27     
Median TTFT (ms):                        95.27     
P99 TTFT (ms):                           95.27     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          45.21     
Median TPOT (ms):                        45.21     
P99 TPOT (ms):                           45.21     
---------------Inter-token Latency----------------
Mean ITL (ms):                           44.61     
Median ITL (ms):                         45.06     
P99 ITL (ms):                            48.05     
==================================================
INFO 06-12 01:10:00 [__init__.py:239] Automatically detected platform cuda.
Namespace(backend='openai-chat', base_url=None, host='127.0.0.1', port=8080, endpoint='/v1/chat/completions', dataset_name='random', dataset_path=None, max_concurrency=1, model='/data/models/Qwen2.5-72B', tokenizer=None, use_beam_search=False, num_prompts=1, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=7000, random_output_len=3000, random_range_ratio=0.1, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name='Qwen2.5-72B', lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.50s/it]
============ Serving Benchmark Result ============
Successful requests:                     1         
Benchmark duration (s):                  3.50      
Total input tokens:                      6984      
Total generated tokens:                  76        
Request throughput (req/s):              0.29      
Output token throughput (tok/s):         21.70     
Total Token throughput (tok/s):          2016.01   
---------------Time to First Token----------------
Mean TTFT (ms):                          108.49    
Median TTFT (ms):                        108.49    
P99 TTFT (ms):                           108.49    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          45.22     
Median TPOT (ms):                        45.22     
P99 TPOT (ms):                           45.22     
---------------Inter-token Latency----------------
Mean ITL (ms):                           44.63     
Median ITL (ms):                         45.11     
P99 ITL (ms):                            47.00     
==================================================
INFO 06-12 01:10:14 [__init__.py:239] Automatically detected platform cuda.
Namespace(backend='openai-chat', base_url=None, host='127.0.0.1', port=8080, endpoint='/v1/chat/completions', dataset_name='random', dataset_path=None, max_concurrency=1, model='/data/models/Qwen2.5-72B', tokenizer=None, use_beam_search=False, num_prompts=1, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=7000, random_output_len=3000, random_range_ratio=0.1, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name='Qwen2.5-72B', lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.49s/it]
============ Serving Benchmark Result ============
Successful requests:                     1         
Benchmark duration (s):                  3.49      
Total input tokens:                      6984      
Total generated tokens:                  76        
Request throughput (req/s):              0.29      
Output token throughput (tok/s):         21.78     
Total Token throughput (tok/s):          2023.67   
---------------Time to First Token----------------
Mean TTFT (ms):                          103.82    
Median TTFT (ms):                        103.82    
P99 TTFT (ms):                           103.82    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          45.11     
Median TPOT (ms):                        45.11     
P99 TPOT (ms):                           45.11     
---------------Inter-token Latency----------------
Mean ITL (ms):                           44.51     
Median ITL (ms):                         45.05     
P99 ITL (ms):                            45.78     
==================================================
INFO 06-12 01:10:27 [__init__.py:239] Automatically detected platform cuda.
Namespace(backend='openai-chat', base_url=None, host='127.0.0.1', port=8080, endpoint='/v1/chat/completions', dataset_name='random', dataset_path=None, max_concurrency=1, model='/data/models/Qwen2.5-72B', tokenizer=None, use_beam_search=False, num_prompts=1, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=7000, random_output_len=3000, random_range_ratio=0.1, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name='Qwen2.5-72B', lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.49s/it]
============ Serving Benchmark Result ============
Successful requests:                     1         
Benchmark duration (s):                  3.49      
Total input tokens:                      6984      
Total generated tokens:                  76        
Request throughput (req/s):              0.29      
Output token throughput (tok/s):         21.75     
Total Token throughput (tok/s):          2020.42   
---------------Time to First Token----------------
Mean TTFT (ms):                          99.14     
Median TTFT (ms):                        99.14     
P99 TTFT (ms):                           99.14     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          45.24     
Median TPOT (ms):                        45.24     
P99 TPOT (ms):                           45.24     
---------------Inter-token Latency----------------
Mean ITL (ms):                           44.65     
Median ITL (ms):                         45.08     
P99 ITL (ms):                            47.79     
==================================================
root@2f77277da063:/vllm-workspace# 

Results for approach 2: TTFT has reached 5000+ ms.

root@2f77277da063:/vllm-workspace# python3 /vllm-workspace/benchmarks/benchmark_serving.py     --backend openai-chat     --model /data/models/Qwen2.5-72B     --served-model-name Qwen2.5-72B     --endpoint /v1/chat/completions     --port 8080     --dataset_name random     --random-input-len 7000     --random-output-len 3000     --random-range-ratio 0.1     --num-prompts 10     --max-concurrency 1
INFO 06-12 01:14:01 [__init__.py:239] Automatically detected platform cuda.
Namespace(backend='openai-chat', base_url=None, host='127.0.0.1', port=8080, endpoint='/v1/chat/completions', dataset_name='random', dataset_path=None, max_concurrency=1, model='/data/models/Qwen2.5-72B', tokenizer=None, use_beam_search=False, num_prompts=10, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=7000, random_output_len=3000, random_range_ratio=0.1, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name='Qwen2.5-72B', lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [01:33<00:00,  9.36s/it]
============ Serving Benchmark Result ============
Successful requests:                     10        
Benchmark duration (s):                  93.60     
Total input tokens:                      71443     
Total generated tokens:                  931       
Request throughput (req/s):              0.11      
Output token throughput (tok/s):         9.95      
Total Token throughput (tok/s):          773.21    
---------------Time to First Token----------------
Mean TTFT (ms):                          5188.91   
Median TTFT (ms):                        5623.70   
P99 TTFT (ms):                           6942.57   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          45.27     
Median TPOT (ms):                        45.23     
P99 TPOT (ms):                           45.53     
---------------Inter-token Latency----------------
Mean ITL (ms):                           44.79     
Median ITL (ms):                         45.14     
P99 ITL (ms):                            48.30     
==================================================
root@2f77277da063:/vllm-workspace# 

With concurrency set to 1, are multiple requests executed strictly one after another (the next request is issued only after the previous one finishes)?

Yes. When you set --max-concurrency 1, requests to vLLM's OpenAI-compatible API are processed strictly serially: the next request starts only after the current one has completely finished. With --num-prompts 10, the 10 requests are queued and handled one by one, so the later requests show a very high TTFT (Time To First Token), such as the 5000+ ms you observed; this is essentially time spent waiting in the queue.

In this mode requests never run concurrently: each one starts only after the previous one ends. That is similar to your for loop of 10 runs with num-prompts=1, except that the loop measures only one request per run, so TTFT stays low because there is no queueing.
If you want concurrent processing and lower TTFT, raise --max-concurrency so that multiple requests can be scheduled and prefilled at the same time, making full use of vLLM's batching and concurrency; see the official docs for the detailed mechanism.
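For example, a sketch of the same command with a higher concurrency limit (the value 10 is only illustrative):

python3 /vllm-workspace/benchmarks/benchmark_serving.py \
    --backend openai-chat \
    --model /data/models/Qwen2.5-72B \
    --served-model-name Qwen2.5-72B \
    --endpoint /v1/chat/completions \
    --port 8080 \
    --dataset_name random \
    --random-input-len 7000 \
    --random-output-len 3000 \
    --random-range-ratio 0.1 \
    --num-prompts 10 \
    --max-concurrency 10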

Do you need more detailed concurrency tuning advice?

Sources:

When testing TTFT at different concurrency levels: if the number of requests is greater than the concurrency, requests queue up and TTFT grows; if the number of requests equals the concurrency, does each concurrency slot map to exactly one request, giving a reasonable TTFT? How should the number of requests and the concurrency be set?

Your understanding is correct: if the number of requests (num-prompts) is greater than the concurrency (max-concurrency), requests queue, and the TTFT (first-token latency) of later requests grows because they must wait for earlier requests to finish. If the number of requests equals the concurrency, every request can be scheduled immediately, TTFT is minimal and uniform, each request makes full use of vLLM's batching, and the TTFT figures are the most representative. related discussion

Recommendations:

  • Set the concurrency (max-concurrency) to the maximum number of requests you expect to serve simultaneously, bounded by GPU memory and hardware resources.
  • For testing, the number of requests (num-prompts) can be equal to or greater than the concurrency, which makes queueing and scheduling effects easy to observe.
  • For production-like settings, keep max-concurrency slightly below the GPU's limit to preserve stability and low TTFT.
    Let me know if you need more detailed tuning advice or code examples, such as the sweep sketched below.
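A minimal sketch of such a sweep, keeping --num-prompts equal to --max-concurrency at each step (the list of concurrency values is only illustrative):

for c in 1 2 4 8; do
  python3 /vllm-workspace/benchmarks/benchmark_serving.py \
      --backend openai-chat \
      --model /data/models/Qwen2.5-72B \
      --served-model-name Qwen2.5-72B \
      --endpoint /v1/chat/completions \
      --port 8080 \
      --dataset_name random \
      --random-input-len 7000 \
      --random-output-len 3000 \
      --random-range-ratio 0.1 \
      --num-prompts "$c" \
      --max-concurrency "$c"
done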

Sources:

Running the same command with input 3000 and output 1000 several times in a row: the TTFT of runs 2 through N is roughly identical and clearly better than that of run 1.

root@2f77277da063:/vllm-workspace# python3 /vllm-workspace/benchmarks/benchmark_serving.py     --backend openai-chat     --model /data/models/Qwen2.5-72B     --served-model-name Qwen2.5-72B     --endpoint /v1/chat/completions     --port 8080     --dataset_name random     --random-input-len 3000     --random-output-len 1000     --random-range-ratio 0.1     --num-prompts 10     --max-concurrency 1
INFO 06-12 01:30:34 [__init__.py:239] Automatically detected platform cuda.
Namespace(backend='openai-chat', base_url=None, host='127.0.0.1', port=8080, endpoint='/v1/chat/completions', dataset_name='random', dataset_path=None, max_concurrency=1, model='/data/models/Qwen2.5-72B', tokenizer=None, use_beam_search=False, num_prompts=10, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=3000, random_output_len=1000, random_range_ratio=0.1, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name='Qwen2.5-72B', lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:49<00:00,  4.95s/it]
============ Serving Benchmark Result ============
Successful requests:                     10        
Benchmark duration (s):                  49.52     
Total input tokens:                      30533     
Total generated tokens:                  1048      
Request throughput (req/s):              0.20      
Output token throughput (tok/s):         21.16     
Total Token throughput (tok/s):          637.72    
---------------Time to First Token----------------
Mean TTFT (ms):                          315.81    
Median TTFT (ms):                        104.88    
P99 TTFT (ms):                           2034.00   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          44.68     
Median TPOT (ms):                        44.66     
P99 TPOT (ms):                           44.93     
---------------Inter-token Latency----------------
Mean ITL (ms):                           44.23     
Median ITL (ms):                         44.55     
P99 ITL (ms):                            47.28     
==================================================
root@2f77277da063:/vllm-workspace# 
root@2f77277da063:/vllm-workspace# 
root@2f77277da063:/vllm-workspace# 
root@2f77277da063:/vllm-workspace# python3 /vllm-workspace/benchmarks/benchmark_serving.py     --backend openai-chat     --model /data/models/Qwen2.5-72B     --served-model-name Qwen2.5-72B     --endpoint /v1/chat/completions     --port 8080     --dataset_name random     --random-input-len 3000     --random-output-len 1000     --random-range-ratio 0.1     --num-prompts 10     --max-concurrency 1
INFO 06-12 01:31:42 [__init__.py:239] Automatically detected platform cuda.
Namespace(backend='openai-chat', base_url=None, host='127.0.0.1', port=8080, endpoint='/v1/chat/completions', dataset_name='random', dataset_path=None, max_concurrency=1, model='/data/models/Qwen2.5-72B', tokenizer=None, use_beam_search=False, num_prompts=10, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=3000, random_output_len=1000, random_range_ratio=0.1, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name='Qwen2.5-72B', lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:46<00:00,  4.66s/it]
============ Serving Benchmark Result ============
Successful requests:                     10        
Benchmark duration (s):                  46.64     
Total input tokens:                      30533     
Total generated tokens:                  1034      
Request throughput (req/s):              0.21      
Output token throughput (tok/s):         22.17     
Total Token throughput (tok/s):          676.80    
---------------Time to First Token----------------
Mean TTFT (ms):                          89.32     
Median TTFT (ms):                        89.74     
P99 TTFT (ms):                           116.36    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          44.68     
Median TPOT (ms):                        44.67     
P99 TPOT (ms):                           44.87     
---------------Inter-token Latency----------------
Mean ITL (ms):                           44.23     
Median ITL (ms):                         44.58     
P99 ITL (ms):                            46.35     
==================================================
root@2f77277da063:/vllm-workspace# 

Running the same command with input 7000 and output 3000 several times in a row: TTFT is large in every run, and runs 2 through N show no improvement over run 1.

root@2f77277da063:/vllm-workspace# python3 /vllm-workspace/benchmarks/benchmark_serving.py     --backend openai-chat     --model /data/models/Qwen2.5-72B     --served-model-name Qwen2.5-72B     --endpoint /v1/chat/completions     --port 8080     --dataset_name random     --random-input-len 7000     --random-output-len 3000     --random-range-ratio 0.1     --num-prompts 10     --max-concurrency 1
INFO 06-12 01:35:13 [__init__.py:239] Automatically detected platform cuda.
Namespace(backend='openai-chat', base_url=None, host='127.0.0.1', port=8080, endpoint='/v1/chat/completions', dataset_name='random', dataset_path=None, max_concurrency=1, model='/data/models/Qwen2.5-72B', tokenizer=None, use_beam_search=False, num_prompts=10, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=7000, random_output_len=3000, random_range_ratio=0.1, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name='Qwen2.5-72B', lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [01:15<00:00,  7.54s/it]
============ Serving Benchmark Result ============
Successful requests:                     10        
Benchmark duration (s):                  75.38     
Total input tokens:                      71443     
Total generated tokens:                  895       
Request throughput (req/s):              0.13      
Output token throughput (tok/s):         11.87     
Total Token throughput (tok/s):          959.60    
---------------Time to First Token----------------
Mean TTFT (ms):                          3545.60   
Median TTFT (ms):                        3496.10   
P99 TTFT (ms):                           5686.57   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          45.11     
Median TPOT (ms):                        45.08     
P99 TPOT (ms):                           45.33     
---------------Inter-token Latency----------------
Mean ITL (ms):                           44.60     
Median ITL (ms):                         45.01     
P99 ITL (ms):                            46.92     
==================================================
root@2f77277da063:/vllm-workspace# 
root@2f77277da063:/vllm-workspace# 
root@2f77277da063:/vllm-workspace# python3 /vllm-workspace/benchmarks/benchmark_serving.py     --backend openai-chat     --model /data/models/Qwen2.5-72B     --served-model-name Qwen2.5-72B     --endpoint /v1/chat/completions     --port 8080     --dataset_name random     --random-input-len 7000     --random-output-len 3000     --random-range-ratio 0.1     --num-prompts 10     --max-concurrency 1
INFO 06-12 01:36:48 [__init__.py:239] Automatically detected platform cuda.
Namespace(backend='openai-chat', base_url=None, host='127.0.0.1', port=8080, endpoint='/v1/chat/completions', dataset_name='random', dataset_path=None, max_concurrency=1, model='/data/models/Qwen2.5-72B', tokenizer=None, use_beam_search=False, num_prompts=10, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=7000, random_output_len=3000, random_range_ratio=0.1, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name='Qwen2.5-72B', lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [01:33<00:00,  9.36s/it]
============ Serving Benchmark Result ============
Successful requests:                     10        
Benchmark duration (s):                  93.56     
Total input tokens:                      71443     
Total generated tokens:                  931       
Request throughput (req/s):              0.11      
Output token throughput (tok/s):         9.95      
Total Token throughput (tok/s):          773.52    
---------------Time to First Token----------------
Mean TTFT (ms):                          5200.45   
Median TTFT (ms):                        5637.88   
P99 TTFT (ms):                           6967.19   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          45.13     
Median TPOT (ms):                        45.09     
P99 TPOT (ms):                           45.53     
---------------Inter-token Latency----------------
Mean ITL (ms):                           44.63     
Median ITL (ms):                         45.03     
P99 ITL (ms):                            46.49     
==================================================
root@2f77277da063:/vllm-workspace# 
root@2f77277da063:/vllm-workspace# 
root@2f77277da063:/vllm-workspace# python3 /vllm-workspace/benchmarks/benchmark_serving.py     --backend openai-chat     --model /data/models/Qwen2.5-72B     --served-model-name Qwen2.5-72B     --endpoint /v1/chat/completions     --port 8080     --dataset_name random     --random-input-len 7000     --random-output-len 3000     --random-range-ratio 0.1     --num-prompts 10     --max-concurrency 1
INFO 06-12 01:39:31 [__init__.py:239] Automatically detected platform cuda.
Namespace(backend='openai-chat', base_url=None, host='127.0.0.1', port=8080, endpoint='/v1/chat/completions', dataset_name='random', dataset_path=None, max_concurrency=1, model='/data/models/Qwen2.5-72B', tokenizer=None, use_beam_search=False, num_prompts=10, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=7000, random_output_len=3000, random_range_ratio=0.1, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name='Qwen2.5-72B', lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [01:33<00:00,  9.36s/it]
============ Serving Benchmark Result ============
Successful requests:                     10        
Benchmark duration (s):                  93.60     
Total input tokens:                      71443     
Total generated tokens:                  931       
Request throughput (req/s):              0.11      
Output token throughput (tok/s):         9.95      
Total Token throughput (tok/s):          773.21    
---------------Time to First Token----------------
Mean TTFT (ms):                          5202.14   
Median TTFT (ms):                        5639.77   
P99 TTFT (ms):                           6966.44   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          45.15     
Median TPOT (ms):                        45.09     
P99 TPOT (ms):                           45.41     
---------------Inter-token Latency----------------
Mean ITL (ms):                           44.65     
Median ITL (ms):                         45.05     
P99 ITL (ms):                            46.49     
==================================================
root@2f77277da063:/vllm-workspace# 

What you observed: with input 3000 / output 1000, repeated runs show a clear TTFT (first-token latency) improvement from run 2 onwards; with input 7000 / output 3000, repeated runs show a consistently large TTFT with no improvement. This is typical vLLM behavior, mainly for the following reasons:

  1. For shorter inputs (around 3000 tokens), vLLM's prefix caching keeps the KV blocks computed during the first run, and later runs with identical or similar prompts hit that cache directly, dramatically lowering TTFT (the clear drop you see from run 2 onwards). related discussion
  2. For very long inputs (around 7000 tokens), KV cache space is limited or the cached blocks are not hit, so prefix caching has little effect and every run must do a full prefill, keeping TTFT high. official docs

Let me know if you need further analysis of cache hit rates, KV cache settings, or tuning advice.
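One related detail: all of your runs use the default --seed 0 (visible in the Namespace output), so the random dataset should generate the same prompts in every run (the identical "Total input tokens" across runs is consistent with this), which is exactly what makes cross-run prefix-cache hits possible. A minimal sketch that varies the seed per run, assuming you want each run to measure cold prefill instead of cache hits:

for seed in 1 2 3; do
  python3 /vllm-workspace/benchmarks/benchmark_serving.py \
      --backend openai-chat \
      --model /data/models/Qwen2.5-72B \
      --served-model-name Qwen2.5-72B \
      --endpoint /v1/chat/completions \
      --port 8080 \
      --dataset_name random \
      --random-input-len 3000 \
      --random-output-len 1000 \
      --random-range-ratio 0.1 \
      --num-prompts 10 \
      --max-concurrency 1 \
      --seed "$seed"
done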

Sources:

How can I check the effect of vLLM's prefix caching?

I can see the following output in the server-side logs:

INFO 06-12 01:40:15 [loggers.py:111] Engine 000: Avg prompt throughput: 744.2 tokens/s, Avg generation throughput: 10.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.0%, Prefix cache hit rate: 41.1%

How should this data be interpreted? Is higher better?

The "Prefix cache hit rate: 41.1%" you see is vLLM's prefix caching hit rate: of the most recent ~1k KV-cache block queries, 41.1% hit the cache directly and skipped recomputation. The higher the hit rate, the better prefix caching is working, and the bigger the improvements to TTFT and throughput. official docs

In general, a high hit rate (e.g. above 50%) means many requests share prefixes and the cache is well utilized; a low hit rate means request prefixes differ widely or cache space is insufficient. You can improve the hit rate by adjusting the workload or the cache-related parameters. Let me know if you need further analysis or Prometheus-based monitoring.
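A quick way to watch these counters outside the log line is to query the server's /metrics endpoint (a sketch; the exact metric names vary between vLLM versions, so grepping for "prefix" is the safest way to find them):

curl -s http://localhost:8080/metrics | grep -i prefix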

Sources:

A few questions:
1. At concurrency 1 the Total Token throughput (tok/s) is 1006.16, at concurrency 5 it is 479.70 // throughput at concurrency 5 is actually lower
2. The specified output length is 1000, but at concurrency 1 the result shows Total generated tokens: 86 // the specified output length seems to have no effect

Concurrency 1:

root@2f77277da063:/vllm-workspace# python3 /vllm-workspace/benchmarks/benchmark_serving.py     --backend openai-chat     --model /data/models/Qwen2.5-72B     --served-model-name Qwen2.5-72B     --endpoint /v1/chat/completions     --port 8080     --dataset_name random     --random-input-len 3500     --random-output-len 1000     --random-range-ratio 0.1     --num-prompts 1     --max-concurrency 1
INFO 06-12 02:11:07 [__init__.py:239] Automatically detected platform cuda.
Namespace(backend='openai-chat', base_url=None, host='127.0.0.1', port=8080, endpoint='/v1/chat/completions', dataset_name='random', dataset_path=None, max_concurrency=1, model='/data/models/Qwen2.5-72B', tokenizer=None, use_beam_search=False, num_prompts=1, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=3500, random_output_len=1000, random_range_ratio=0.1, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name='Qwen2.5-72B', lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.90s/it]
============ Serving Benchmark Result ============
Successful requests:                     1         
Benchmark duration (s):                  3.90      
Total input tokens:                      3834      
Total generated tokens:                  86        
Request throughput (req/s):              0.26      
Output token throughput (tok/s):         22.07     
Total Token throughput (tok/s):          1006.16   
---------------Time to First Token----------------
Mean TTFT (ms):                          81.52     
Median TTFT (ms):                        81.52     
P99 TTFT (ms):                           81.52     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          44.85     
Median TPOT (ms):                        44.85     
P99 TPOT (ms):                           44.85     
---------------Inter-token Latency----------------
Mean ITL (ms):                           44.33     
Median ITL (ms):                         44.71     
P99 ITL (ms):                            49.85     
==================================================
root@2f77277da063:/vllm-workspace# 

Concurrency 5:

root@2f77277da063:/vllm-workspace# python3 /vllm-workspace/benchmarks/benchmark_serving.py     --backend openai-chat     --model /data/models/Qwen2.5-72B     --served-model-name Qwen2.5-72B     --endpoint /v1/chat/completions     --port 8080     --dataset_name random     --random-input-len 3500     --random-output-len 1000     --random-range-ratio 0.1     --num-prompts 5     --max-concurrency 5
INFO 06-12 02:16:29 [__init__.py:239] Automatically detected platform cuda.
Namespace(backend='openai-chat', base_url=None, host='127.0.0.1', port=8080, endpoint='/v1/chat/completions', dataset_name='random', dataset_path=None, max_concurrency=5, model='/data/models/Qwen2.5-72B', tokenizer=None, use_beam_search=False, num_prompts=5, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=3500, random_output_len=1000, random_range_ratio=0.1, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name='Qwen2.5-72B', lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 5
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:40<00:00,  8.19s/it]
============ Serving Benchmark Result ============
Successful requests:                     5         
Benchmark duration (s):                  40.94     
Total input tokens:                      18173     
Total generated tokens:                  1468      
Request throughput (req/s):              0.12      
Output token throughput (tok/s):         35.85     
Total Token throughput (tok/s):          479.70    
---------------Time to First Token----------------
Mean TTFT (ms):                          176.67    
Median TTFT (ms):                        177.65    
P99 TTFT (ms):                           238.80    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          50.11     
Median TPOT (ms):                        50.94     
P99 TPOT (ms):                           52.04     
---------------Inter-token Latency----------------
Mean ITL (ms):                           47.76     
Median ITL (ms):                         47.60     
P99 ITL (ms):                            52.29     
==================================================
root@2f77277da063:/vllm-workspace# 
  1. Higher concurrency but lower Total Token throughput: note that Output token throughput actually rose (22.07 → 35.85 tok/s). Total Token throughput also counts input tokens: the concurrency-1 run generated only 86 tokens, so its 3.9 s duration is dominated by a single prefill and the 3834 input tokens inflate the figure, while the concurrency-5 run generated far more output (1468 tokens over 40.9 s), so it is decode-dominated and its input tokens are spread over a much longer duration. More generally, GPU utilization does not scale linearly with concurrency, and scheduling or memory bottlenecks can cap throughput, especially with long inputs/outputs, limited GPU memory, or an oversized batch. related discussion

  2. Output length set to 1000 but only 86 tokens generated: --random-output-len is only an upper bound on the output length. The actual number of generated tokens depends on the model's generation behavior, early stopping (e.g. emitting an EOS token), and the sampling parameters, so it usually will not be exactly 1000. You can check whether generation stopped early at EOS or another stop condition, or force full-length generation with --ignore-eos (see the sketch after this list). official docs
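A minimal sketch of forcing full-length output, assuming your vLLM server accepts the ignore_eos extra parameter (the benchmark script exposes the flag, as shown by ignore_eos=False in the Namespace dump above):

python3 /vllm-workspace/benchmarks/benchmark_serving.py \
    --backend openai-chat \
    --model /data/models/Qwen2.5-72B \
    --served-model-name Qwen2.5-72B \
    --endpoint /v1/chat/completions \
    --port 8080 \
    --dataset_name random \
    --random-input-len 3500 \
    --random-output-len 1000 \
    --random-range-ratio 0.1 \
    --num-prompts 5 \
    --max-concurrency 5 \
    --ignore-eos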

Do you need more detailed tuning advice or parameter explanations?

Sources:

If the benchmark command sets the number of requests and the maximum concurrency to the same value (--num-prompts = --max-concurrency), can --max-concurrency be omitted?

It is better not to omit it. --num-prompts is only the total number of requests the benchmark sends, while --max-concurrency caps how many requests are in flight at the same time. If --max-concurrency is omitted, the script applies no concurrency limit and the effective concurrency is governed only by the request rate (with the default --request-rate inf, all prompts are fired at once); with num-prompts equal to the intended concurrency that happens to produce the same behavior, but setting --max-concurrency explicitly keeps the concurrency controlled and reproducible even if the prompt count or request rate changes. official parameter docs

Recommendation: if you need precise control over concurrency, always set --max-concurrency explicitly so the number of in-flight requests is exactly what you intend.

Sources:

How can I obtain the actual concurrency during the benchmark, rather than the configured maximum concurrency?

The vLLM server log periodically prints "Running: X reqs", where X is the number of requests actually running at that moment, not the configured maximum. Watching how this field changes over time gives you the real concurrency at each point. official metrics docs

For programmatic collection, the Prometheus endpoint exposes the vllm:num_requests_running metric, which reflects the number of requests currently running.
Do you need concrete collection code or a Prometheus configuration example?
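A quick ad-hoc check without Prometheus is to poll the /metrics endpoint directly (a sketch, assuming the server from your logs listening on port 8080):

while true; do curl -s http://localhost:8080/metrics | grep num_requests_running; sleep 1; done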

Sources:

OK, please share a configuration example. Is a separate exporter needed?

vLLM exposes metrics such as the actual concurrency through a built-in Prometheus exporter: once the vLLM server is running, the /metrics path serves Prometheus-format data automatically, so no separate exporter needs to be deployed. official docs

An example Prometheus configuration (prometheus.yml) for scraping the vLLM metrics:

scrape_configs:
  - job_name: 'vllm'
    static_configs:
      - targets: ['localhost:8000']  # replace with your vLLM server address (e.g. localhost:8080 in your setup)

Grafana can add the Prometheus data source directly and import the official dashboard; no additional exporter is required. official setup guide
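Once Prometheus is scraping, the collected concurrency series can also be pulled programmatically through the Prometheus HTTP API (a sketch, assuming Prometheus runs on its default port 9090):

curl -s 'http://localhost:9090/api/v1/query?query=vllm:num_requests_running'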

Sources:

[Bugfix] Disable prefix caching by default for benchmark by cascade812 · Pull Request #18771 · vllm-project/vllm · GitHub: when running benchmarks, is it necessary to disable the prefix cache? What is the impact of enabling or disabling it?

When benchmarking, it is recommended to disable the prefix cache; otherwise the latency numbers are distorted and do not reflect the real inference performance. The vLLM project made this explicit in PR #18771: the prefix cache makes benchmark results (such as TTFT) look artificially good, so it should be explicitly disabled to obtain an accurate baseline.

Enabling the prefix cache significantly lowers TTFT and latency for requests with repeated prefixes, but that reflects the speedup of cache-hit scenarios rather than the model's underlying inference speed. Disabling it measures cold-start or worst-case performance, which is a more meaningful baseline.
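If you want to benchmark that cold path, one option is to start the server with prefix caching turned off, roughly as sketched below (the exact flag spelling depends on your vLLM version; where prefix caching is on by default the boolean option can usually be negated as shown, so check vllm serve --help for your build):

vllm serve /data/models/Qwen2.5-72B \
    --served-model-name Qwen2.5-72B \
    --port 8080 \
    --no-enable-prefix-caching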

Sources: