The following two approaches produce very different results.

Approach 1

for i in {1..10}; do
python3 /vllm-workspace/benchmarks/benchmark_serving.py     --backend openai-chat     --model /data/models/Qwen2.5-72B     --served-model-name Qwen2.5-72B     --endpoint /v1/chat/completions     --port 8080     --dataset_name random     --random-input-len 7000     --random-output-len 3000     --random-range-ratio 0.1     --num-prompts 1     --max-concurrency 1
done

Approach 2

python3 /vllm-workspace/benchmarks/benchmark_serving.py     --backend openai-chat     --model /data/models/Qwen2.5-72B     --served-model-name Qwen2.5-72B     --endpoint /v1/chat/completions     --port 8080     --dataset_name random     --random-input-len 7000     --random-output-len 3000     --random-range-ratio 0.1     --num-prompts 10    --max-concurrency 1

What should I pay attention to when using num-prompts, and how should the value be set?

The num-prompts parameter determines how many prompts a benchmark run sends. Looping a single-prompt run (for i in {1..10} … --num-prompts 1) and submitting one batch (--num-prompts 10) both issue 10 requests, but the latter hands all of them to the benchmark at once, which can let vLLM batch and schedule them more effectively and improve throughput and resource utilization. According to the official docs, increasing num-prompts generally improves measured performance until the hardware limit is reached.

It is recommended to set num-prompts to a fairly large value (e.g. 1000) so vLLM can exercise its batching and throughput optimizations, while making sure GPU memory limits are not exceeded. The practical maximum depends on the model size and available memory and is best found by repeated testing. Let me know if you need more detailed tuning advice.
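For example, a single larger run that also saves its results for later comparison could look like the sketch below (--save-result and --result-filename are flags that appear in the script's argument list in your logs; the num-prompts and max-concurrency values here are purely illustrative):

python3 /vllm-workspace/benchmarks/benchmark_serving.py \
    --backend openai-chat \
    --model /data/models/Qwen2.5-72B \
    --served-model-name Qwen2.5-72B \
    --endpoint /v1/chat/completions \
    --port 8080 \
    --dataset_name random \
    --random-input-len 7000 \
    --random-output-len 3000 \
    --random-range-ratio 0.1 \
    --num-prompts 200 \
    --max-concurrency 8 \
    --save-result \
    --result-filename qwen72b_in7000_out3000_c8.json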

Sources:

Results for approach 1: all 10 runs are very stable, with TTFT around 100 ms.

root@2f77277da063:/vllm-workspace# for i in {1..10}; do
> python3 /vllm-workspace/benchmarks/benchmark_serving.py     --backend openai-chat     --model /data/models/Qwen2.5-72B     --served-model-name Qwen2.5-72B     --endpoint /v1/chat/completions     --port 8080     --dataset_name random     --random-input-len 7000     --random-output-len 3000     --random-range-ratio 0.1     --num-prompts 1     --max-concurrency 1
> done
INFO 06-12 01:08:18 [__init__.py:239] Automatically detected platform cuda.
Namespace(backend='openai-chat', base_url=None, host='127.0.0.1', port=8080, endpoint='/v1/chat/completions', dataset_name='random', dataset_path=None, max_concurrency=1, model='/data/models/Qwen2.5-72B', tokenizer=None, use_beam_search=False, num_prompts=1, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=7000, random_output_len=3000, random_range_ratio=0.1, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name='Qwen2.5-72B', lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.48s/it]
============ Serving Benchmark Result ============
Successful requests:                     1         
Benchmark duration (s):                  3.48      
Total input tokens:                      6984      
Total generated tokens:                  76        
Request throughput (req/s):              0.29      
Output token throughput (tok/s):         21.81     
Total Token throughput (tok/s):          2026.39   
---------------Time to First Token----------------
Mean TTFT (ms):                          101.93    
Median TTFT (ms):                        101.93    
P99 TTFT (ms):                           101.93    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          45.07     
Median TPOT (ms):                        45.07     
P99 TPOT (ms):                           45.07     
---------------Inter-token Latency----------------
Mean ITL (ms):                           44.47     
Median ITL (ms):                         45.01     
P99 ITL (ms):                            46.96     
==================================================
INFO 06-12 01:08:37 [__init__.py:239] Automatically detected platform cuda.
Namespace(backend='openai-chat', base_url=None, host='127.0.0.1', port=8080, endpoint='/v1/chat/completions', dataset_name='random', dataset_path=None, max_concurrency=1, model='/data/models/Qwen2.5-72B', tokenizer=None, use_beam_search=False, num_prompts=1, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=7000, random_output_len=3000, random_range_ratio=0.1, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name='Qwen2.5-72B', lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.49s/it]
============ Serving Benchmark Result ============
Successful requests:                     1         
Benchmark duration (s):                  3.49      
Total input tokens:                      6984      
Total generated tokens:                  76        
Request throughput (req/s):              0.29      
Output token throughput (tok/s):         21.77     
Total Token throughput (tok/s):          2022.51   
---------------Time to First Token----------------
Mean TTFT (ms):                          106.18    
Median TTFT (ms):                        106.18    
P99 TTFT (ms):                           106.18    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          45.10     
Median TPOT (ms):                        45.10     
P99 TPOT (ms):                           45.10     
---------------Inter-token Latency----------------
Mean ITL (ms):                           44.51     
Median ITL (ms):                         44.95     
P99 ITL (ms):                            47.90     
==================================================
INFO 06-12 01:08:50 [__init__.py:239] Automatically detected platform cuda.
Namespace(backend='openai-chat', base_url=None, host='127.0.0.1', port=8080, endpoint='/v1/chat/completions', dataset_name='random', dataset_path=None, max_concurrency=1, model='/data/models/Qwen2.5-72B', tokenizer=None, use_beam_search=False, num_prompts=1, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=7000, random_output_len=3000, random_range_ratio=0.1, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name='Qwen2.5-72B', lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.48s/it]
============ Serving Benchmark Result ============
Successful requests:                     1         
Benchmark duration (s):                  3.48      
Total input tokens:                      6984      
Total generated tokens:                  76        
Request throughput (req/s):              0.29      
Output token throughput (tok/s):         21.82     
Total Token throughput (tok/s):          2026.85   
---------------Time to First Token----------------
Mean TTFT (ms):                          105.31    
Median TTFT (ms):                        105.31    
P99 TTFT (ms):                           105.31    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          45.02     
Median TPOT (ms):                        45.02     
P99 TPOT (ms):                           45.02     
---------------Inter-token Latency----------------
Mean ITL (ms):                           44.42     
Median ITL (ms):                         44.98     
P99 ITL (ms):                            45.55     
==================================================
INFO 06-12 01:09:04 [__init__.py:239] Automatically detected platform cuda.
Namespace(backend='openai-chat', base_url=None, host='127.0.0.1', port=8080, endpoint='/v1/chat/completions', dataset_name='random', dataset_path=None, max_concurrency=1, model='/data/models/Qwen2.5-72B', tokenizer=None, use_beam_search=False, num_prompts=1, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=7000, random_output_len=3000, random_range_ratio=0.1, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name='Qwen2.5-72B', lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.49s/it]
============ Serving Benchmark Result ============
Successful requests:                     1         
Benchmark duration (s):                  3.49      
Total input tokens:                      6984      
Total generated tokens:                  76        
Request throughput (req/s):              0.29      
Output token throughput (tok/s):         21.78     
Total Token throughput (tok/s):          2023.20   
---------------Time to First Token----------------
Mean TTFT (ms):                          103.25    
Median TTFT (ms):                        103.25    
P99 TTFT (ms):                           103.25    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          45.13     
Median TPOT (ms):                        45.13     
P99 TPOT (ms):                           45.13     
---------------Inter-token Latency----------------
Mean ITL (ms):                           44.54     
Median ITL (ms):                         45.06     
P99 ITL (ms):                            47.85     
==================================================
INFO 06-12 01:09:18 [__init__.py:239] Automatically detected platform cuda.
Namespace(backend='openai-chat', base_url=None, host='127.0.0.1', port=8080, endpoint='/v1/chat/completions', dataset_name='random', dataset_path=None, max_concurrency=1, model='/data/models/Qwen2.5-72B', tokenizer=None, use_beam_search=False, num_prompts=1, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=7000, random_output_len=3000, random_range_ratio=0.1, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name='Qwen2.5-72B', lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.49s/it]
============ Serving Benchmark Result ============
Successful requests:                     1         
Benchmark duration (s):                  3.49      
Total input tokens:                      6984      
Total generated tokens:                  76        
Request throughput (req/s):              0.29      
Output token throughput (tok/s):         21.79     
Total Token throughput (tok/s):          2024.62   
---------------Time to First Token----------------
Mean TTFT (ms):                          100.20    
Median TTFT (ms):                        100.20    
P99 TTFT (ms):                           100.20    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          45.13     
Median TPOT (ms):                        45.13     
P99 TPOT (ms):                           45.13     
---------------Inter-token Latency----------------
Mean ITL (ms):                           44.54     
Median ITL (ms):                         45.06     
P99 ITL (ms):                            46.50     
==================================================
INFO 06-12 01:09:32 [__init__.py:239] Automatically detected platform cuda.
Namespace(backend='openai-chat', base_url=None, host='127.0.0.1', port=8080, endpoint='/v1/chat/completions', dataset_name='random', dataset_path=None, max_concurrency=1, model='/data/models/Qwen2.5-72B', tokenizer=None, use_beam_search=False, num_prompts=1, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=7000, random_output_len=3000, random_range_ratio=0.1, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name='Qwen2.5-72B', lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.53s/it]
============ Serving Benchmark Result ============
Successful requests:                     1         
Benchmark duration (s):                  3.54      
Total input tokens:                      6984      
Total generated tokens:                  76        
Request throughput (req/s):              0.28      
Output token throughput (tok/s):         21.50     
Total Token throughput (tok/s):          1997.10   
---------------Time to First Token----------------
Mean TTFT (ms):                          112.50    
Median TTFT (ms):                        112.50    
P99 TTFT (ms):                           112.50    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          45.61     
Median TPOT (ms):                        45.61     
P99 TPOT (ms):                           45.61     
---------------Inter-token Latency----------------
Mean ITL (ms):                           45.01     
Median ITL (ms):                         45.05     
P99 ITL (ms):                            55.07     
==================================================
INFO 06-12 01:09:46 [__init__.py:239] Automatically detected platform cuda.
Namespace(backend='openai-chat', base_url=None, host='127.0.0.1', port=8080, endpoint='/v1/chat/completions', dataset_name='random', dataset_path=None, max_concurrency=1, model='/data/models/Qwen2.5-72B', tokenizer=None, use_beam_search=False, num_prompts=1, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=7000, random_output_len=3000, random_range_ratio=0.1, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name='Qwen2.5-72B', lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.49s/it]
============ Serving Benchmark Result ============
Successful requests:                     1         
Benchmark duration (s):                  3.49      
Total input tokens:                      6984      
Total generated tokens:                  76        
Request throughput (req/s):              0.29      
Output token throughput (tok/s):         21.79     
Total Token throughput (tok/s):          2024.17   
---------------Time to First Token----------------
Mean TTFT (ms):                          95.27     
Median TTFT (ms):                        95.27     
P99 TTFT (ms):                           95.27     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          45.21     
Median TPOT (ms):                        45.21     
P99 TPOT (ms):                           45.21     
---------------Inter-token Latency----------------
Mean ITL (ms):                           44.61     
Median ITL (ms):                         45.06     
P99 ITL (ms):                            48.05     
==================================================
INFO 06-12 01:10:00 [__init__.py:239] Automatically detected platform cuda.
Namespace(backend='openai-chat', base_url=None, host='127.0.0.1', port=8080, endpoint='/v1/chat/completions', dataset_name='random', dataset_path=None, max_concurrency=1, model='/data/models/Qwen2.5-72B', tokenizer=None, use_beam_search=False, num_prompts=1, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=7000, random_output_len=3000, random_range_ratio=0.1, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name='Qwen2.5-72B', lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.50s/it]
============ Serving Benchmark Result ============
Successful requests:                     1         
Benchmark duration (s):                  3.50      
Total input tokens:                      6984      
Total generated tokens:                  76        
Request throughput (req/s):              0.29      
Output token throughput (tok/s):         21.70     
Total Token throughput (tok/s):          2016.01   
---------------Time to First Token----------------
Mean TTFT (ms):                          108.49    
Median TTFT (ms):                        108.49    
P99 TTFT (ms):                           108.49    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          45.22     
Median TPOT (ms):                        45.22     
P99 TPOT (ms):                           45.22     
---------------Inter-token Latency----------------
Mean ITL (ms):                           44.63     
Median ITL (ms):                         45.11     
P99 ITL (ms):                            47.00     
==================================================
INFO 06-12 01:10:14 [__init__.py:239] Automatically detected platform cuda.
Namespace(backend='openai-chat', base_url=None, host='127.0.0.1', port=8080, endpoint='/v1/chat/completions', dataset_name='random', dataset_path=None, max_concurrency=1, model='/data/models/Qwen2.5-72B', tokenizer=None, use_beam_search=False, num_prompts=1, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=7000, random_output_len=3000, random_range_ratio=0.1, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name='Qwen2.5-72B', lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.49s/it]
============ Serving Benchmark Result ============
Successful requests:                     1         
Benchmark duration (s):                  3.49      
Total input tokens:                      6984      
Total generated tokens:                  76        
Request throughput (req/s):              0.29      
Output token throughput (tok/s):         21.78     
Total Token throughput (tok/s):          2023.67   
---------------Time to First Token----------------
Mean TTFT (ms):                          103.82    
Median TTFT (ms):                        103.82    
P99 TTFT (ms):                           103.82    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          45.11     
Median TPOT (ms):                        45.11     
P99 TPOT (ms):                           45.11     
---------------Inter-token Latency----------------
Mean ITL (ms):                           44.51     
Median ITL (ms):                         45.05     
P99 ITL (ms):                            45.78     
==================================================
INFO 06-12 01:10:27 [__init__.py:239] Automatically detected platform cuda.
Namespace(backend='openai-chat', base_url=None, host='127.0.0.1', port=8080, endpoint='/v1/chat/completions', dataset_name='random', dataset_path=None, max_concurrency=1, model='/data/models/Qwen2.5-72B', tokenizer=None, use_beam_search=False, num_prompts=1, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=7000, random_output_len=3000, random_range_ratio=0.1, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name='Qwen2.5-72B', lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.49s/it]
============ Serving Benchmark Result ============
Successful requests:                     1         
Benchmark duration (s):                  3.49      
Total input tokens:                      6984      
Total generated tokens:                  76        
Request throughput (req/s):              0.29      
Output token throughput (tok/s):         21.75     
Total Token throughput (tok/s):          2020.42   
---------------Time to First Token----------------
Mean TTFT (ms):                          99.14     
Median TTFT (ms):                        99.14     
P99 TTFT (ms):                           99.14     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          45.24     
Median TPOT (ms):                        45.24     
P99 TPOT (ms):                           45.24     
---------------Inter-token Latency----------------
Mean ITL (ms):                           44.65     
Median ITL (ms):                         45.08     
P99 ITL (ms):                            47.79     
==================================================
root@2f77277da063:/vllm-workspace# 

Results for approach 2: TTFT has reached 5000+ ms.

root@2f77277da063:/vllm-workspace# python3 /vllm-workspace/benchmarks/benchmark_serving.py     --backend openai-chat     --model /data/models/Qwen2.5-72B     --served-model-name Qwen2.5-72B     --endpoint /v1/chat/completions     --port 8080     --dataset_name random     --random-input-len 7000     --random-output-len 3000     --random-range-ratio 0.1     --num-prompts 10     --max-concurrency 1
INFO 06-12 01:14:01 [__init__.py:239] Automatically detected platform cuda.
Namespace(backend='openai-chat', base_url=None, host='127.0.0.1', port=8080, endpoint='/v1/chat/completions', dataset_name='random', dataset_path=None, max_concurrency=1, model='/data/models/Qwen2.5-72B', tokenizer=None, use_beam_search=False, num_prompts=10, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=7000, random_output_len=3000, random_range_ratio=0.1, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name='Qwen2.5-72B', lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [01:33<00:00,  9.36s/it]
============ Serving Benchmark Result ============
Successful requests:                     10        
Benchmark duration (s):                  93.60     
Total input tokens:                      71443     
Total generated tokens:                  931       
Request throughput (req/s):              0.11      
Output token throughput (tok/s):         9.95      
Total Token throughput (tok/s):          773.21    
---------------Time to First Token----------------
Mean TTFT (ms):                          5188.91   
Median TTFT (ms):                        5623.70   
P99 TTFT (ms):                           6942.57   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          45.27     
Median TPOT (ms):                        45.23     
P99 TPOT (ms):                           45.53     
---------------Inter-token Latency----------------
Mean ITL (ms):                           44.79     
Median ITL (ms):                         45.14     
P99 ITL (ms):                            48.30     
==================================================
root@2f77277da063:/vllm-workspace# 

With concurrency set to 1, are multiple requests executed strictly one after another (the next request is issued only after the previous one finishes)?

Yes. When you set --max-concurrency 1, requests to vLLM's OpenAI-compatible API are processed strictly serially: the next request starts only after the current one has completely finished. With --num-prompts 10, the 10 requests are queued and handled one by one, so the later requests show a very high TTFT (Time To First Token), such as the 5000+ ms you observed; this is essentially time spent waiting in the queue.

In this mode requests never run concurrently: each one starts only after the previous one ends. That is similar to your for loop of 10 runs with num-prompts=1, except that the loop measures only one request per run, so TTFT stays low because there is no queueing.
If you want concurrent processing and lower TTFT, raise --max-concurrency so that multiple requests can be scheduled and prefilled at the same time, making full use of vLLM's batching and concurrency; see the official docs for the detailed mechanism.
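For example, a sketch of the same command with a higher concurrency limit (the value 10 is only illustrative):

python3 /vllm-workspace/benchmarks/benchmark_serving.py \
    --backend openai-chat \
    --model /data/models/Qwen2.5-72B \
    --served-model-name Qwen2.5-72B \
    --endpoint /v1/chat/completions \
    --port 8080 \
    --dataset_name random \
    --random-input-len 7000 \
    --random-output-len 3000 \
    --random-range-ratio 0.1 \
    --num-prompts 10 \
    --max-concurrency 10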

Do you need more detailed concurrency tuning advice?

Sources:

When testing TTFT at different concurrency levels: if the number of requests is greater than the concurrency, requests queue up and TTFT grows; if the number of requests equals the concurrency, does each concurrency slot map to exactly one request, giving a reasonable TTFT? How should the number of requests and the concurrency be set?

Your understanding is correct: if the number of requests (num-prompts) is greater than the concurrency (max-concurrency), requests queue, and the TTFT (first-token latency) of later requests grows because they must wait for earlier requests to finish. If the number of requests equals the concurrency, every request can be scheduled immediately, TTFT is minimal and uniform, each request makes full use of vLLM's batching, and the TTFT figures are the most representative. related discussion

Recommendations:

  • Set the concurrency (max-concurrency) to the maximum number of requests you expect to serve simultaneously, bounded by GPU memory and hardware resources.
  • For testing, the number of requests (num-prompts) can be equal to or greater than the concurrency, which makes queueing and scheduling effects easy to observe.
  • For production-like settings, keep max-concurrency slightly below the GPU's limit to preserve stability and low TTFT.
    Let me know if you need more detailed tuning advice or code examples, such as the sweep sketched below.
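A minimal sketch of such a sweep, keeping --num-prompts equal to --max-concurrency at each step (the list of concurrency values is only illustrative):

for c in 1 2 4 8; do
  python3 /vllm-workspace/benchmarks/benchmark_serving.py \
      --backend openai-chat \
      --model /data/models/Qwen2.5-72B \
      --served-model-name Qwen2.5-72B \
      --endpoint /v1/chat/completions \
      --port 8080 \
      --dataset_name random \
      --random-input-len 7000 \
      --random-output-len 3000 \
      --random-range-ratio 0.1 \
      --num-prompts "$c" \
      --max-concurrency "$c"
done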

Sources:

Running the same command with input 3000 and output 1000 several times in a row: the TTFT of runs 2 through N is roughly identical and clearly better than that of run 1.

root@2f77277da063:/vllm-workspace# python3 /vllm-workspace/benchmarks/benchmark_serving.py     --backend openai-chat     --model /data/models/Qwen2.5-72B     --served-model-name Qwen2.5-72B     --endpoint /v1/chat/completions     --port 8080     --dataset_name random     --random-input-len 3000     --random-output-len 1000     --random-range-ratio 0.1     --num-prompts 10     --max-concurrency 1
INFO 06-12 01:30:34 [__init__.py:239] Automatically detected platform cuda.
Namespace(backend='openai-chat', base_url=None, host='127.0.0.1', port=8080, endpoint='/v1/chat/completions', dataset_name='random', dataset_path=None, max_concurrency=1, model='/data/models/Qwen2.5-72B', tokenizer=None, use_beam_search=False, num_prompts=10, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=3000, random_output_len=1000, random_range_ratio=0.1, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name='Qwen2.5-72B', lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:49<00:00,  4.95s/it]
============ Serving Benchmark Result ============
Successful requests:                     10        
Benchmark duration (s):                  49.52     
Total input tokens:                      30533     
Total generated tokens:                  1048      
Request throughput (req/s):              0.20      
Output token throughput (tok/s):         21.16     
Total Token throughput (tok/s):          637.72    
---------------Time to First Token----------------
Mean TTFT (ms):                          315.81    
Median TTFT (ms):                        104.88    
P99 TTFT (ms):                           2034.00   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          44.68     
Median TPOT (ms):                        44.66     
P99 TPOT (ms):                           44.93     
---------------Inter-token Latency----------------
Mean ITL (ms):                           44.23     
Median ITL (ms):                         44.55     
P99 ITL (ms):                            47.28     
==================================================
root@2f77277da063:/vllm-workspace# 
root@2f77277da063:/vllm-workspace# 
root@2f77277da063:/vllm-workspace# 
root@2f77277da063:/vllm-workspace# python3 /vllm-workspace/benchmarks/benchmark_serving.py     --backend openai-chat     --model /data/models/Qwen2.5-72B     --served-model-name Qwen2.5-72B     --endpoint /v1/chat/completions     --port 8080     --dataset_name random     --random-input-len 3000     --random-output-len 1000     --random-range-ratio 0.1     --num-prompts 10     --max-concurrency 1
INFO 06-12 01:31:42 [__init__.py:239] Automatically detected platform cuda.
Namespace(backend='openai-chat', base_url=None, host='127.0.0.1', port=8080, endpoint='/v1/chat/completions', dataset_name='random', dataset_path=None, max_concurrency=1, model='/data/models/Qwen2.5-72B', tokenizer=None, use_beam_search=False, num_prompts=10, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=3000, random_output_len=1000, random_range_ratio=0.1, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name='Qwen2.5-72B', lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:46<00:00,  4.66s/it]
============ Serving Benchmark Result ============
Successful requests:                     10        
Benchmark duration (s):                  46.64     
Total input tokens:                      30533     
Total generated tokens:                  1034      
Request throughput (req/s):              0.21      
Output token throughput (tok/s):         22.17     
Total Token throughput (tok/s):          676.80    
---------------Time to First Token----------------
Mean TTFT (ms):                          89.32     
Median TTFT (ms):                        89.74     
P99 TTFT (ms):                           116.36    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          44.68     
Median TPOT (ms):                        44.67     
P99 TPOT (ms):                           44.87     
---------------Inter-token Latency----------------
Mean ITL (ms):                           44.23     
Median ITL (ms):                         44.58     
P99 ITL (ms):                            46.35     
==================================================
root@2f77277da063:/vllm-workspace# 

Running the same command with input 7000 and output 3000 several times in a row: TTFT is large in every run, and runs 2 through N show no improvement over run 1.

root@2f77277da063:/vllm-workspace# python3 /vllm-workspace/benchmarks/benchmark_serving.py     --backend openai-chat     --model /data/models/Qwen2.5-72B     --served-model-name Qwen2.5-72B     --endpoint /v1/chat/completions     --port 8080     --dataset_name random     --random-input-len 7000     --random-output-len 3000     --random-range-ratio 0.1     --num-prompts 10     --max-concurrency 1
INFO 06-12 01:35:13 [__init__.py:239] Automatically detected platform cuda.
Namespace(backend='openai-chat', base_url=None, host='127.0.0.1', port=8080, endpoint='/v1/chat/completions', dataset_name='random', dataset_path=None, max_concurrency=1, model='/data/models/Qwen2.5-72B', tokenizer=None, use_beam_search=False, num_prompts=10, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=7000, random_output_len=3000, random_range_ratio=0.1, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name='Qwen2.5-72B', lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [01:15<00:00,  7.54s/it]
============ Serving Benchmark Result ============
Successful requests:                     10        
Benchmark duration (s):                  75.38     
Total input tokens:                      71443     
Total generated tokens:                  895       
Request throughput (req/s):              0.13      
Output token throughput (tok/s):         11.87     
Total Token throughput (tok/s):          959.60    
---------------Time to First Token----------------
Mean TTFT (ms):                          3545.60   
Median TTFT (ms):                        3496.10   
P99 TTFT (ms):                           5686.57   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          45.11     
Median TPOT (ms):                        45.08     
P99 TPOT (ms):                           45.33     
---------------Inter-token Latency----------------
Mean ITL (ms):                           44.60     
Median ITL (ms):                         45.01     
P99 ITL (ms):                            46.92     
==================================================
root@2f77277da063:/vllm-workspace# 
root@2f77277da063:/vllm-workspace# 
root@2f77277da063:/vllm-workspace# python3 /vllm-workspace/benchmarks/benchmark_serving.py     --backend openai-chat     --model /data/models/Qwen2.5-72B     --served-model-name Qwen2.5-72B     --endpoint /v1/chat/completions     --port 8080     --dataset_name random     --random-input-len 7000     --random-output-len 3000     --random-range-ratio 0.1     --num-prompts 10     --max-concurrency 1
INFO 06-12 01:36:48 [__init__.py:239] Automatically detected platform cuda.
Namespace(backend='openai-chat', base_url=None, host='127.0.0.1', port=8080, endpoint='/v1/chat/completions', dataset_name='random', dataset_path=None, max_concurrency=1, model='/data/models/Qwen2.5-72B', tokenizer=None, use_beam_search=False, num_prompts=10, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=7000, random_output_len=3000, random_range_ratio=0.1, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name='Qwen2.5-72B', lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [01:33<00:00,  9.36s/it]
============ Serving Benchmark Result ============
Successful requests:                     10        
Benchmark duration (s):                  93.56     
Total input tokens:                      71443     
Total generated tokens:                  931       
Request throughput (req/s):              0.11      
Output token throughput (tok/s):         9.95      
Total Token throughput (tok/s):          773.52    
---------------Time to First Token----------------
Mean TTFT (ms):                          5200.45   
Median TTFT (ms):                        5637.88   
P99 TTFT (ms):                           6967.19   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          45.13     
Median TPOT (ms):                        45.09     
P99 TPOT (ms):                           45.53     
---------------Inter-token Latency----------------
Mean ITL (ms):                           44.63     
Median ITL (ms):                         45.03     
P99 ITL (ms):                            46.49     
==================================================
root@2f77277da063:/vllm-workspace# 
root@2f77277da063:/vllm-workspace# 
root@2f77277da063:/vllm-workspace# python3 /vllm-workspace/benchmarks/benchmark_serving.py     --backend openai-chat     --model /data/models/Qwen2.5-72B     --served-model-name Qwen2.5-72B     --endpoint /v1/chat/completions     --port 8080     --dataset_name random     --random-input-len 7000     --random-output-len 3000     --random-range-ratio 0.1     --num-prompts 10     --max-concurrency 1
INFO 06-12 01:39:31 [__init__.py:239] Automatically detected platform cuda.
Namespace(backend='openai-chat', base_url=None, host='127.0.0.1', port=8080, endpoint='/v1/chat/completions', dataset_name='random', dataset_path=None, max_concurrency=1, model='/data/models/Qwen2.5-72B', tokenizer=None, use_beam_search=False, num_prompts=10, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=7000, random_output_len=3000, random_range_ratio=0.1, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name='Qwen2.5-72B', lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [01:33<00:00,  9.36s/it]
============ Serving Benchmark Result ============
Successful requests:                     10        
Benchmark duration (s):                  93.60     
Total input tokens:                      71443     
Total generated tokens:                  931       
Request throughput (req/s):              0.11      
Output token throughput (tok/s):         9.95      
Total Token throughput (tok/s):          773.21    
---------------Time to First Token----------------
Mean TTFT (ms):                          5202.14   
Median TTFT (ms):                        5639.77   
P99 TTFT (ms):                           6966.44   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          45.15     
Median TPOT (ms):                        45.09     
P99 TPOT (ms):                           45.41     
---------------Inter-token Latency----------------
Mean ITL (ms):                           44.65     
Median ITL (ms):                         45.05     
P99 ITL (ms):                            46.49     
==================================================
root@2f77277da063:/vllm-workspace# 

What you observed: with input 3000 / output 1000, repeated runs show a clear TTFT (first-token latency) improvement from run 2 onwards; with input 7000 / output 3000, repeated runs show a consistently large TTFT with no improvement. This is typical vLLM behavior, mainly for the following reasons:

  1. For shorter inputs (around 3000 tokens), vLLM's prefix caching keeps the KV blocks computed during the first run, and later runs with identical or similar prompts hit that cache directly, dramatically lowering TTFT (the clear drop you see from run 2 onwards). related discussion
  2. For very long inputs (around 7000 tokens), KV cache space is limited or the cached blocks are not hit, so prefix caching has little effect and every run must do a full prefill, keeping TTFT high. official docs

Let me know if you need further analysis of cache hit rates, KV cache settings, or tuning advice.
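One related detail: all of your runs use the default --seed 0 (visible in the Namespace output), so the random dataset should generate the same prompts in every run (the identical "Total input tokens" across runs is consistent with this), which is exactly what makes cross-run prefix-cache hits possible. A minimal sketch that varies the seed per run, assuming you want each run to measure cold prefill instead of cache hits:

for seed in 1 2 3; do
  python3 /vllm-workspace/benchmarks/benchmark_serving.py \
      --backend openai-chat \
      --model /data/models/Qwen2.5-72B \
      --served-model-name Qwen2.5-72B \
      --endpoint /v1/chat/completions \
      --port 8080 \
      --dataset_name random \
      --random-input-len 3000 \
      --random-output-len 1000 \
      --random-range-ratio 0.1 \
      --num-prompts 10 \
      --max-concurrency 1 \
      --seed "$seed"
done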

Sources:

How can I check the effect of vLLM's prefix caching?

I can see the following output in the server-side logs:

INFO 06-12 01:40:15 [loggers.py:111] Engine 000: Avg prompt throughput: 744.2 tokens/s, Avg generation throughput: 10.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.0%, Prefix cache hit rate: 41.1%

How should this data be interpreted? Is higher better?

The "Prefix cache hit rate: 41.1%" you see is vLLM's prefix caching hit rate: of the most recent ~1k KV-cache block queries, 41.1% hit the cache directly and skipped recomputation. The higher the hit rate, the better prefix caching is working, and the bigger the improvements to TTFT and throughput. official docs

In general, a high hit rate (e.g. above 50%) means many requests share prefixes and the cache is well utilized; a low hit rate means request prefixes differ widely or cache space is insufficient. You can improve the hit rate by adjusting the workload or the cache-related parameters. Let me know if you need further analysis or Prometheus-based monitoring.
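A quick way to watch these counters outside the log line is to query the server's /metrics endpoint (a sketch; the exact metric names vary between vLLM versions, so grepping for "prefix" is the safest way to find them):

curl -s http://localhost:8080/metrics | grep -i prefix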

Sources:

A few questions:
1. At concurrency 1 the Total Token throughput (tok/s) is 1006.16, at concurrency 5 it is 479.70 // throughput at concurrency 5 is actually lower
2. The specified output length is 1000, but at concurrency 1 the result shows Total generated tokens: 86 // the specified output length seems to have no effect

Concurrency 1:

root@2f77277da063:/vllm-workspace# python3 /vllm-workspace/benchmarks/benchmark_serving.py     --backend openai-chat     --model /data/models/Qwen2.5-72B     --served-model-name Qwen2.5-72B     --endpoint /v1/chat/completions     --port 8080     --dataset_name random     --random-input-len 3500     --random-output-len 1000     --random-range-ratio 0.1     --num-prompts 1     --max-concurrency 1
INFO 06-12 02:11:07 [__init__.py:239] Automatically detected platform cuda.
Namespace(backend='openai-chat', base_url=None, host='127.0.0.1', port=8080, endpoint='/v1/chat/completions', dataset_name='random', dataset_path=None, max_concurrency=1, model='/data/models/Qwen2.5-72B', tokenizer=None, use_beam_search=False, num_prompts=1, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=3500, random_output_len=1000, random_range_ratio=0.1, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name='Qwen2.5-72B', lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.90s/it]
============ Serving Benchmark Result ============
Successful requests:                     1         
Benchmark duration (s):                  3.90      
Total input tokens:                      3834      
Total generated tokens:                  86        
Request throughput (req/s):              0.26      
Output token throughput (tok/s):         22.07     
Total Token throughput (tok/s):          1006.16   
---------------Time to First Token----------------
Mean TTFT (ms):                          81.52     
Median TTFT (ms):                        81.52     
P99 TTFT (ms):                           81.52     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          44.85     
Median TPOT (ms):                        44.85     
P99 TPOT (ms):                           44.85     
---------------Inter-token Latency----------------
Mean ITL (ms):                           44.33     
Median ITL (ms):                         44.71     
P99 ITL (ms):                            49.85     
==================================================
root@2f77277da063:/vllm-workspace# 

Concurrency 5:

root@2f77277da063:/vllm-workspace# python3 /vllm-workspace/benchmarks/benchmark_serving.py     --backend openai-chat     --model /data/models/Qwen2.5-72B     --served-model-name Qwen2.5-72B     --endpoint /v1/chat/completions     --port 8080     --dataset_name random     --random-input-len 3500     --random-output-len 1000     --random-range-ratio 0.1     --num-prompts 5     --max-concurrency 5
INFO 06-12 02:16:29 [__init__.py:239] Automatically detected platform cuda.
Namespace(backend='openai-chat', base_url=None, host='127.0.0.1', port=8080, endpoint='/v1/chat/completions', dataset_name='random', dataset_path=None, max_concurrency=5, model='/data/models/Qwen2.5-72B', tokenizer=None, use_beam_search=False, num_prompts=5, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=3500, random_output_len=1000, random_range_ratio=0.1, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name='Qwen2.5-72B', lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 5
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:40<00:00,  8.19s/it]
============ Serving Benchmark Result ============
Successful requests:                     5         
Benchmark duration (s):                  40.94     
Total input tokens:                      18173     
Total generated tokens:                  1468      
Request throughput (req/s):              0.12      
Output token throughput (tok/s):         35.85     
Total Token throughput (tok/s):          479.70    
---------------Time to First Token----------------
Mean TTFT (ms):                          176.67    
Median TTFT (ms):                        177.65    
P99 TTFT (ms):                           238.80    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          50.11     
Median TPOT (ms):                        50.94     
P99 TPOT (ms):                           52.04     
---------------Inter-token Latency----------------
Mean ITL (ms):                           47.76     
Median ITL (ms):                         47.60     
P99 ITL (ms):                            52.29     
==================================================
root@2f77277da063:/vllm-workspace# 
  1. Higher concurrency but lower Total Token throughput: note that Output token throughput actually rose (22.07 → 35.85 tok/s). Total Token throughput also counts input tokens: the concurrency-1 run generated only 86 tokens, so its 3.9 s duration is dominated by a single prefill and the 3834 input tokens inflate the figure, while the concurrency-5 run generated far more output (1468 tokens over 40.9 s), so it is decode-dominated and its input tokens are spread over a much longer duration. More generally, GPU utilization does not scale linearly with concurrency, and scheduling or memory bottlenecks can cap throughput, especially with long inputs/outputs, limited GPU memory, or an oversized batch. related discussion

  2. Output length set to 1000 but only 86 tokens generated: --random-output-len is only an upper bound on the output length. The actual number of generated tokens depends on the model's generation behavior, early stopping (e.g. emitting an EOS token), and the sampling parameters, so it usually will not be exactly 1000. You can check whether generation stopped early at EOS or another stop condition, or force full-length generation with --ignore-eos (see the sketch after this list). official docs
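A minimal sketch of forcing full-length output, assuming your vLLM server accepts the ignore_eos extra parameter (the benchmark script exposes the flag, as shown by ignore_eos=False in the Namespace dump above):

python3 /vllm-workspace/benchmarks/benchmark_serving.py \
    --backend openai-chat \
    --model /data/models/Qwen2.5-72B \
    --served-model-name Qwen2.5-72B \
    --endpoint /v1/chat/completions \
    --port 8080 \
    --dataset_name random \
    --random-input-len 3500 \
    --random-output-len 1000 \
    --random-range-ratio 0.1 \
    --num-prompts 5 \
    --max-concurrency 5 \
    --ignore-eos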

Do you need more detailed tuning advice or parameter explanations?

Sources:

If the benchmark command sets the number of requests and the maximum concurrency to the same value (--num-prompts = --max-concurrency), can --max-concurrency be omitted?

It is better not to omit it. --num-prompts is only the total number of requests the benchmark sends, while --max-concurrency caps how many requests are in flight at the same time. If --max-concurrency is omitted, the script applies no concurrency limit and the effective concurrency is governed only by the request rate (with the default --request-rate inf, all prompts are fired at once); with num-prompts equal to the intended concurrency that happens to produce the same behavior, but setting --max-concurrency explicitly keeps the concurrency controlled and reproducible even if the prompt count or request rate changes. official parameter docs

Recommendation: if you need precise control over concurrency, always set --max-concurrency explicitly so the number of in-flight requests is exactly what you intend.

Sources:

How can I obtain the actual concurrency during the benchmark, rather than the configured maximum concurrency?

The vLLM server log periodically prints "Running: X reqs", where X is the number of requests actually running at that moment, not the configured maximum. Watching how this field changes over time gives you the real concurrency at each point. official metrics docs

For programmatic collection, the Prometheus endpoint exposes the vllm:num_requests_running metric, which reflects the number of requests currently running.
Do you need concrete collection code or a Prometheus configuration example?
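A quick ad-hoc check without Prometheus is to poll the /metrics endpoint directly (a sketch, assuming the server from your logs listening on port 8080):

while true; do curl -s http://localhost:8080/metrics | grep num_requests_running; sleep 1; done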

Sources:

OK, please share a configuration example. Is a separate exporter needed?

vLLM exposes metrics such as the actual concurrency through a built-in Prometheus exporter: once the vLLM server is running, the /metrics path serves Prometheus-format data automatically, so no separate exporter needs to be deployed. official docs

An example Prometheus configuration (prometheus.yml) for scraping the vLLM metrics:

scrape_configs:
  - job_name: 'vllm'
    static_configs:
      - targets: ['localhost:8000']  # replace with your vLLM server address (e.g. localhost:8080 in your setup)

Grafana can add the Prometheus data source directly and import the official dashboard; no additional exporter is required. official setup guide
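Once Prometheus is scraping, the collected concurrency series can also be pulled programmatically through the Prometheus HTTP API (a sketch, assuming Prometheus runs on its default port 9090):

curl -s 'http://localhost:9090/api/v1/query?query=vllm:num_requests_running'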

Sources:

[Bugfix] Disable prefix caching by default for benchmark by cascade812 · Pull Request #18771 · vllm-project/vllm · GitHub: when running benchmarks, is it necessary to disable the prefix cache? What is the impact of enabling or disabling it?

When benchmarking, it is recommended to disable the prefix cache; otherwise the latency numbers are distorted and do not reflect the real inference performance. The vLLM project made this explicit in PR #18771: the prefix cache makes benchmark results (such as TTFT) look artificially good, so it should be explicitly disabled to obtain an accurate baseline.

Enabling the prefix cache significantly lowers TTFT and latency for requests with repeated prefixes, but that reflects the speedup of cache-hit scenarios rather than the model's underlying inference speed. Disabling it measures cold-start or worst-case performance, which is a more meaningful baseline.
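If you want to benchmark that cold path, one option is to start the server with prefix caching turned off, roughly as sketched below (the exact flag spelling depends on your vLLM version; where prefix caching is on by default the boolean option can usually be negated as shown, so check vllm serve --help for your build):

vllm serve /data/models/Qwen2.5-72B \
    --served-model-name Qwen2.5-72B \
    --port 8080 \
    --no-enable-prefix-caching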

Sources: