The following two approaches produce very different results

XiaoDouGeGe · June 13, 2025, 1:58am

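When running benchmark load tests, what configuration is recommended besides disabling the prefix cache? Which settings belong on the model server, and which on the benchmark client?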

RunLLM · June 13, 2025, 1:58am

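As I understand it, besides disabling the prefix cache, vLLM's official guidance for benchmark load testing also calls for attention to the following settings:

Model server: disable request logging (--disable-log-requests) and set appropriate values for --gpu-memory-utilization (for example 0.9), --max-model-len, and --max-num-seqs, so that resources are fully used and logging does not interfere. [Official recommendation]
Benchmark client: set --num-prompts, --request-rate, --burstiness, and --max-concurrency to simulate different load scenarios. --profile enables profiling, but it is intended only for development tuning. [Benchmark parameter reference]

If you need a detailed parameter list or recommended combinations, let me know.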

Sources:

benchmark_serving.py parameter reference

maodou-ma · June 13, 2025, 3:19am

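1. Please provide the detailed parameters and recommended combinations.
2. The server log shows counts of Running and Waiting requests. What causes a request to be in the Waiting state? Is it that the total number of requests exceeds the server's max-len?
3. Earlier it was said that runs 2 to N outperform run 1 (with prefix cache enabled) because of the prefix cache. Does that mean the first run with prefix cache enabled is equivalent to a run with prefix cache disabled?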

RunLLM · June 13, 2025, 3:20am

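Recommended parameter combination:
Server side:

--no-enable-prefix-caching (disables the prefix cache so that uncached performance is measured)
--disable-log-requests (turns off per-request logging to reduce interference)
--gpu-memory-utilization 0.9 (high utilization while leaving headroom)
--max-model-len (for example 4096/8192/16384, according to business needs)
--max-num-seqs (for example 256/512; raising it moderately improves batching)
Client side:
--num-prompts (for example 128/256/512, to simulate high load)
--request-rate (for example inf or a specific QPS, to simulate different scenarios)
--max-concurrency (for example 8/16/32, adjusted to GPU memory and model size; keep it at or slightly below the server's --max-num-seqs)
See the official documentation for detailed parameter descriptions.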
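The split between server-side and client-side flags can be written down as data and turned into command lines. The following is only a sketch: the helper `build_command` is hypothetical, the flag names are the ones mentioned in this thread (they may differ between vLLM versions), and every value is a placeholder rather than a recommendation.

```python
import shlex


def build_command(executable, flags):
    """Flatten a flag mapping into an argv list; a value of None marks a bare switch."""
    argv = list(executable)
    for name, value in flags.items():
        argv.append(name)
        if value is not None:
            argv.append(str(value))
    return argv


# Server-side flags named in this thread (values are placeholders).
server_flags = {
    "--disable-log-requests": None,
    "--gpu-memory-utilization": 0.9,
    "--max-model-len": 16384,
    "--max-num-seqs": 256,
}

# Client-side flags named in this thread (values are placeholders).
client_flags = {
    "--dataset_name": "random",
    "--random-input-len": 3500,
    "--random-output-len": 1000,
    "--num-prompts": 128,
    "--request-rate": "inf",
    "--max-concurrency": 16,
}

server_cmd = build_command(["vllm", "serve", "/data/models/Qwen2.5-72B"], server_flags)
client_cmd = build_command(
    ["python3", "/vllm-workspace/benchmarks/benchmark_serving.py"], client_flags
)

# Print shell-safe command lines for review before running anything.
print(shlex.join(server_cmd))
print(shlex.join(client_cmd))
```

Keeping the two flag sets in separate mappings makes it harder to pass a client-only option such as --max-concurrency to the server by mistake.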

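Causes of the Waiting state:
Waiting in the log means a request has arrived but has not yet been scheduled for execution. Common causes:

The number of running sequences has reached the --max-num-seqs limit
Insufficient KV cache space (for example when the tokens of in-flight requests exceed what GPU memory allows)
Queuing caused by the server's scheduling policy and priorities
A single request whose token count exceeds --max-model-len is rejected rather than queued. [Related notes]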

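Effect of the prefix cache:
Yes. With the prefix cache enabled, the first inference performs a full prefill, and later requests with the same or a similar prefix can hit the cache directly, so TTFT drops significantly. The first run is therefore equivalent to running with the prefix cache disabled, while later runs reflect the cache speedup. [Official notes]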
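The hit-or-miss behavior described above can be illustrated with a toy model. This is a minimal sketch, not vLLM's implementation: the class `ToyPrefixCache`, the block size of 16 tokens, and the chained-hash scheme are all assumptions made for illustration. It only counts how many leading tokens of a prompt could be served from previously seen blocks.

```python
BLOCK_SIZE = 16  # tokens per cache block (assumed value for illustration)


class ToyPrefixCache:
    """Remembers which full token blocks have been prefilled before."""

    def __init__(self):
        self._seen = set()

    def prefill(self, tokens):
        """Return how many leading tokens hit the cache, then record all full blocks."""
        cached = 0
        still_matching = True
        prev_hash = None
        full = len(tokens) - len(tokens) % BLOCK_SIZE
        for start in range(0, full, BLOCK_SIZE):
            block = tuple(tokens[start:start + BLOCK_SIZE])
            # Chain the hash so a block only matches when everything before it matched too.
            prev_hash = hash((prev_hash, block))
            if still_matching and prev_hash in self._seen:
                cached = start + BLOCK_SIZE
            else:
                still_matching = False
            self._seen.add(prev_hash)
        return cached


cache = ToyPrefixCache()
prompt = list(range(100))

first = cache.prefill(prompt)                          # nothing cached yet
second = cache.prefill(prompt)                         # identical prompt again
unrelated = cache.prefill(list(range(50, 150)))        # different first block
shared = cache.prefill(list(range(32)) + [999] * 40)   # shares only the first two blocks

print(first, second, unrelated, shared)
```

In this toy model a prompt benefits only as far as its leading blocks match something seen earlier, which is why identical repeated prompts benefit the most.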

Sources:

Performance tuning and the Waiting mechanism

maodou-ma · June 13, 2025, 3:42am

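The dataset in the benchmark command is random (--dataset_name random). With input 7000 and output 1000, TTFT in runs 2 to N is better than in run 1. After changing to input 7000 and output 3000, TTFT is already good in run 1, as if the prefix cache effect from the 7000/1000 runs had carried over. But after changing to input 10000 and output 3000, TTFT in run 1 is very poor. It looks like the input clearly affects how well the prefix cache works, yet the input dataset is random, so how can random data have such a large effect? Please explain in detail what the dataset's data looks like and how it is used, and how input, output, dataset, and prefix cache affect one another.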

RunLLM · June 13, 2025, 3:42am

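Your understanding is correct: the TTFT improvement from the prefix cache depends heavily on how much the input data repeats.

Dataset mechanism: with --dataset_name random, the vLLM benchmark script generates a random input (prompt) for each request. The token content of each prompt is sampled randomly, so there is almost no repetition. [Code reference]
Relationship between prefix cache and input length:

When the input is shorter (for example 7000), even with random data some token fragments may occasionally repeat, so the prefix cache has a small chance of hitting and TTFT drops slightly for later requests.
As the input gets longer (for example 10000), randomness increases, the probability of repeated fragments becomes extremely low, the prefix cache almost never hits, and TTFT stays high.
If you use a fixed dataset (such as ShareGPT or Sonnet) and the request content repeats heavily across runs, the prefix cache hit rate rises sharply and TTFT drops significantly. [Official notes]

Dataset examples:

In random mode, the input looks like [123, 456, 789, ...] (different for every request)
With datasets such as sharegpt/sonnet, the input is mostly real conversations or fixed text, with high repetition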
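Whether two separate benchmark runs see different random prompts depends on how the generator is seeded. The sketch below is not the benchmark script's actual code: `random_prompts` is a hypothetical helper built on Python's standard `random` module, and the vocabulary size is a placeholder. It shows that prompts differ from request to request within a run, while two runs with the same seed reproduce the same prompts. The Namespace output later in this thread shows seed=0, and the four 12000/1000 runs each report 12453 total input tokens, which is what one would expect if the same seed reproduces the same prompt.

```python
import random


def random_prompts(num_prompts, input_len, vocab_size, seed):
    """Sample token-id prompts from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [
        [rng.randrange(vocab_size) for _ in range(input_len)]
        for _ in range(num_prompts)
    ]


run_a = random_prompts(num_prompts=2, input_len=8, vocab_size=32000, seed=0)
run_b = random_prompts(num_prompts=2, input_len=8, vocab_size=32000, seed=0)
run_c = random_prompts(num_prompts=2, input_len=8, vocab_size=32000, seed=1)

print("same seed, same prompts:", run_a == run_b)
print("different seed, same prompts:", run_a == run_c)
print("two requests in one run identical:", run_a[0] == run_a[1])
```

If the prompts of a repeated run are identical to those of the previous run, a prefix cache can hit even though the dataset is nominally random, so it is worth checking the seed when comparing consecutive runs.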

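Summary of relationships:

The prefix cache only helps requests with a repeated prefix; it is almost ineffective on a random dataset
The longer the input, the greater the randomness, the lower the cache hit rate, and the higher the TTFT
To test the best-case effect of the prefix cache, use a highly repetitive custom dataset or the sonnet dataset

Would you like a code example or a more detailed explanation of the mechanism?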

Sources:

Prefix Cache Aware Scheduling

maodou-ma · June 13, 2025, 3:51am

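If the prefix cache is almost ineffective on a random dataset, why is TTFT in runs 2 to N clearly better than in run 1, when all of them use the random dataset?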

RunLLM · June 13, 2025, 3:52am

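Your understanding is right: the prefix cache only helps requests with a repeated prefix, and it is almost ineffective on a random dataset.
If the dataset is completely random, every request should in theory have a different prefix, the prefix cache hit rate should be close to 0, and TTFT (time to first token) in runs 2 to N should not be clearly better than in run 1. [Official notes]

In practice, however, TTFT in runs 2 to N is better than in run 1 mainly because the first inference triggers loading and compilation of model weights, operators, JIT kernels, and so on, while later requests reuse what has already been loaded, which lowers TTFT. This is unrelated to the prefix cache; it is a warm-start effect in lower layers such as PyTorch and Triton. [Related discussion]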

Sources:

vllm 0.8.3 abnormal TTFT (too long) in the first serving

maodou-ma · June 13, 2025, 4:13am

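1. Does "first inference" mean the first request for a given input/output, or the first request overall (across different inputs/outputs)?
2. If it means the first request for a given input/output, can later requests with different inputs/outputs also reuse what has been loaded?
3. If it means the first request overall, the test results still show TTFT getting worse every time the input/output changes.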

RunLLM · June 13, 2025, 4:14am

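"First inference" means the first inference request received after the model process starts. Regardless of whether the input/output content is the same, it triggers initialization such as weight loading and operator/JIT kernel compilation. [Related notes]
All later requests (whether or not their input/output differs) reuse the loaded weights and compiled kernels, so TTFT drops significantly. Only a model restart or an environment change triggers the high first-inference latency again.
The TTFT fluctuation you see each time you change the input/output may be because some kernels (such as the Triton attention kernel) are JIT-compiled dynamically according to input shape. The first time a new shape is encountered there is extra compilation latency; later requests with the same shape reuse the compiled kernel and TTFT returns to normal. [Official discussion]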

maodou-ma · June 13, 2025, 4:23am

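1. In that case, when using --dataset_name random, is there no need to disable the prefix cache (since the prefix cache only helps requests with a repeated prefix and is almost ineffective on a random dataset)?
2. In my current tests TTFT in runs 2 to N is still clearly better than in run 1. If this really is "because some kernels (such as the Triton attention kernel) are JIT-compiled dynamically according to input shape, with extra compilation latency the first time a new shape is encountered, after which same-shape requests reuse the compiled kernel and TTFT returns to normal", how can such a large gap be avoided? In other words, how can a stable, realistic TTFT be measured, and which result is the real one: run 1 or runs 2 to N?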

RunLLM · June 13, 2025, 4:25am

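Should the prefix cache be disabled when using --dataset_name random?
Yes, disabling it is recommended. The prefix cache only helps requests with a repeated prefix, and the random dataset generates each input randomly, so prefixes almost never hit. Enabling it brings no real performance gain and may instead distort the benchmark's measurement of true performance. [Official notes]
TTFT in runs 2 to N is clearly better than in run 1. How can a stable, realistic TTFT be measured?
TTFT is higher in the first round and lower afterwards mainly because kernels such as Triton's are JIT-compiled the first time a new shape is encountered, and later requests with the same shape reuse the compiled kernel.

To measure worst-case cold-start latency, the run 1 result is the most representative.
To measure warm-start or actual serving behavior, the mean of runs 2 to N is closer to production.
To eliminate the JIT effect, warm up several times in advance, discard the first round, and then compute TTFT from the later runs. [Related discussion]
You can also reduce JIT-induced fluctuation by using fixed shapes or performing dummy runs in advance. [Official recommendation]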
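The "warm up, discard the first round, then average" suggestion can be sketched as follows. The helper `steady_state_mean` is hypothetical, and the sample values are the Mean TTFT figures reported for the four consecutive 12000/1000 runs later in this thread, used here only as example input.

```python
from statistics import mean


def steady_state_mean(ttft_ms, warmup=1):
    """Average TTFT after dropping the first `warmup` measurements."""
    if warmup < 0:
        raise ValueError("warmup must be non-negative")
    if len(ttft_ms) <= warmup:
        raise ValueError("need at least one measurement after the warmup runs")
    return mean(ttft_ms[warmup:])


# Mean TTFT (ms) of four consecutive runs, copied from the logs in this thread.
runs = [141.79, 128.15, 129.96, 130.70]

print("all runs:        ", round(mean(runs), 2))
print("after 1 warmup:  ", round(steady_state_mean(runs, warmup=1), 2))
```

Reporting both numbers side by side makes it visible how much of the difference comes from the discarded warmup run rather than hiding it.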

If you need warmup code or more detailed configuration advice, let me know.

Sources:

Unstable Prefix Caching performance

maodou-ma · June 13, 2025, 4:43am

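I cannot follow this statement, because the input lengths in a real service vary.
For example: request 1 has input 3000/output 1000; request 2 has input 3500/output 1000; request 3 has input 5000/output 1500; request 4 has input 3000/output 500. No two consecutive requests have the same input, so will TTFT be poor for all four?
This is what a real service looks like, so "taking the mean of runs 2 to N" is not appropriate, is it?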

RunLLM · June 13, 2025, 4:43am

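Your understanding is right: in a real production service the input length and content vary, and TTFT naturally differs considerably between requests. "Taking the mean of runs 2 to N is closer to production" applies only when the same input shape or content is requested repeatedly (for example benchmark warmup or cache hits). In a real service every request may have a different shape, and TTFT fluctuation is normal. [Related discussion]

Real production behavior should therefore be measured as the TTFT distribution over all requests (mean, P95, P99, and so on) rather than simply the mean of runs 2 to N. Only in benchmarking is the first result discarded, in order to remove cold-start effects such as first-time JIT compilation. For an actual service, focus on the overall distribution and tail latency.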
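A distribution summary of this kind takes only a few lines to compute. The sketch below uses linear interpolation between sorted samples, which is one of several common percentile definitions. The function names and the sample latencies are hypothetical and are not taken from any measurement in this thread.

```python
def percentile(samples, q):
    """Return the q-th percentile (0-100) using linear interpolation."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 <= q <= 100:
        raise ValueError("q must be between 0 and 100")
    ordered = sorted(samples)
    rank = (len(ordered) - 1) * q / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def summarize_ttft(samples):
    """Summarize per-request TTFT values (ms) as mean and tail percentiles."""
    return {
        "mean": sum(samples) / len(samples),
        "p50": percentile(samples, 50),
        "p95": percentile(samples, 95),
        "p99": percentile(samples, 99),
    }


# Hypothetical per-request TTFT values in milliseconds, with one slow outlier.
ttft_ms = [120.0, 135.0, 128.0, 9800.0, 131.0, 126.0, 140.0, 133.0]

for name, value in summarize_ttft(ttft_ms).items():
    print(f"{name}: {value:.2f}")
```

With an outlier in the data, the mean and the tail percentiles move much more than the median, which is why the distribution is more informative than any single number.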

Sources:

Unstable Prefix Caching performance

maodou-ma · June 13, 2025, 5:03am

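To confirm a few points again:
1. The prefix cache only helps requests with a repeated prefix, and it is almost ineffective on a random dataset (--dataset_name random).
2. The result above (TTFT in runs 2 to N better than in run 1) is caused by JIT compilation and is unrelated to the prefix cache.
3. Real production behavior should be measured as the TTFT distribution over all requests (mean, P95, P99, and so on) rather than simply the mean of runs 2 to N.
4. If the following command is executed twice in a row, is it true that only the first request of the first execution is the exception (JIT compilation, poor TTFT), and the remaining 9 requests (4+5) should all have good TTFT?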

python3 /vllm-workspace/benchmarks/benchmark_serving.py     --backend openai-chat     --model /data/models/Qwen2.5-72B     --served-model-name Qwen2.5-72B     --endpoint /v1/chat/completions     --port 8080     --dataset_name random     --random-input-len 3500     --random-output-len 1000     --random-range-ratio 0.1     --num-prompts 5     --max-concurrency 5

RunLLM · June 13, 2025, 5:04am

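Your understanding is entirely correct:

The prefix cache only helps requests with a repeated prefix, and it is almost ineffective on a random dataset (--dataset_name random). [Official notes]
TTFT in runs 2 to N is better than in run 1 mainly because of the first-time latency from JIT compilation (such as Triton kernels), which is unrelated to the prefix cache. [Related discussion]
Real production behavior should be measured as the TTFT distribution over all requests (mean, P95, P99) rather than simply the mean of runs 2 to N.
When the command above is executed twice in a row, only the first request of the first execution has a high TTFT due to JIT compilation. The remaining 9 requests (4+5) should all have good TTFT, unless a new input shape triggers JIT compilation again. [Official discussion]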

maodou-ma · June 13, 2025, 5:44am

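Here are 5 consecutive tests: "input 7000/output 3000" once, then "input 12000/output 1000" 4 times. According to the reasoning above, the first "input 12000/output 1000" test should have a very poor TTFT, but that is not what actually happens. What is the reason?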

root@2f77277da063:/vllm-workspace# python3 /vllm-workspace/benchmarks/benchmark_serving.py     --backend openai-chat     --model /data/models/Qwen2.5-72B     --served-model-name Qwen2.5-72B     --endpoint /v1/chat/completions     --port 8080     --dataset_name random     --random-input-len 7000     --random-output-len 3000     --random-range-ratio 0.1     --num-prompts 10     --max-concurrency 10
INFO 06-12 05:09:59 [__init__.py:239] Automatically detected platform cuda.
Namespace(backend='openai-chat', base_url=None, host='127.0.0.1', port=8080, endpoint='/v1/chat/completions', dataset_name='random', dataset_path=None, max_concurrency=10, model='/data/models/Qwen2.5-72B', tokenizer=None, use_beam_search=False, num_prompts=10, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=7000, random_output_len=3000, random_range_ratio=0.1, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name='Qwen2.5-72B', lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 10
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:31<00:00,  3.13s/it]
============ Serving Benchmark Result ============
Successful requests:                     10        
Benchmark duration (s):                  31.26     
Total input tokens:                      71443     
Total generated tokens:                  941       
Request throughput (req/s):              0.32      
Output token throughput (tok/s):         30.11     
Total Token throughput (tok/s):          2315.84   
---------------Time to First Token----------------
Mean TTFT (ms):                          9755.54   
Median TTFT (ms):                        6965.07   
P99 TTFT (ms):                           25874.90  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          174.64    
Median TPOT (ms):                        200.87    
P99 TPOT (ms):                           256.89    
---------------Inter-token Latency----------------
Mean ITL (ms):                           174.30    
Median ITL (ms):                         58.14     
P99 ITL (ms):                            1532.39   
==================================================
root@2f77277da063:/vllm-workspace# 
root@2f77277da063:/vllm-workspace# 
root@2f77277da063:/vllm-workspace# 
root@2f77277da063:/vllm-workspace# python3 /vllm-workspace/benchmarks/benchmark_serving.py     --backend openai-chat     --model /data/models/Qwen2.5-72B     --served-model-name Qwen2.5-72B     --endpoint /v1/chat/completions     --port 8080     --dataset_name random     --random-input-len 12000     --random-output-len 1000     --random-range-ratio 0.1     --num-prompts 1     --max-concurrency 1 
INFO 06-12 05:11:30 [__init__.py:239] Automatically detected platform cuda.
Namespace(backend='openai-chat', base_url=None, host='127.0.0.1', port=8080, endpoint='/v1/chat/completions', dataset_name='random', dataset_path=None, max_concurrency=1, model='/data/models/Qwen2.5-72B', tokenizer=None, use_beam_search=False, num_prompts=1, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=12000, random_output_len=1000, random_range_ratio=0.1, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name='Qwen2.5-72B', lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:04<00:00,  4.39s/it]
============ Serving Benchmark Result ============
Successful requests:                     1         
Benchmark duration (s):                  4.39      
Total input tokens:                      12453     
Total generated tokens:                  94        
Request throughput (req/s):              0.23      
Output token throughput (tok/s):         21.43     
Total Token throughput (tok/s):          2860.54   
---------------Time to First Token----------------
Mean TTFT (ms):                          141.79    
Median TTFT (ms):                        141.79    
P99 TTFT (ms):                           141.79    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          45.62     
Median TPOT (ms):                        45.62     
P99 TPOT (ms):                           45.62     
---------------Inter-token Latency----------------
Mean ITL (ms):                           45.13     
Median ITL (ms):                         45.58     
P99 ITL (ms):                            46.38     
==================================================
root@2f77277da063:/vllm-workspace# 
root@2f77277da063:/vllm-workspace# 
root@2f77277da063:/vllm-workspace# 
root@2f77277da063:/vllm-workspace# python3 /vllm-workspace/benchmarks/benchmark_serving.py     --backend openai-chat     --model /data/models/Qwen2.5-72B     --served-model-name Qwen2.5-72B     --endpoint /v1/chat/completions     --port 8080     --dataset_name random     --random-input-len 12000     --random-output-len 1000     --random-range-ratio 0.1     --num-prompts 1     --max-concurrency 1
INFO 06-12 05:14:48 [__init__.py:239] Automatically detected platform cuda.
Namespace(backend='openai-chat', base_url=None, host='127.0.0.1', port=8080, endpoint='/v1/chat/completions', dataset_name='random', dataset_path=None, max_concurrency=1, model='/data/models/Qwen2.5-72B', tokenizer=None, use_beam_search=False, num_prompts=1, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=12000, random_output_len=1000, random_range_ratio=0.1, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name='Qwen2.5-72B', lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:04<00:00,  4.37s/it]
============ Serving Benchmark Result ============
Successful requests:                     1         
Benchmark duration (s):                  4.37      
Total input tokens:                      12453     
Total generated tokens:                  94        
Request throughput (req/s):              0.23      
Output token throughput (tok/s):         21.50     
Total Token throughput (tok/s):          2870.11   
---------------Time to First Token----------------
Mean TTFT (ms):                          128.15    
Median TTFT (ms):                        128.15    
P99 TTFT (ms):                           128.15    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          45.61     
Median TPOT (ms):                        45.61     
P99 TPOT (ms):                           45.61     
---------------Inter-token Latency----------------
Mean ITL (ms):                           45.12     
Median ITL (ms):                         45.56     
P99 ITL (ms):                            46.85     
==================================================
root@2f77277da063:/vllm-workspace# 
root@2f77277da063:/vllm-workspace# 
root@2f77277da063:/vllm-workspace# python3 /vllm-workspace/benchmarks/benchmark_serving.py     --backend openai-chat     --model /data/models/Qwen2.5-72B     --served-model-name Qwen2.5-72B     --endpoint /v1/chat/completions     --port 8080     --dataset_name random     --random-input-len 12000     --random-output-len 1000     --random-range-ratio 0.1     --num-prompts 1     --max-concurrency 1
INFO 06-12 05:15:39 [__init__.py:239] Automatically detected platform cuda.
Namespace(backend='openai-chat', base_url=None, host='127.0.0.1', port=8080, endpoint='/v1/chat/completions', dataset_name='random', dataset_path=None, max_concurrency=1, model='/data/models/Qwen2.5-72B', tokenizer=None, use_beam_search=False, num_prompts=1, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=12000, random_output_len=1000, random_range_ratio=0.1, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name='Qwen2.5-72B', lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:04<00:00,  4.37s/it]
============ Serving Benchmark Result ============
Successful requests:                     1         
Benchmark duration (s):                  4.37      
Total input tokens:                      12453     
Total generated tokens:                  94        
Request throughput (req/s):              0.23      
Output token throughput (tok/s):         21.51     
Total Token throughput (tok/s):          2870.89   
---------------Time to First Token----------------
Mean TTFT (ms):                          129.96    
Median TTFT (ms):                        129.96    
P99 TTFT (ms):                           129.96    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          45.57     
Median TPOT (ms):                        45.57     
P99 TPOT (ms):                           45.57     
---------------Inter-token Latency----------------
Mean ITL (ms):                           45.09     
Median ITL (ms):                         45.54     
P99 ITL (ms):                            45.89     
==================================================
root@2f77277da063:/vllm-workspace# 
root@2f77277da063:/vllm-workspace# 
root@2f77277da063:/vllm-workspace# 
root@2f77277da063:/vllm-workspace# python3 /vllm-workspace/benchmarks/benchmark_serving.py     --backend openai-chat     --model /data/models/Qwen2.5-72B     --served-model-name Qwen2.5-72B     --endpoint /v1/chat/completions     --port 8080     --dataset_name random     --random-input-len 12000     --random-output-len 1000     --random-range-ratio 0.1     --num-prompts 1     --max-concurrency 1
INFO 06-12 05:17:55 [__init__.py:239] Automatically detected platform cuda.
Namespace(backend='openai-chat', base_url=None, host='127.0.0.1', port=8080, endpoint='/v1/chat/completions', dataset_name='random', dataset_path=None, max_concurrency=1, model='/data/models/Qwen2.5-72B', tokenizer=None, use_beam_search=False, num_prompts=1, logprobs=None, request_rate=inf, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, save_detailed=False, append_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=12000, random_output_len=1000, random_range_ratio=0.1, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, top_p=None, top_k=None, min_p=None, temperature=None, tokenizer_mode='auto', served_model_name='Qwen2.5-72B', lora_modules=None)
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 1
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:04<00:00,  4.37s/it]
============ Serving Benchmark Result ============
Successful requests:                     1         
Benchmark duration (s):                  4.37      
Total input tokens:                      12453     
Total generated tokens:                  94        
Request throughput (req/s):              0.23      
Output token throughput (tok/s):         21.51     
Total Token throughput (tok/s):          2871.10   
---------------Time to First Token----------------
Mean TTFT (ms):                          130.70    
Median TTFT (ms):                        130.70    
P99 TTFT (ms):                           130.70    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          45.56     
Median TPOT (ms):                        45.56     
P99 TPOT (ms):                           45.56     
---------------Inter-token Latency----------------
Mean ITL (ms):                           45.08     
Median ITL (ms):                         45.53     
P99 ITL (ms):                            45.87     
==================================================
root@2f77277da063:/vllm-workspace#
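To compare runs like the five above without reading each result block by eye, the metric lines can be pulled out with a small parser. This is a sketch that assumes the "name: number" layout seen in the output above; `parse_benchmark_result` is a hypothetical helper, not part of vLLM, and the sample text is an excerpt of the second run.

```python
import re

# A metric line is "<name>:" followed by whitespace and a single number.
METRIC_LINE = re.compile(r"^(?P<name>[A-Za-z0-9 /()]+):\s+(?P<value>-?\d+(?:\.\d+)?)\s*$")


def parse_benchmark_result(text):
    """Extract numeric metrics from a 'Serving Benchmark Result' block."""
    metrics = {}
    for line in text.splitlines():
        match = METRIC_LINE.match(line)
        if match:
            metrics[match.group("name").strip()] = float(match.group("value"))
    return metrics


sample = """\
============ Serving Benchmark Result ============
Successful requests:                     1
Benchmark duration (s):                  4.39
Total input tokens:                      12453
---------------Time to First Token----------------
Mean TTFT (ms):                          141.79
Median TTFT (ms):                        141.79
P99 TTFT (ms):                           141.79
==================================================
"""

metrics = parse_benchmark_result(sample)
print(metrics["Total input tokens"], metrics["Mean TTFT (ms)"])
```

Collecting "Total input tokens" next to the TTFT values also makes it easy to spot when consecutive runs received exactly the same amount of input.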

RunLLM · June 13, 2025, 5:45am

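What you observed is:

You ran "input 7000/output 3000" once and then "input 12000/output 1000" several times. In theory the first "input 12000/output 1000" run should have a high TTFT (because a new shape triggers JIT compilation the first time it is seen), but the actual TTFT is low and almost identical to the later runs.

Possible causes:

vLLM's JIT kernels (such as the Triton attention kernel) are usually JIT-compiled and cached according to the maximum input length or the maximum batch shape. You had already run "input 7000/output 3000" once, so the JIT kernel was already compiled. When a longer input (such as 12000/1000) arrived later, vLLM may have reused or extended the compiled kernel automatically, or may have allocated a sufficiently large shape in advance, so no noticeable JIT latency was triggered again. [Related discussion]
Another possibility is that during the first inference vLLM warms up or captures kernels for a set of common shapes. As long as the input length does not exceed the largest captured shape, no noticeable JIT latency occurs afterwards. [Official notes]
Only when a larger shape is encountered for the first time (for example far beyond the previous maximum input length) does JIT compilation happen again and TTFT become high. You can try running "input 3000" first and then "input 20000" directly; TTFT usually rises again.

Conclusion:

Because of vLLM's JIT kernel cache and shape capture mechanism, once a large shape has been run, later requests with the same or a smaller shape can reuse it directly and TTFT no longer fluctuates noticeably.
This is also why, in a real production service, TTFT fluctuation appears mainly when a new shape is first encountered or right after the model starts, and is mostly stable afterwards.

To verify further, try more extreme input lengths, or restart the server and run inference with the largest shape first.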

maodou-ma · June 13, 2025, 5:48am

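Am I talking to a bot or a real person? The earlier answers contain parts that are unreasonable or poorly explained. I would like to discuss this with community experts.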

maodou-ma · June 13, 2025, 7:17am

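After testing an "input 12000/output 3000" request and then testing an "input 7000/output 1000" request, the first run again has a very poor TTFT. The explanation above cannot account for this case.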
