Is there a way to separately measure the time spent in the prefill and decode stages in vLLM offline inference?

Is there a way to separately track the time spent in the prefill and decode stages in vllm offline inference? My current script is as follows:

prompt_length = 4096
prompts = [generate_random_prompt(prompt_length) for _ in range(1)]

sampling_params = SamplingParams(temperature=0, top_p=0.95, max_tokens=512)

path = "/workspace/Qwen3-8B"

llm = LLM(model=path, trust_remote_code=True, enforce_eager=True, block_size=32,
          dtype="bfloat16", gpu_memory_utilization=0.90, tensor_parallel_size=1)

outputs = llm.generate(prompts, sampling_params)

Yes, you can separately track prefill and decode times in vLLM offline inference by using profiling tools or by instrumenting the model code. The official vLLM examples do not provide built-in per-stage timing in the generate API, but you can use the built-in profiler (profiler_config with "profiler": "torch") to collect detailed timing, or manually add timing code around the relevant model calls. For even finer granularity, you may modify the model’s forward method to print or log timing for each stage, as discussed in this GitHub issue.

For a simple approach, wrap the llm.generate() call with timing code for total latency, and use the profiler for breakdowns. If you want to print prefill and decode times for each token, you must modify the model code as shown in the referenced issue. Would you like a step-by-step example of either method?
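For the total-latency wrapper mentioned above, here is a minimal sketch (the `timed` helper is hypothetical, not a vLLM API; the `llm`, `prompts`, and `sampling_params` names are assumed from the script in the question):

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# With the script above:
# outputs, total_s = timed(llm.generate, prompts, sampling_params)
#
# A crude prefill estimate: rerun with max_tokens=1, which measures roughly
# prefill plus one decode step; decode time is then total_s minus that.
```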

I checked Simple Profiling - vLLM Docs and changed my script to:

llm = LLM(model=path, trust_remote_code=True, enforce_eager=True, block_size=32,
          dtype="bfloat16", gpu_memory_utilization=0.90, tensor_parallel_size=1,
          profiler_config={"profiler": "torch", "torch_profiler_dir": "./vllm_profile/longlengthcase"},)

got an error:
TypeError: EngineArgs.__init__() got an unexpected keyword argument 'profiler_config'

how to fix it?
vllm version is 0.11.0
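For reference, vLLM's Simple Profiling example enables the torch profiler through the `VLLM_TORCH_PROFILER_DIR` environment variable rather than a constructor argument; a sketch of that approach follows (based on the documented example, not verified against 0.11.0 specifically):

```python
import os

# Must be set before the LLM is constructed; vLLM reads it at engine startup.
os.environ["VLLM_TORCH_PROFILER_DIR"] = "./vllm_profile/longlengthcase"

# from vllm import LLM, SamplingParams
# llm = LLM(model=path, trust_remote_code=True, enforce_eager=True,
#           block_size=32, dtype="bfloat16",
#           gpu_memory_utilization=0.90, tensor_parallel_size=1)
# llm.start_profile()
# outputs = llm.generate(prompts, sampling_params)
# llm.stop_profile()   # traces are written to VLLM_TORCH_PROFILER_DIR
```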