Is there a way to separately measure the time spent in the prefill and decode stages in vLLM offline inference?

Is there a way to separately track the time spent in the prefill and decode stages in vllm offline inference? My current script is as follows:

prompt_length = 4096
prompts = [generate_random_prompt(prompt_length) for _ in range(1)]

sampling_params = SamplingParams(temperature=0, top_p=0.95, max_tokens=512)

path = "/workspace/Qwen3-8B"

llm = LLM(model=path, trust_remote_code=True, enforce_eager=True, block_size=32,
          dtype="bfloat16", gpu_memory_utilization=0.90, tensor_parallel_size=1)

outputs = llm.generate(prompts, sampling_params)

Yes, you can separately track prefill and decode times in vLLM offline inference by using profiling tools or by instrumenting the model code. The official vLLM examples do not provide built-in per-stage timing in the generate API, but you can use the built-in profiler (profiler_config with "profiler": "torch") to collect detailed timing, or manually add timing code around the relevant model calls. For even finer granularity, you may modify the model’s forward method to print or log timing for each stage, as discussed in this GitHub issue.

For a simple approach, wrap the llm.generate() call with timing code for total latency, and use the profiler for breakdowns. If you want to print prefill and decode times for each token, you must modify the model code as shown in the referenced issue. Would you like a step-by-step example of either method?
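For the total-latency wrapper mentioned above, here is a minimal sketch (the `timed` helper is hypothetical, not a vLLM API; the `llm`, `prompts`, and `sampling_params` names are assumed from the script in the question):

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# With the script above:
# outputs, total_s = timed(llm.generate, prompts, sampling_params)
#
# A crude prefill estimate: rerun with max_tokens=1, which measures roughly
# prefill plus one decode step; decode time is then total_s minus that.
```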

I checked Simple Profiling - vLLM Docs and changed my script to:

llm = LLM(model=path, trust_remote_code=True, enforce_eager=True, block_size=32,
          dtype="bfloat16", gpu_memory_utilization=0.90, tensor_parallel_size=1,
          profiler_config={"profiler": "torch", "torch_profiler_dir": "./vllm_profile/longlengthcase"},)

got an error:
TypeError: EngineArgs.__init__() got an unexpected keyword argument 'profiler_config'

how to fix it?
vllm version is 0.11.0
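For reference, vLLM's Simple Profiling example enables the torch profiler through the `VLLM_TORCH_PROFILER_DIR` environment variable rather than a constructor argument; a sketch of that approach follows (based on the documented example, not verified against 0.11.0 specifically):

```python
import os

# Must be set before the LLM is constructed; vLLM reads it at engine startup.
os.environ["VLLM_TORCH_PROFILER_DIR"] = "./vllm_profile/longlengthcase"

# from vllm import LLM, SamplingParams
# llm = LLM(model=path, trust_remote_code=True, enforce_eager=True,
#           block_size=32, dtype="bfloat16",
#           gpu_memory_utilization=0.90, tensor_parallel_size=1)
# llm.start_profile()
# outputs = llm.generate(prompts, sampling_params)
# llm.stop_profile()   # traces are written to VLLM_TORCH_PROFILER_DIR
```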