Does LLM.generate() differentiate between Prefill and Decode phases? Where to insert timers?

I am working with LLM.generate() from vLLM and would like to measure the time spent in the Prefill and Decode phases separately.

After reading the code, I understand that generate() internally calls _run_engine(), which loops over self.llm_engine.step() to perform inference.

My questions are:

  1. Does llm.generate() internally distinguish between the Prefill and Decode phases?

  2. If this distinction exists, where is the best place in the codebase to insert timing measurements for these two phases?

    • I would like to print or record prefill_time and decode_time separately for each prompt passed to generate().
    • Is LLMEngine.step() in vllm/engine/llm_engine.py the recommended place to insert such timing logic?

This information would be very helpful for latency profiling and benchmarking, especially in performance-sensitive applications.

Thank you!

vLLM does distinguish between the Prefill and Decode phases internally, but the LLM.generate() API does not expose separate timing for them. The recommended way to measure prefill and decode times is to instrument the LLMEngine.step() method in vllm/engine/llm_engine.py, since that is where the engine schedules and executes both phases. You can insert timing logic around the relevant code paths in this method to record per-request durations. This approach aligns with vLLM's internal metrics design, where intervals such as prefill and decode are derived from event timestamps recorded in the engine core and processed in the frontend or output-handler loop.

For more advanced or automated profiling, vLLM ships profiling scripts and examples (see profiling.py and the related docs), but for custom per-request timing, instrumenting LLMEngine.step() is the most direct approach supported by the documentation and code comments. There is currently no built-in parameter on generate() that returns prefill/decode times. See LLMEngine.step() in vllm.engine.llm_engine, the metrics design doc, and the profiling example for further details.
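If modifying vLLM's source is not an option, one non-invasive alternative is to wrap the engine's step() from the outside. The sketch below assumes only that the LLM object exposes an llm_engine with a callable step() (the same code path described above); wrap_step_with_timer and the first-call-is-prefill heuristic are illustrative, not part of vLLM's API:

```python
import time

# Hedged sketch: wrap an existing engine's step() to time every call.
# Heuristic: for a single prompt, the first timed call is dominated by
# prefill and later calls by decode. This is an approximation, not an
# official vLLM API; wrap_step_with_timer is a name invented here.

def wrap_step_with_timer(engine, timings):
    """Replace engine.step with a timed wrapper; durations append to `timings`."""
    original_step = engine.step

    def timed_step(*args, **kwargs):
        start = time.perf_counter()
        outputs = original_step(*args, **kwargs)
        timings.append(time.perf_counter() - start)
        return outputs

    engine.step = timed_step
    return engine
```

With this in place for a single prompt, timings[0] approximates the prefill time and sum(timings[1:]) approximates the total decode time.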

Would you like a step-by-step guide or code snippet for adding this timing logic?

Yes, I need a step-by-step guide or code snippet for adding this timing logic.

Certainly! Here’s a step-by-step guide and code snippet for adding prefill and decode timing logic to vLLM by instrumenting LLMEngine.step() in vllm/engine/llm_engine.py:

Step 1: Locate LLMEngine.step()

  • Open vllm/engine/llm_engine.py and find the step() method. This method schedules, executes, and processes both prefill and decode phases for each batch of requests.

Step 2: Insert Timing Logic

  • Use Python’s time.monotonic() or time.perf_counter() to record timestamps before and after the prefill and decode operations. The prefill phase typically occurs in the first step for each request, while subsequent steps are decode (metrics design).

Step 3: Example Code Snippet
Below is a simplified example of how you might add timing logic. Adjust the placement based on your model’s scheduling and batch structure.

import time

class LLMEngine:
    # ... existing code ...

    def step(self):
        # Start of step
        step_start = time.monotonic()
        # NOTE: is_prefill_phase() / is_decode_phase() are illustrative
        # placeholders, not real vLLM methods; in practice the phase is
        # tracked per sequence group, not per step.
        # Schedule and execute prefill
        if self.is_prefill_phase():
            prefill_start = time.monotonic()
            # ... prefill logic ...
            prefill_time = time.monotonic() - prefill_start
            print(f"Prefill time: {prefill_time:.6f} seconds")
        # Schedule and execute decode
        if self.is_decode_phase():
            decode_start = time.monotonic()
            # ... decode logic ...
            decode_time = time.monotonic() - decode_start
            print(f"Decode time: {decode_time:.6f} seconds")
        # ... rest of step logic ...
        step_time = time.monotonic() - step_start
        print(f"Total step time: {step_time:.6f} seconds")

Step 4: Per-Request Timing

  • For per-request timing, store timestamps in each request’s metadata (e.g., in the sequence group or request object) and print or log them after each phase (metrics design).
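As a concrete illustration of this bookkeeping, here is a hedged sketch that accumulates phase durations in a plain dict keyed by request id; PhaseTimer, request_id, and the is_prefill flag are names invented for this example and must be mapped onto whatever your vLLM version actually exposes:

```python
import time
from collections import defaultdict

# Hedged sketch of per-request phase accounting. `request_id` and the
# `is_prefill` flag are stand-ins for whatever your vLLM version exposes
# (e.g. a sequence group's request id and an is_prefill()-style check).

class PhaseTimer:
    def __init__(self):
        # One {"prefill": ..., "decode": ...} accumulator per request.
        self.timings = defaultdict(lambda: {"prefill": 0.0, "decode": 0.0})

    def record(self, request_id, is_prefill, duration):
        """Add `duration` seconds to the request's prefill or decode total."""
        phase = "prefill" if is_prefill else "decode"
        self.timings[request_id][phase] += duration

    def report(self, request_id):
        t = self.timings[request_id]
        return (f"request {request_id}: "
                f"prefill={t['prefill']:.6f}s decode={t['decode']:.6f}s")
```

Inside step(), you would call record() once per sequence group per iteration, then log report() when the request finishes.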

Step 5: Validate and Benchmark

  • Run your modified vLLM and observe the printed or logged prefill and decode times for each request or batch.

Let me know if you need a more detailed breakdown or help with a specific model or batch structure.

Is this a real variable defined by vLLM? Also, step() is triggered multiple times to perform token decoding; how do you get the Prefill/Decode stage status inside the step() function?

is_prefill_phase and is_decode_phase are not real methods defined by vLLM; they were placeholders in the previous example. In vLLM, the step() function is called repeatedly, and each SequenceGroup (request) tracks its own phase internally. The phase can be determined by checking the status of each sequence group, for example via seq_group.is_prefill() or a similar property (see code reference).

To distinguish prefill from decode within step(), inspect the status of each sequence group processed in that iteration: if seq_group.is_prefill() returns True, count the work as prefill; otherwise treat it as decode. Because each sequence group carries its own phase state, this lets you time both phases accurately for each request even though step() runs many times (metrics design).
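For illustration, here is a hedged sketch of how that per-group check could drive the timing, assuming the scheduled sequence groups expose an is_prefill() method (names vary across vLLM versions, so verify against your checkout); time_step_phases and execute_model are stand-ins invented for this example:

```python
import time

# Illustrative sketch only: `is_prefill()` mirrors a method seen on vLLM's
# SequenceGroup, but verify it against your version. `execute_model` stands
# in for whatever callable actually runs the forward pass in step().

def time_step_phases(scheduled_seq_groups, execute_model):
    """Time one engine step, labeling it by whether any group is in prefill."""
    has_prefill = any(sg.is_prefill() for sg in scheduled_seq_groups)
    start = time.perf_counter()
    outputs = execute_model()
    elapsed = time.perf_counter() - start
    phase = "prefill" if has_prefill else "decode"
    print(f"{phase} step took {elapsed:.6f} s")
    return outputs
```

Note that with chunked prefill enabled, a single step can mix prefill and decode groups, so per-group (rather than per-step) accounting is more precise.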

Note that exact attribute and method names vary across vLLM versions, so verify seq_group.is_prefill() against the version you are running.
