Does LLM.generate() differentiate between Prefill and Decode phases? Where to insert timers?

I am working with LLM.generate() from vLLM and would like to measure the time spent in the Prefill and Decode phases separately.

After reading the code, I understand that generate() internally calls _run_engine(), which loops over self.llm_engine.step() to perform inference.

My questions are:

  1. Does llm.generate() internally distinguish between the Prefill and Decode phases?

  2. If this distinction exists, where is the best place in the codebase to insert timing measurements for these two phases?

    • I would like to print or record prefill_time and decode_time separately for each prompt passed to generate().
    • Is LLMEngine.step() in vllm/engine/llm_engine.py the recommended place to insert such timing logic?

This information would be very helpful for latency profiling and benchmarking, especially in performance-sensitive applications.

Thank you!

vLLM does distinguish between the Prefill and Decode phases internally, but the LLM.generate() API does not expose separate timing for them. The recommended way to measure prefill and decode times is to instrument the LLMEngine.step() method in vllm/engine/llm_engine.py, since that is where the engine schedules and executes both phases. You can insert timing logic around the relevant code paths in this method to record per-request durations. This approach aligns with vLLM's internal metrics design, where intervals such as prefill and decode are derived from event timestamps recorded in the engine core and processed in the frontend or output-handler loop.

For more advanced or automated profiling, vLLM ships profiling scripts and examples (see profiling.py and the related docs), but for custom per-request timing, instrumenting LLMEngine.step() is the most direct approach supported by the documentation and code comments. There is currently no built-in parameter on generate() that returns prefill/decode times. See LLMEngine.step() in vllm.engine.llm_engine, the metrics design doc, and the profiling example for further details.
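If modifying vLLM's source is not an option, one non-invasive alternative is to wrap the engine's step() from the outside. The sketch below assumes only that the LLM object exposes an llm_engine with a callable step() (the same code path described above); wrap_step_with_timer and the first-call-is-prefill heuristic are illustrative, not part of vLLM's API:

```python
import time

# Hedged sketch: wrap an existing engine's step() to time every call.
# Heuristic: for a single prompt, the first timed call is dominated by
# prefill and later calls by decode. This is an approximation, not an
# official vLLM API; wrap_step_with_timer is a name invented here.

def wrap_step_with_timer(engine, timings):
    """Replace engine.step with a timed wrapper; durations append to `timings`."""
    original_step = engine.step

    def timed_step(*args, **kwargs):
        start = time.perf_counter()
        outputs = original_step(*args, **kwargs)
        timings.append(time.perf_counter() - start)
        return outputs

    engine.step = timed_step
    return engine
```

With this in place for a single prompt, timings[0] approximates the prefill time and sum(timings[1:]) approximates the total decode time.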

Would you like a step-by-step guide or code snippet for adding this timing logic?

Yes, I need a step-by-step guide or code snippet for adding this timing logic.

Certainly! Here’s a step-by-step guide and code snippet for adding prefill and decode timing logic to vLLM by instrumenting LLMEngine.step() in vllm/engine/llm_engine.py:

Step 1: Locate LLMEngine.step()

  • Open vllm/engine/llm_engine.py and find the step() method. This method schedules, executes, and processes both prefill and decode phases for each batch of requests.

Step 2: Insert Timing Logic

  • Use Python’s time.monotonic() or time.perf_counter() to record timestamps before and after the prefill and decode operations. The prefill phase typically occurs in the first step for each request, while subsequent steps are decode (metrics design).

Step 3: Example Code Snippet
Below is a simplified example of how you might add timing logic. Adjust the placement based on your model’s scheduling and batch structure.

import time

class LLMEngine:
    # ... existing code ...

    def step(self):
        # Start of step
        step_start = time.monotonic()
        # NOTE: is_prefill_phase() / is_decode_phase() are illustrative
        # placeholders, not real vLLM methods; in practice the phase is
        # tracked per sequence group, not per step.
        # Schedule and execute prefill
        if self.is_prefill_phase():
            prefill_start = time.monotonic()
            # ... prefill logic ...
            prefill_time = time.monotonic() - prefill_start
            print(f"Prefill time: {prefill_time:.6f} seconds")
        # Schedule and execute decode
        if self.is_decode_phase():
            decode_start = time.monotonic()
            # ... decode logic ...
            decode_time = time.monotonic() - decode_start
            print(f"Decode time: {decode_time:.6f} seconds")
        # ... rest of step logic ...
        step_time = time.monotonic() - step_start
        print(f"Total step time: {step_time:.6f} seconds")

Step 4: Per-Request Timing

  • For per-request timing, store timestamps in each request’s metadata (e.g., in the sequence group or request object) and print or log them after each phase (metrics design).
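As a concrete illustration of this bookkeeping, here is a hedged sketch that accumulates phase durations in a plain dict keyed by request id; PhaseTimer, request_id, and the is_prefill flag are names invented for this example and must be mapped onto whatever your vLLM version actually exposes:

```python
import time
from collections import defaultdict

# Hedged sketch of per-request phase accounting. `request_id` and the
# `is_prefill` flag are stand-ins for whatever your vLLM version exposes
# (e.g. a sequence group's request id and an is_prefill()-style check).

class PhaseTimer:
    def __init__(self):
        # One {"prefill": ..., "decode": ...} accumulator per request.
        self.timings = defaultdict(lambda: {"prefill": 0.0, "decode": 0.0})

    def record(self, request_id, is_prefill, duration):
        """Add `duration` seconds to the request's prefill or decode total."""
        phase = "prefill" if is_prefill else "decode"
        self.timings[request_id][phase] += duration

    def report(self, request_id):
        t = self.timings[request_id]
        return (f"request {request_id}: "
                f"prefill={t['prefill']:.6f}s decode={t['decode']:.6f}s")
```

Inside step(), you would call record() once per sequence group per iteration, then log report() when the request finishes.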

Step 5: Validate and Benchmark

  • Run your modified vLLM and observe the printed or logged prefill and decode times for each request or batch.

Let me know if you need a more detailed breakdown or help with a specific model or batch structure.

Is this a real variable defined by vLLM? Also, step() is triggered multiple times to perform token decoding; how do you get the Prefill/Decode stage status inside the step() function?

is_prefill_phase and is_decode_phase are not real methods defined by vLLM; they were placeholders in the previous example. In vLLM, the step() function is called repeatedly, and each SequenceGroup (request) tracks its own phase internally. The phase can be determined by checking the status of each sequence group, for example via seq_group.is_prefill() or a similar property (see code reference).

To distinguish prefill from decode within step(), inspect the status of each sequence group processed in that iteration: if seq_group.is_prefill() returns True, count the work as prefill; otherwise treat it as decode. Because each sequence group carries its own phase state, this lets you time both phases accurately for each request even though step() runs many times (metrics design).
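For illustration, here is a hedged sketch of how that per-group check could drive the timing, assuming the scheduled sequence groups expose an is_prefill() method (names vary across vLLM versions, so verify against your checkout); time_step_phases and execute_model are stand-ins invented for this example:

```python
import time

# Illustrative sketch only: `is_prefill()` mirrors a method seen on vLLM's
# SequenceGroup, but verify it against your version. `execute_model` stands
# in for whatever callable actually runs the forward pass in step().

def time_step_phases(scheduled_seq_groups, execute_model):
    """Time one engine step, labeling it by whether any group is in prefill."""
    has_prefill = any(sg.is_prefill() for sg in scheduled_seq_groups)
    start = time.perf_counter()
    outputs = execute_model()
    elapsed = time.perf_counter() - start
    phase = "prefill" if has_prefill else "decode"
    print(f"{phase} step took {elapsed:.6f} s")
    return outputs
```

Note that with chunked prefill enabled, a single step can mix prefill and decode groups, so per-group (rather than per-step) accounting is more precise.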

Note that exact attribute and method names vary across vLLM versions, so verify seq_group.is_prefill() against the version you are running.
