Why Does Decode Forward on PP Stage 0 Appear to Precede Prefill Forward on PP Stage 1 for the Same Request?

Summary
We instrument vLLM 0.20’s GPU execution phases using time.time_ns() (CPU wall clock) across all PP worker processes. For a single autoregressive request running with tp=2, pp=2, we observe the following on the CPU timeline:
PP stage 0 GPU: [prefill forward] ===== [decode forward]
PP stage 1 GPU: [prefill forward] ======== [sample] → [decode forward]
↑ decode launch on stage 0 happens BEFORE prefill forward launch on stage 1
This appears to violate autoregressive ordering: you should not be able to compute token t+1 before token t is sampled. Is this a bug in our instrumentation, or can this actually happen?
Root Cause: CPU Launch Time ≠ GPU Execution Time
Per GPU, the order is correct. On each GPU’s CUDA stream, prefill forward completes before decode forward begins. The apparent disorder is purely a CPU-time artifact:
What we measure: time.time_ns() → CPU kernel launch time

  1. CPU on stage 0 launches decode forward at time T1
  2. CPU on stage 1 launches prefill forward at time T2 > T1
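The NCCL broadcast of prev_sampled_token_ids from stage 1 to stage 0 is enqueued on stage 0’s CUDA stream ahead of the decode forward kernel. On the GPU, the decode kernel therefore cannot start until the sampled token has arrived, and it reads the correct token. On the CPU, the launch call returns as soon as the work is enqueued, and time.time_ns() records that moment. Because no torch.cuda.Event is used to tie the measurement to the stream, the timestamp reflects when the decode forward was launched, not when it ran.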
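One way to check this is to record GPU-side timing next to the CPU launch timestamp. The sketch below is a minimal illustration, not vLLM code: time_phase and estimate_gpu_start_ns are hypothetical helper names, and it assumes a CUDA build of PyTorch and that the phase runs on the current stream. It uses torch.cuda.Event for the on-stream duration and takes a second time.time_ns() reading after the end event has completed, so the completion time can be compared across worker processes on the same host.

```python
import time


def estimate_gpu_start_ns(done_ns: int, gpu_elapsed_ms: float) -> int:
    """Back out an approximate GPU start time from a completion timestamp."""
    return done_ns - int(gpu_elapsed_ms * 1_000_000)


def time_phase(fn, *args, **kwargs):
    """Run one GPU phase and report CPU launch time next to GPU timing.

    Hypothetical helper for illustration; not part of vLLM.
    """
    import torch  # lazy import: only needed when a phase is actually timed

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    launch_ns = time.time_ns()     # what the original instrumentation records
    start.record()                 # enqueued on the current stream before the phase
    result = fn(*args, **kwargs)   # enqueues kernels; usually returns before they run
    end.record()                   # enqueued on the current stream after the phase

    end.synchronize()              # block the CPU until the GPU reaches `end`
    done_ns = time.time_ns()       # wall clock at (or just after) GPU completion
    gpu_ms = start.elapsed_time(end)

    return result, {
        "cpu_launch_ns": launch_ns,
        "gpu_done_ns": done_ns,
        "gpu_start_ns_est": estimate_gpu_start_ns(done_ns, gpu_ms),
        "gpu_elapsed_ms": gpu_ms,
    }
```

Note that end.synchronize() stalls the launching thread, so this perturbs the overlap being measured and is better suited to a one-off diagnostic run than to permanent instrumentation. The elapsed time also includes any time the stream spends waiting inside the phase, for example on a receive.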
Why vLLM’s Async Design Allows This
vLLM 0.20’s step_with_batch_queue (activated when pipeline_parallel_size > 1) allows the scheduler to issue a decode step before the prefill’s update_from_output completes. The engine core calls schedule() for decode while the prefill is still draining through downstream PP stages. This means:

  1. The CPU on stage 0 launches decode forward at T1
  2. The CPU on stage 1 launches prefill forward at T2 > T1
  3. But on the GPU, the NCCL receive for the sampled token is sequenced before the decode forward, preserving correctness
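The three steps above can be reproduced in a toy model. The sketch below is not vLLM code: the op names, launch times, and durations are made-up numbers chosen for illustration. It assumes each stage has one in-order stream, that an op starts only after its own launch, the previous op on its stream, and any cross-stage dependency have all finished, and that the token receive on stage 0 depends on sampling on stage 1.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Op:
    stage: int
    name: str
    launch: int                        # CPU enqueue time (what time.time_ns() sees)
    duration: int                      # GPU run time, arbitrary units
    waits_for: Optional[tuple] = None  # (stage, name) of a cross-stage dependency
    start: Optional[int] = None        # GPU start time, filled in by simulate()
    end: Optional[int] = None          # GPU end time, filled in by simulate()


def simulate(streams):
    """Compute GPU start/end times for ops on in-order per-stage streams."""
    by_key = {(op.stage, op.name): op for ops in streams.values() for op in ops}
    pending = len(by_key)
    while pending:
        progressed = False
        for ops in streams.values():
            prev_end = 0
            for op in ops:
                if op.end is not None:      # already resolved in an earlier pass
                    prev_end = op.end
                    continue
                dep = by_key[op.waits_for] if op.waits_for else None
                if dep is not None and dep.end is None:
                    break                   # stream is blocked behind this op
                op.start = max(op.launch, prev_end, dep.end if dep else 0)
                op.end = op.start + op.duration
                prev_end = op.end
                pending -= 1
                progressed = True
        if not progressed:
            raise RuntimeError("dependency cycle between streams")
    return by_key


streams = {
    0: [
        Op(0, "prefill_fwd", launch=0, duration=10),
        Op(0, "recv_token", launch=2, duration=1, waits_for=(1, "sample")),
        Op(0, "decode_fwd", launch=3, duration=5),   # T1 = 3
    ],
    1: [
        Op(1, "prefill_fwd", launch=4, duration=10,  # T2 = 4 > T1
           waits_for=(0, "prefill_fwd")),
        Op(1, "sample", launch=5, duration=1),
    ],
}

ops = simulate(streams)
for op in sorted(ops.values(), key=lambda o: o.start):
    print(f"stage {op.stage} {op.name:12s} launch={op.launch:2d} "
          f"gpu=[{op.start:2d}, {op.end:2d}]")
```

In this toy run the decode forward on stage 0 is launched at 3, before the prefill forward on stage 1 is launched at 4, yet on the GPU it starts at 22, after sampling on stage 1 ends at 21. Sorting by launch time gives the ordering seen in the instrumentation; sorting by GPU start time gives the autoregressive order.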

This is not a bug in vLLM or a violation of autoregressive ordering. The out-of-order timestamps are an artifact of measuring with time.time_ns() (CPU wall clock) rather than with GPU-side synchronization. vLLM’s async scheduling allows the CPU on PP stage 0 to launch the decode forward before the CPU on PP stage 1 launches the prefill forward. On the GPU, however, the NCCL broadcast of sampled tokens is inserted into the CUDA stream before the decode kernel, which enforces the correct execution order. The GPU therefore always processes token t before token t+1, and autoregressive correctness is preserved. Your instrumentation is observing CPU launch order, not actual GPU execution order. See Model Runner V2 Design and related discussion.

Would you like a step-by-step breakdown of how async scheduling and CUDA stream ordering guarantee correctness in this scenario?
