Summary
We instrument vLLM 0.20’s GPU execution phases using time.time_ns() (CPU wall clock) across all PP worker processes. For a single autoregressive request running with tp=2, pp=2, we observe the following on the CPU timeline:
PP stage 0 GPU: [prefill forward] ===== [decode forward]
PP stage 1 GPU: [prefill forward] ======== [sample] → [decode forward]
↑ decode launch on stage 0 happens BEFORE prefill forward launch on stage 1
This appears to violate autoregressive ordering: you should not be able to compute token t+1 before token t is sampled. Is this a bug in our instrumentation, or can this actually happen?
Root Cause: CPU Launch Time ≠ GPU Execution Time
Per GPU, the order is correct. On each GPU’s CUDA stream, prefill forward completes before decode forward begins. The apparent disorder is purely a CPU-time artifact:
What we measure:
- time.time_ns() → CPU kernel launch time
- CPU on stage 0 launches decode forward at time T1
- CPU on stage 1 launches prefill forward at time T2 > T1
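The NCCL broadcast of prev_sampled_token_ids from stage 1 to stage 0 is inserted into stage 0’s CUDA stream before the decode forward kernel. Operations on a CUDA stream execute in the order they were enqueued, so the decode kernel cannot start until the broadcast has delivered the token, and on the GPU it reads the correct value. On the CPU, kernel launches are asynchronous: the launch call returns as soon as the work is enqueued, and time.time_ns() records that enqueue time. Our instrumentation does not use torch.cuda.Event, so it never observes when the kernel actually runs on the GPU.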
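To see the gap between launch time and execution time directly, one option is to record a pair of CUDA events around the region in addition to the CPU timestamp. The following is a minimal sketch, not vLLM's own instrumentation: `timed_launch` is a hypothetical helper, and it falls back to the CPU timestamp alone when PyTorch or CUDA is not available.

```python
import time

try:
    import torch
    HAVE_CUDA = torch.cuda.is_available()
except ImportError:  # PyTorch not installed: only the CPU timestamp is taken
    torch = None
    HAVE_CUDA = False


def timed_launch(fn, *args):
    """Run fn(*args) and return (result, cpu_launch_ns, gpu_elapsed_ms).

    cpu_launch_ns is the wall-clock time at which the CPU began enqueueing work.
    gpu_elapsed_ms is the time between two events on the current CUDA stream,
    or None when CUDA is unavailable.
    """
    cpu_launch_ns = time.time_ns()
    if not HAVE_CUDA:
        return fn(*args), cpu_launch_ns, None

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()       # enqueued on the current stream ahead of fn's kernels
    result = fn(*args)   # kernels are enqueued; the CPU typically returns early
    end.record()         # enqueued behind fn's kernels
    end.synchronize()    # block the CPU until the GPU has reached `end`
    return result, cpu_launch_ns, start.elapsed_time(end)


if __name__ == "__main__" and HAVE_CUDA:
    x = torch.randn(2048, 2048, device="cuda")
    _, launched_at, gpu_ms = timed_launch(torch.matmul, x, x)
    print(f"CPU launch at {launched_at} ns, GPU spent {gpu_ms:.3f} ms")
```

Note that `end.synchronize()` stalls the CPU, which would itself perturb the pipeline overlap being measured. For real profiling it is probably better to record the events during the step and call `elapsed_time` only after the step has finished.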
Why vLLM’s Async Design Allows This
vLLM 0.20’s step_with_batch_queue (activated when pipeline_parallel_size > 1) allows the scheduler to issue a decode step before the prefill’s update_from_output completes. The engine core calls schedule() for decode while the prefill is still draining through downstream PP stages. This means:
- The CPU on stage 0 launches decode forward at T1
- The CPU on stage 1 launches prefill forward at T2 > T1
- But on the GPU, the NCCL receive for the sampled token is sequenced before the decode forward, preserving correctness
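The three points above can be reproduced in a toy model without any GPU. The sketch below is an illustration only, not vLLM code: the hypothetical `Stream` class (a FIFO queue drained by one worker thread) stands in for a CUDA stream, `threading.Event` objects stand in for the NCCL transfers, and a shared logical counter replaces the wall clock so that the result is deterministic.

```python
import itertools
import queue
import threading


class Stream:
    """Toy stand-in for a CUDA stream: a FIFO drained by one worker thread."""

    def __init__(self, stage, stamp):
        self.stage = stage
        self.stamp = stamp
        self.ops = queue.Queue()
        self.worker = threading.Thread(target=self._drain, daemon=True)
        self.worker.start()

    def launch(self, op, wait_for=None, signal=None):
        self.stamp(self.stage, op, "launch")    # what time.time_ns() would see
        self.ops.put((op, wait_for, signal))    # returns without running the op

    def _drain(self):
        while True:
            item = self.ops.get()
            if item is None:                    # sentinel pushed by join()
                return
            op, wait_for, signal = item
            if wait_for is not None:
                wait_for.wait()                 # models a blocking receive
            self.stamp(self.stage, op, "exec")  # what a GPU-side event would see
            if signal is not None:
                signal.set()                    # models data arriving at the peer

    def join(self):
        self.ops.put(None)
        self.worker.join()


def simulate():
    clock = itertools.count()   # logical clock shared by all threads
    lock = threading.Lock()
    log = {}                    # (stage, op, "launch" | "exec") -> tick

    def stamp(stage, op, kind):
        with lock:
            log[(stage, op, kind)] = next(clock)

    activations_sent = threading.Event()  # stage 0 -> stage 1 hidden states
    token_sent = threading.Event()        # stage 1 -> stage 0 sampled token
    s0, s1 = Stream(0, stamp), Stream(1, stamp)

    # Stage 0's CPU runs ahead: prefill, token receive, and decode are all
    # enqueued before stage 1's CPU has launched anything.
    s0.launch("prefill", signal=activations_sent)
    s0.launch("recv_token", wait_for=token_sent)
    s0.launch("decode")

    # Stage 1's CPU enqueues its work afterwards.
    s1.launch("prefill", wait_for=activations_sent)
    s1.launch("sample", signal=token_sent)

    s0.join()
    s1.join()
    return log


if __name__ == "__main__":
    log = simulate()
    for key in sorted(log, key=log.get):
        print(log[key], *key)
```

In this model the "launch" tick of stage 0's decode is always smaller than the "launch" tick of stage 1's prefill, while the "exec" tick of stage 0's decode is always larger than the "exec" tick of stage 1's sample. This is the same pattern as in the trace: launch order looks inverted, execution order does not.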
