Why does computation time remain consistent across chunks in chunked-prefill despite linearly growing attention complexity?

Question: Why does computation time remain consistent across chunks in chunked-prefill despite linearly growing attention complexity?

Context:
I’m profiling vLLM’s chunked-prefill optimization for long sequences (e.g., 8k tokens split into 4×2k chunks). Profiling shows near-identical computation time per chunk, but theoretically the per-chunk attention cost should grow roughly linearly with the chunk index, since each chunk attends over all KV produced so far (chunk 3 attends to chunks 0→3).
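
To make the expected growth concrete, here is a quick back-of-the-envelope sketch of per-chunk attention FLOPs for the 8k / 4×2k example above (the hidden size is an assumed value for illustration, not read from the profiled model):

```python
# Back-of-the-envelope attention FLOPs per chunk for an 8k-token prompt
# split into 4 chunks of 2048 tokens (the example from the question).
chunk_size = 2048
num_chunks = 4
hidden_size = 4096  # assumption, not taken from the profiled model

for i in range(num_chunks):
    kv_len = (i + 1) * chunk_size  # chunk i attends to chunks 0..i
    # QK^T and attn @ V each cost ~ q_len * kv_len * hidden_size multiply-adds
    flops = 2 * 2 * chunk_size * kv_len * hidden_size
    print(f"chunk {i}: kv_len={kv_len:5d}, attention GFLOPs ~ {flops / 1e9:.1f}")
```

The per-chunk attention FLOPs come out as 1×, 2×, 3×, and 4× the first chunk's, which is exactly the growth the flat profiler numbers appear to contradict.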


Below is my profiler screenshot for a 12k-token sequence processed via chunked-prefill with 6 chunks of 2048 tokens each:

Key Observations:

  1. Uniform chunk latency:

    • Chunks 0→3 each take ~X ms (profiler data), despite chunk 3 requiring attention over 8k tokens.
    • This contradicts the expected linear growth in per-chunk attention cost (and the overall $O(n^2)$ prefill cost).
  2. KV Cache behavior confirmed:

    • Prefix caching is disabled; each chunk computes full attention over all prior KV (a minimal config sketch reproducing this setup follows these observations).
    • PagedAttention ensures global attention coverage (chunk 3 attends to chunks 0→3).
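
To make the measured setup concrete, here is a minimal sketch of how this configuration maps onto vLLM's engine arguments (the model name is a placeholder and exact flag names may differ across vLLM versions; check your version's EngineArgs):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder, not the profiled model
    enable_chunked_prefill=True,               # split long prefills into chunks
    max_num_batched_tokens=2048,               # caps each prefill chunk at 2048 tokens
    enable_prefix_caching=False,               # matches observation 2 above
)

prompt = "hello " * 12000                      # roughly 12k tokens, as in the screenshot
outputs = llm.generate([prompt], SamplingParams(max_tokens=1))
```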

Suspected Optimizations:

I hypothesize vLLM mitigates latency growth via (a small microbenchmark sketch follows this list):

  • Hardware-level parallelism: GPU tensor cores optimize large matrix multiplications ($Q \times K^T$), masking compute growth.
  • Memory access optimization: PagedAttention’s contiguous block loading minimizes HBM latency for historical KV fetches.
  • Compute-I/O overlap: Async prefetching of historical KV blocks during current chunk computation.
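
One way to test the hardware-parallelism hypothesis in isolation is to time a single attention call with a fixed 2048-token query chunk against a growing KV history. The sketch below uses PyTorch's scaled_dot_product_attention rather than vLLM's PagedAttention kernel, and the head count, head dimension, and omission of the chunked-prefill mask are simplifications for illustration:

```python
import torch
import torch.nn.functional as F

device, dtype = "cuda", torch.float16
heads, head_dim, chunk = 32, 128, 2048  # assumed shapes, not the profiled model's

def time_attention(kv_len: int, iters: int = 50) -> float:
    """Average GPU time (ms) of one attention call for a 2048-token chunk."""
    q = torch.randn(1, heads, chunk, head_dim, device=device, dtype=dtype)
    k = torch.randn(1, heads, kv_len, head_dim, device=device, dtype=dtype)
    v = torch.randn_like(k)
    for _ in range(5):                      # warm-up
        F.scaled_dot_product_attention(q, k, v)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        F.scaled_dot_product_attention(q, k, v)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

for kv_len in (2048, 4096, 6144, 8192):     # history seen by chunks 0..3
    print(f"kv_len={kv_len}: {time_attention(kv_len):.3f} ms")
```

Comparing how these isolated timings scale with kv_len against the profiled per-chunk latencies helps separate kernel-level behavior from measurement artifacts.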

Core Question:

How does vLLM achieve constant-time chunk computation while maintaining mathematical equivalence to full-sequence attention? Specifically:

  • Does vLLM implicitly limit attention scope (e.g., windowed attention)? [My tests confirm full attention.]
  • Are there kernel-level optimizations (e.g., fused ops) that amortize linear complexity growth?
  • Is the consistency an artifact of profiler measurement (e.g., batched execution masking latency)?

It seems likely that the near-constant computation time per chunk in vLLM’s chunked-prefill, despite the theoretically increasing attention cost, is due to a combination of hardware and software optimizations. vLLM’s PagedAttention kernel is heavily optimized for memory access and parallelism, so the GPU can process the growing attention matrices by exploiting high memory bandwidth and ample parallel compute. When chunk sizes are large and the GPU is well utilized, this can mask the expected growth in per-chunk attention cost: even as the number of keys grows, the kernel’s memory access pattern and scheduling keep per-chunk latency nearly flat, as described in the vLLM Paged Attention docs and the Optimization and Tuning guide.
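
As a rough illustration of the bandwidth argument (all numbers below are assumptions, not taken from the profiled hardware or model), this estimates the raw time needed just to stream the full 8k-token KV history from HBM for one chunk's attention across all layers:

```python
kv_len = 8192            # history seen by the last chunk in the 8k example
hidden_size = 4096       # assumption
num_layers = 32          # assumption
bytes_per_elem = 2       # fp16 KV cache
hbm_bandwidth = 2.0e12   # ~2 TB/s, assumed HBM bandwidth of a modern GPU

kv_bytes = kv_len * 2 * hidden_size * num_layers * bytes_per_elem  # keys + values
print(f"KV history: {kv_bytes / 1e9:.2f} GB, "
      f"~{kv_bytes / hbm_bandwidth * 1e3:.2f} ms to stream at "
      f"{hbm_bandwidth / 1e12:.0f} TB/s")
```

Under these assumed numbers, even the longest history costs only a couple of milliseconds of raw KV reads, which is the scale at which the kernel's access-pattern optimizations operate.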

My understanding is that this effect is not due to implicit windowing or a limited attention scope: full attention is still computed for each chunk, as you confirmed. Instead, the uniform chunk latency comes from efficient batching, memory layout, and possibly overlapping compute with I/O, which let the GPU hide the increased arithmetic cost. Profiler artifacts (e.g., chunked/batched execution) may also contribute to the appearance of constant time, but the core reason is the highly optimized attention kernel and memory management in vLLM’s implementation.
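
To check the measurement-artifact possibility directly, one option is to capture a kernel-level trace and compare the attention kernels of each prefill chunk. The sketch below assumes a recent vLLM build that exposes start_profile()/stop_profile() when VLLM_TORCH_PROFILER_DIR is set; the model name is again a placeholder:

```python
import os
# Must be set before the engine is constructed so the workers pick it up.
os.environ["VLLM_TORCH_PROFILER_DIR"] = "./vllm_profile"

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    enable_chunked_prefill=True,
    max_num_batched_tokens=2048,
)

llm.start_profile()
llm.generate(["hello " * 12000], SamplingParams(max_tokens=1))
llm.stop_profile()
# Open the exported trace (e.g., in Perfetto or chrome://tracing) and compare the
# GPU-side duration of the attention kernels across the six prefill chunks,
# rather than relying on host-level per-step timings.
```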

Happy to go into a more detailed breakdown of the kernel-level optimizations or memory access patterns if that would help.
