Question: Why does computation time remain consistent across chunks in chunked-prefill despite linearly growing attention complexity?
Context:
I’m profiling vLLM’s chunked-prefill optimization for long sequences (e.g., 8k tokens split into 4×2k chunks). Profiling shows near-identical computation time per chunk, yet in theory the attention cost of later chunks should grow roughly linearly with chunk index, since chunk 3 must attend to the cached KV of chunks 0 through 2 in addition to its own tokens.
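For reference, here is my back-of-envelope estimate of per-chunk attention FLOPs, which is why I expected the time to grow (a rough sketch under my own assumptions: a 2k chunk size, a placeholder hidden size, and the usual 2·m·n·k matmul FLOP convention):

```python
# Rough per-chunk attention FLOPs for chunked prefill (my own estimate, not
# vLLM code). Assumptions: chunk size C, total hidden size d across heads, and
# the 2*m*n*k FLOP convention for an (m x k) @ (k x n) matmul. QK^T and the
# attention-weights @ V product each cost ~2 * C * kv_len * d, so the per-chunk
# cost grows linearly with the KV history visible to that chunk.
C = 2048          # tokens per prefill chunk
d = 4096          # hidden size (placeholder, model-dependent)
num_chunks = 4    # e.g. 8k tokens split into 4 x 2k chunks

for i in range(num_chunks):
    kv_len = (i + 1) * C                  # KV history visible to chunk i
    flops = 2 * 2 * C * kv_len * d        # QK^T plus attention @ V
    print(f"chunk {i}: kv_len={kv_len}, attention GFLOPs ~ {flops / 1e9:.1f}")
```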
It seems likely that vLLM’s chunked prefill shows near-constant per-chunk computation time, even though the attention work per chunk does grow, because of a combination of hardware and kernel effects. vLLM’s PagedAttention kernel is heavily optimized for memory access and parallelism, so larger attention workloads mostly translate into higher memory-bandwidth and compute utilization rather than longer kernels. While the GPU is not fully saturated by the earlier chunks, that slack absorbs the growth (linear per chunk, $O(n^2)$ over the whole prefill), so later chunks with more KV history do not increase wall-clock time proportionally. Additionally, chunked prefill can batch prefill and decode requests together, which further improves GPU utilization and smooths per-chunk latency, as described in Optimization and Tuning and Chunked Prefill Design.
My understanding is that vLLM does not implicitly limit attention scope (unless the model itself uses sliding-window attention), so mathematical equivalence to full-sequence attention is preserved. The uniform chunk latency you observe is not an artifact of reduced attention scope; it comes from efficient kernel design, the memory layout of PagedAttention’s block-based KV cache, and the GPU’s ability to overlap compute with memory traffic. Profiler measurements can also flatten differences because of batching and kernel-launch overheads, but the main effect is hardware and kernel efficiency, not a change in attention semantics. Would you like a step-by-step breakdown of the kernel and memory optimizations involved?
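In the meantime, if you want to sanity-check the equivalence claim yourself, here is a minimal sketch using plain PyTorch scaled_dot_product_attention with an explicit mask (not vLLM’s PagedAttention kernel; the shapes are small placeholders chosen for a quick CPU run):

```python
import torch
import torch.nn.functional as F

# Equivalence check: causal attention computed chunk by chunk over a growing KV
# cache vs. one full-sequence pass. Plain PyTorch, not vLLM's kernel; sizes are
# arbitrary placeholders.
torch.manual_seed(0)
seq_len, chunk, heads, head_dim = 2048, 512, 4, 64
q = torch.randn(1, heads, seq_len, head_dim)
k = torch.randn(1, heads, seq_len, head_dim)
v = torch.randn(1, heads, seq_len, head_dim)

# Reference: a single full causal attention pass over the whole sequence.
full = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Chunked prefill: each chunk attends to all KV seen so far, causal within itself.
outs = []
for start in range(0, seq_len, chunk):
    end = start + chunk
    q_c = q[:, :, start:end]              # queries for this chunk
    k_c = k[:, :, :end]                   # full KV history up to the chunk end
    v_c = v[:, :, :end]
    # A query at absolute position p may attend to key positions 0..p.
    q_pos = torch.arange(start, end).unsqueeze(-1)
    k_pos = torch.arange(end).unsqueeze(0)
    mask = k_pos <= q_pos                 # boolean mask, True = may attend
    outs.append(F.scaled_dot_product_attention(q_c, k_c, v_c, attn_mask=mask))

chunked = torch.cat(outs, dim=2)
# Should agree up to floating-point noise (~1e-6 in fp32).
print("max abs diff:", (chunked - full).abs().max().item())
```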
My test used only one 12k-token request without any decode requests during prefill, so the latency uniformity can’t be explained by decode-prefill interleaving.
If GPU undersaturation masks the growth at 2k chunks, will increasing the chunk size to 4k or 8k make per-chunk attention time scale with the amount of KV history, as theory predicts? I want to verify whether this matches the theoretical expectations in vLLM’s design.
It seems likely that if your GPU is not fully saturated with 2k-token chunks, increasing the chunk size to 4k or 8k should eventually expose the expected scaling: once the GPU is compute-bound, per-chunk attention time grows roughly linearly with the KV history that chunk attends to, and the total prefill cost grows quadratically with sequence length ($O(n^2)$), just as full attention predicts. This is consistent with vLLM’s design: PagedAttention does not limit attention scope, and chunked prefill only splits the prefill into scheduler-sized batches; it does not reduce attention complexity. A rough standalone timing check is sketched below. See Optimization and Tuning and PagedAttention kernel design.
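To check the scaling outside of vLLM’s scheduler, here is a rough timing sketch using plain PyTorch scaled_dot_product_attention on a CUDA device as a stand-in for vLLM’s kernel (the head count, head dimension, and size grid are placeholder assumptions, and absolute numbers will depend on your GPU):

```python
import torch
import torch.nn.functional as F

# Rough scaling check: time one prefill chunk's attention while the KV history
# grows, for several chunk sizes. The qualitative trend is the point: roughly
# flat while the GPU is undersaturated, then ~linear in kv_len once it is not.
assert torch.cuda.is_available(), "run this sketch on a CUDA device"
device, dtype = "cuda", torch.float16
heads, head_dim = 32, 128                       # placeholder model dimensions

def time_chunk(chunk_len, kv_len, iters=20):
    """Average milliseconds for one chunk's attention over kv_len cached tokens."""
    q = torch.randn(1, heads, chunk_len, head_dim, device=device, dtype=dtype)
    k = torch.randn(1, heads, kv_len, head_dim, device=device, dtype=dtype)
    v = torch.randn_like(k)
    for _ in range(3):                          # warm-up
        F.scaled_dot_product_attention(q, k, v)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        F.scaled_dot_product_attention(q, k, v)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# Causal masking inside the newest chunk is omitted here; it only affects the
# chunk_len x chunk_len diagonal block and does not change the trend.
for chunk_len in (2048, 4096, 8192):
    for n_chunks in range(1, 5):                # how many chunks of KV exist so far
        kv_len = n_chunks * chunk_len
        print(f"chunk={chunk_len:5d} kv_len={kv_len:6d} -> "
              f"{time_chunk(chunk_len, kv_len):.3f} ms")
```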
My understanding is that, as chunk size increases and the GPU becomes saturated, per-chunk latency should start tracking the compute cost, so the measurements approach the theoretical $O(n^2)$ total cost of full-sequence attention. Would you like a more detailed explanation of how PagedAttention and chunked-prefill interact with GPU utilization and scaling?
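In the meantime, if you want to sweep the chunk size inside vLLM itself, something along these lines should work; enable_chunked_prefill, max_num_batched_tokens, and max_model_len are argument names from recent vLLM releases and may differ in yours, and the model name is just a placeholder:

```python
import sys
import time

from vllm import LLM, SamplingParams

# Sketch of a chunk-size sweep inside vLLM. Chunked prefill processes at most
# max_num_batched_tokens prompt tokens per scheduler step, so this setting acts
# as the effective chunk size. Run the script once per setting (e.g.
# `python sweep.py 2048`, then 4096, then 8192) to avoid re-creating engines in
# one process. Argument names match recent vLLM releases but may differ in
# other versions; the model is a placeholder.
chunk_tokens = int(sys.argv[1]) if len(sys.argv) > 1 else 2048

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",   # placeholder model
    enable_chunked_prefill=True,
    max_num_batched_tokens=chunk_tokens,
    max_model_len=16384,                        # room for a ~12k-token prompt
)

prompt = "word " * 12000                        # roughly a 12k-token prompt

t0 = time.perf_counter()
llm.generate([prompt], SamplingParams(max_tokens=1))   # prefill-dominated request
print(f"chunk={chunk_tokens}: {time.perf_counter() - t0:.2f} s end-to-end")
```

End-to-end time here mostly reflects the total prefill cost; for per-chunk latencies you would still attach your profiler to the individual scheduler steps, as you did for the 2k case.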