Why does computation time remain consistent across chunks in chunked-prefill despite linearly growing attention complexity?

Question: Why does computation time remain consistent across chunks in chunked-prefill despite linearly growing attention complexity?

Context:
I’m profiling vLLM’s chunked-prefill optimization for long sequences (e.g., 8k tokens split into 4×2k chunks). Profiling shows near-identical computation time per chunk, but theoretically the per-chunk attention cost should grow roughly linearly with the chunk index, since each chunk attends over all KV produced so far (chunk 3 attends to chunks 0→3).
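
To make the expected growth concrete, here is a quick back-of-the-envelope sketch of per-chunk attention FLOPs for the 8k / 4×2k example above (the hidden size is an assumed value for illustration, not read from the profiled model):

```python
# Back-of-the-envelope attention FLOPs per chunk for an 8k-token prompt
# split into 4 chunks of 2048 tokens (the example from the question).
chunk_size = 2048
num_chunks = 4
hidden_size = 4096  # assumption, not taken from the profiled model

for i in range(num_chunks):
    kv_len = (i + 1) * chunk_size  # chunk i attends to chunks 0..i
    # QK^T and attn @ V each cost ~ q_len * kv_len * hidden_size multiply-adds
    flops = 2 * 2 * chunk_size * kv_len * hidden_size
    print(f"chunk {i}: kv_len={kv_len:5d}, attention GFLOPs ~ {flops / 1e9:.1f}")
```

The per-chunk attention FLOPs come out as 1×, 2×, 3×, and 4× the first chunk's, which is exactly the growth the flat profiler numbers appear to contradict.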


Below is my profiler screenshot for a 12k-token sequence processed via chunked-prefill with 6 chunks of 2048 tokens each:

Key Observations:

  1. Uniform chunk latency:

    • Chunks 0→3 each take ~X ms (profiler data), despite chunk 3 requiring attention over 8k tokens.
    • This contradicts the expected linear growth in per-chunk attention cost (and the overall $O(n^2)$ prefill cost).
  2. KV Cache behavior confirmed:

    • Prefix caching is disabled; each chunk computes full attention over all prior KV (a minimal config sketch reproducing this setup follows these observations).
    • PagedAttention ensures global attention coverage (chunk 3 attends to chunks 0→3).
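
To make the measured setup concrete, here is a minimal sketch of how this configuration maps onto vLLM's engine arguments (the model name is a placeholder and exact flag names may differ across vLLM versions; check your version's EngineArgs):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder, not the profiled model
    enable_chunked_prefill=True,               # split long prefills into chunks
    max_num_batched_tokens=2048,               # caps each prefill chunk at 2048 tokens
    enable_prefix_caching=False,               # matches observation 2 above
)

prompt = "hello " * 12000                      # roughly 12k tokens, as in the screenshot
outputs = llm.generate([prompt], SamplingParams(max_tokens=1))
```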

Suspected Optimizations:

I hypothesize vLLM mitigates latency growth via (a small microbenchmark sketch follows this list):

  • Hardware-level parallelism: GPU tensor cores optimize large matrix multiplications ($Q \times K^T$), masking compute growth.
  • Memory access optimization: PagedAttention’s contiguous block loading minimizes HBM latency for historical KV fetches.
  • Compute-I/O overlap: Async prefetching of historical KV blocks during current chunk computation.
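
One way to test the hardware-parallelism hypothesis in isolation is to time a single attention call with a fixed 2048-token query chunk against a growing KV history. The sketch below uses PyTorch's scaled_dot_product_attention rather than vLLM's PagedAttention kernel, and the head count, head dimension, and omission of the chunked-prefill mask are simplifications for illustration:

```python
import torch
import torch.nn.functional as F

device, dtype = "cuda", torch.float16
heads, head_dim, chunk = 32, 128, 2048  # assumed shapes, not the profiled model's

def time_attention(kv_len: int, iters: int = 50) -> float:
    """Average GPU time (ms) of one attention call for a 2048-token chunk."""
    q = torch.randn(1, heads, chunk, head_dim, device=device, dtype=dtype)
    k = torch.randn(1, heads, kv_len, head_dim, device=device, dtype=dtype)
    v = torch.randn_like(k)
    for _ in range(5):                      # warm-up
        F.scaled_dot_product_attention(q, k, v)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        F.scaled_dot_product_attention(q, k, v)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

for kv_len in (2048, 4096, 6144, 8192):     # history seen by chunks 0..3
    print(f"kv_len={kv_len}: {time_attention(kv_len):.3f} ms")
```

Comparing how these isolated timings scale with kv_len against the profiled per-chunk latencies helps separate kernel-level behavior from measurement artifacts.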

Core Question:

How does vLLM achieve constant-time chunk computation while maintaining mathematical equivalence to full-sequence attention? Specifically:

  • Does vLLM implicitly limit attention scope (e.g., windowed attention)? [My tests confirm full attention.]
  • Are there kernel-level optimizations (e.g., fused ops) that amortize linear complexity growth?
  • Is the consistency an artifact of profiler measurement (e.g., batched execution masking latency)?

It seems likely that the near-constant computation time per chunk in vLLM’s chunked-prefill, despite the theoretically increasing attention cost, is due to a combination of hardware and software optimizations. vLLM’s PagedAttention kernel is heavily optimized for memory access and parallelism, so the GPU can process the growing attention matrices by exploiting high memory bandwidth and ample parallel compute. When chunk sizes are large and the GPU is well utilized, this can mask the expected growth in per-chunk attention cost: even as the number of keys grows, the kernel’s memory access pattern and scheduling keep per-chunk latency nearly flat, as described in the vLLM Paged Attention docs and the Optimization and Tuning guide.
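
As a rough illustration of the bandwidth argument (all numbers below are assumptions, not taken from the profiled hardware or model), this estimates the raw time needed just to stream the full 8k-token KV history from HBM for one chunk's attention across all layers:

```python
kv_len = 8192            # history seen by the last chunk in the 8k example
hidden_size = 4096       # assumption
num_layers = 32          # assumption
bytes_per_elem = 2       # fp16 KV cache
hbm_bandwidth = 2.0e12   # ~2 TB/s, assumed HBM bandwidth of a modern GPU

kv_bytes = kv_len * 2 * hidden_size * num_layers * bytes_per_elem  # keys + values
print(f"KV history: {kv_bytes / 1e9:.2f} GB, "
      f"~{kv_bytes / hbm_bandwidth * 1e3:.2f} ms to stream at "
      f"{hbm_bandwidth / 1e12:.0f} TB/s")
```

Under these assumed numbers, even the longest history costs only a couple of milliseconds of raw KV reads, which is the scale at which the kernel's access-pattern optimizations operate.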

My understanding is that this effect is not due to implicit windowing or a limited attention scope: full attention is still computed for each chunk, as you confirmed. Instead, the uniform chunk latency comes from efficient batching, memory layout, and possibly overlapping compute with I/O, which let the GPU hide the increased arithmetic cost. Profiler artifacts (e.g., chunked/batched execution) may also contribute to the appearance of constant time, but the core reason is the highly optimized attention kernel and memory management in vLLM’s implementation.
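
To check the measurement-artifact possibility directly, one option is to capture a kernel-level trace and compare the attention kernels of each prefill chunk. The sketch below assumes a recent vLLM build that exposes start_profile()/stop_profile() when VLLM_TORCH_PROFILER_DIR is set; the model name is again a placeholder:

```python
import os
# Must be set before the engine is constructed so the workers pick it up.
os.environ["VLLM_TORCH_PROFILER_DIR"] = "./vllm_profile"

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    enable_chunked_prefill=True,
    max_num_batched_tokens=2048,
)

llm.start_profile()
llm.generate(["hello " * 12000], SamplingParams(max_tokens=1))
llm.stop_profile()
# Open the exported trace (e.g., in Perfetto or chrome://tracing) and compare the
# GPU-side duration of the attention kernels across the six prefill chunks,
# rather than relying on host-level per-step timings.
```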

Happy to go into a more detailed breakdown of the kernel-level optimizations or memory access patterns if that would help.
