Why does prefill use normal attention, while decode uses weight absorption in MLA?

Prefill uses "normal" (compute-friendly) multi-head attention because the query length is comparable to the key/value length (Sq/Skv is close to 1), so the operation is compute-bound and the standard MHA formulation minimizes FLOPs. Decode uses weight absorption (the data-movement-friendly, MQA-like formulation) because each step processes a single query token against the whole cache (Sq/Skv is tiny), so the operation is memory-bandwidth-bound; absorbing the up-projection weights lets attention run directly on the compressed latent KV cache instead of re-materializing the full per-head keys and values at every step. This distinction follows the DeepseekV2 paper and FlashInfer's MLA design, as described in the vLLM MLA backend documentation and code (see the MLA Common Components and the DeepseekV2 paper reference).
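A minimal NumPy sketch can show the algebra behind weight absorption. The dimensions and names below (`W_UK`, `c_kv`, `d_latent`) are illustrative assumptions, not DeepseekV2's real sizes or vLLM's actual variable names; the point is only that folding the key up-projection into the query gives the same attention scores without materializing full keys:

```python
import numpy as np

# Hypothetical toy dimensions (illustrative only, not DeepseekV2's real sizes)
d_head, d_latent, skv = 16, 8, 32
rng = np.random.default_rng(0)

q = rng.standard_normal((1, d_head))            # decode: a single query token
c_kv = rng.standard_normal((skv, d_latent))     # compressed (latent) KV cache
W_UK = rng.standard_normal((d_latent, d_head))  # key up-projection weight

# "Normal" attention: up-project the whole cache into full keys each step,
# then score the query against them -- extra data movement per decode step.
k = c_kv @ W_UK                  # (skv, d_head)
scores_normal = q @ k.T          # (1, skv)

# Weight absorption: fold W_UK into the query once, then score directly
# against the compressed cache -- no per-step key materialization.
q_absorbed = q @ W_UK.T                # (1, d_latent)
scores_absorbed = q_absorbed @ c_kv.T  # (1, skv)

# Same scores either way: q (c_kv W_UK)^T == (q W_UK^T) c_kv^T
assert np.allclose(scores_normal, scores_absorbed)
```

During decode this trades one small query-side matmul per step for skipping the up-projection of the entire cache, which is why it wins when the workload is memory-bound.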

Would you like a step-by-step breakdown of the compute and memory tradeoffs?
