Why does prefill use normal attention, while decode uses weight absorption in MLA?

Prefill uses "normal" (compute-friendly) multi-head attention because the query length is comparable to the key/value length (Sq/Skv is close to 1), so the operation is compute-bound and the standard MHA formulation minimizes FLOPs. Decode uses weight absorption (the data-movement-friendly, MQA-like formulation) because each step processes a single query token against the whole cache (Sq/Skv is tiny), so the operation is memory-bandwidth-bound; absorbing the up-projection weights lets attention run directly on the compressed latent KV cache instead of re-materializing the full per-head keys and values at every step. This distinction follows the DeepseekV2 paper and FlashInfer's MLA design, as described in the vLLM MLA backend documentation and code (see the MLA Common Components and the DeepseekV2 paper reference).
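A minimal NumPy sketch can show the algebra behind weight absorption. The dimensions and names below (`W_UK`, `c_kv`, `d_latent`) are illustrative assumptions, not DeepseekV2's real sizes or vLLM's actual variable names; the point is only that folding the key up-projection into the query gives the same attention scores without materializing full keys:

```python
import numpy as np

# Hypothetical toy dimensions (illustrative only, not DeepseekV2's real sizes)
d_head, d_latent, skv = 16, 8, 32
rng = np.random.default_rng(0)

q = rng.standard_normal((1, d_head))            # decode: a single query token
c_kv = rng.standard_normal((skv, d_latent))     # compressed (latent) KV cache
W_UK = rng.standard_normal((d_latent, d_head))  # key up-projection weight

# "Normal" attention: up-project the whole cache into full keys each step,
# then score the query against them -- extra data movement per decode step.
k = c_kv @ W_UK                  # (skv, d_head)
scores_normal = q @ k.T          # (1, skv)

# Weight absorption: fold W_UK into the query once, then score directly
# against the compressed cache -- no per-step key materialization.
q_absorbed = q @ W_UK.T                # (1, d_latent)
scores_absorbed = q_absorbed @ c_kv.T  # (1, skv)

# Same scores either way: q (c_kv W_UK)^T == (q W_UK^T) c_kv^T
assert np.allclose(scores_normal, scores_absorbed)
```

During decode this trades one small query-side matmul per step for skipping the up-projection of the entire cache, which is why it wins when the workload is memory-bound.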

Would you like a step-by-step breakdown of the compute and memory tradeoffs?
