How to set different attention backends for the prefill and decode stages?

Just like `--decode-attention-backend flashinfer` and `--prefill-attention-backend fa3`.

vLLM does not provide separate CLI options such as `--decode-attention-backend` and `--prefill-attention-backend` to set the attention backend independently for the decode and prefill phases. Instead, you can set a single backend for all phases using the `--attention-backend` argument or the `VLLM_ATTENTION_BACKEND` environment variable, and optionally specify the FlashAttention version with `--attention-config.flash_attn_version` (e.g., `2` or `3`). As of now, there is no documented way to assign different backends to the prefill and decode phases via the CLI or configuration.
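As an illustration, here is a minimal sketch of the global selection described above, assuming the `VLLM_ATTENTION_BACKEND` environment variable with `FLASHINFER` as an example value. Since vLLM resolves the backend when the engine initializes, the variable must be set before vLLM is imported:

```python
import os

# Pick one attention backend for ALL phases (prefill and decode alike).
# "FLASHINFER" is an assumed example value; there is no per-phase variant.
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"

# Import vLLM only after the variable is set, so the engine sees it.
# from vllm import LLM
# llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
```

The same effect can be had by exporting the variable in the shell before running `vllm serve`.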

For more granular control, you can use the `AttentionConfig` class to configure the backend and version, but this still applies globally, not per phase. See the `AttentionConfig` documentation and the Quickstart guide for details. Would you like more detail on configuring attention backends or on the backend selection logic?

Is there any feature or PR to independently set the attention backend for decode and prefill phases?