I'd like to confirm: does this "support" and "non-support" only refer to whether FA2 can use the configuration compilation_config = {"cudagraph_mode": "FULL"} (which in practice falls back to FULL_AND_PIECEWISE), or does it mean that FA2 is now fully capable of running full CUDA graph operations?
FA2 now supports full CUDA graph capture in vLLM V1, but with an important caveat: its support is not as universal as FA3's. When you set compilation_config = {"cudagraph_mode": "FULL"} for FA2, vLLM will typically fall back to FULL_AND_PIECEWISE mode, because FA2 only supports full CUDA graphs for uniform batches (i.e., all requests have the same query length, such as pure decode or speculative decode), not for mixed prefill-decode batches. So FA2 is not "fully" capable in the same way as FA3, but it does support full CUDA graphs for certain batch types, and vLLM automatically selects the best mode based on backend capability and batch composition. This is documented in the vLLM CUDA Graphs design doc and confirmed in the PR #20059 discussion.
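The fallback decision can be sketched in Python. The enum name AttentionCGSupport mirrors vLLM's backend-capability enum, but the logic below is a simplified illustration of the mode resolution, not the actual implementation:

```python
from enum import Enum

class AttentionCGSupport(Enum):
    # What the attention backend can capture in a full CUDA graph.
    # Names mirror vLLM's enum; values here are illustrative.
    NEVER = 0           # no full-graph support at all
    UNIFORM_BATCH = 1   # uniform query lengths only (FA2-like)
    ALWAYS = 2          # any batch composition (FA3-like)

def resolve_cudagraph_mode(requested: str, support: AttentionCGSupport) -> str:
    """Resolve the user-requested cudagraph_mode against backend capability."""
    if requested != "FULL":
        return requested
    if support is AttentionCGSupport.ALWAYS:
        # FA3-like backend: the whole model, including attention,
        # can be captured as one full cudagraph per batch shape.
        return "FULL"
    if support is AttentionCGSupport.UNIFORM_BATCH:
        # FA2-like backend: full graphs only for uniform decode batches;
        # mixed prefill-decode batches get piecewise graphs instead.
        return "FULL_AND_PIECEWISE"
    return "PIECEWISE"
```

For example, `resolve_cudagraph_mode("FULL", AttentionCGSupport.UNIFORM_BATCH)` yields `"FULL_AND_PIECEWISE"`, matching the FA2 fallback described above.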
Would you like more detail on how vLLM dispatches between modes or how FA2’s limitations affect performance?
Yes, that’s correct. FA2 cannot be captured as a full CUDA graph during the prefill (mixed prefill-decode) phase because it uses special dynamic scheduling and optimizations for GQA/MQA in prefill, which are not compatible with CUDA graph capture for arbitrary batch shapes. As a result, FA2 only supports full CUDA graph for uniform decode batches (e.g., max_query_len=1 or speculative decode), but not for mixed prefill-decode batches. This limitation is described in the FlashAttentionMetadataBuilder documentation and the vLLM CUDA Graphs design doc.
Would you like a deeper technical explanation of the GQA scheduling issue?
@RunLLM In https://github.com/vllm-project/vllm/pull/20059, it states that “FA2 runs both mix prefill-decode and pure decode batches at full cudagraph, but on two separate sets of cudagraphs.” Does this mean that the statement “fa2 cannot capture a full graph” refers to the inability to capture the entire prefill+decode phase as a single full cudagraph, but if the two phases are separated, each can be captured as its own full cudagraph?
Additionally,
(1) Is the dynamic scheduling and optimization for GQA/MQA in the prefill phase referring to the two branches separated by num_splits in the FA2 code?
(2) Why does FA3 not have this problem?
Correct: the statement “FA2 cannot capture a full graph” means FA2 cannot capture the entire prefill+decode phase as a single unified full cudagraph. However, if you separate the phases, FA2 can capture full cudagraphs for each phase individually—one for mixed prefill-decode and one for pure decode—using two separate sets of cudagraphs, as described in PR #20059.
(1) Yes, the dynamic scheduling and GQA/MQA optimization in FA2’s prefill phase refers to the two branches separated by num_splits in the FA2 code, which complicates unified cudagraph capture. (2) FA3 does not have this problem because it uses a unified routine for both prefill and decode, making it fully compatible with single full cudagraph capture for all batch types, as explained in the vLLM CUDA Graphs design doc and FlashAttentionMetadataBuilder docs.
Would you like more technical detail on the kernel differences between FA2 and FA3?
GQA (Grouped Query Attention) and MQA (Multi-Query Attention) optimizations in FA2 (FlashAttention v2) are designed to reduce memory bandwidth and improve efficiency by sharing key/value projections across multiple query heads. In FA2, these optimizations allow the attention kernel to use fewer key/value heads than query heads, which is especially beneficial for large models. However, the implementation uses dynamic scheduling and special kernel branches (e.g., based on num_splits) to handle GQA/MQA, particularly in the prefill phase, making it harder to capture as a single full CUDA graph. This is discussed in vLLM issue #1880 and PR #20059.
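The head-sharing idea can be illustrated with a small NumPy shape check. The sizes below are hypothetical and this is not FA2's kernel, just the indexing relationship between query heads and shared KV heads:

```python
import numpy as np

# Illustrative GQA shapes: 32 query heads share 8 KV heads,
# so each KV head serves a group of 4 query heads.
batch, seqlen, head_dim = 2, 16, 64
num_q_heads, num_kv_heads = 32, 8
ngroups = num_q_heads // num_kv_heads  # 4

rng = np.random.default_rng(0)
q = rng.standard_normal((batch, seqlen, num_q_heads, head_dim))
k = rng.standard_normal((batch, seqlen, num_kv_heads, head_dim))

# Map query head h to KV head h // ngroups instead of
# materializing num_q_heads copies of K/V in memory.
kv_index = np.arange(num_q_heads) // ngroups
k_per_q_head = k[:, :, kv_index, :]  # (batch, seqlen, num_q_heads, head_dim)

assert k_per_q_head.shape == (batch, seqlen, num_q_heads, head_dim)
```

The kernel exploits this mapping to read each KV head once per group, which is where the bandwidth saving comes from.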
Would you like a step-by-step breakdown of how these optimizations work in the kernel?
Is the default cudagraph_mode "FULL_AND_PIECEWISE"?
When I ran a simple test on Hopper (using FA3), I noticed that unless I explicitly set cudagraph_mode to FULL, it defaults to FULL_AND_PIECEWISE. Is this expected behavior?
We have discussed that "the statement 'FA2 cannot capture a full graph' means FA2 cannot capture the entire prefill+decode phase as a single unified full cudagraph. However, if you separate the phases, FA2 can capture full cudagraphs for each phase individually—one for mixed prefill-decode and one for pure decode—using two separate sets of cudagraphs."
I interpret this as meaning there are two full graphs: one for decode (full graph A) and another for prefill-decode (full graph B).
However, from my actual testing, I observed that FULL_AND_PIECEWISE creates graph A + standalone FlashAttention and a separate full graph B.
So which scenario is actually correct in practice?
Yes, the default cudagraph_mode is FULL_AND_PIECEWISE in vLLM v1, unless you explicitly set it to FULL or another mode—this is confirmed in the vLLM config docs and PR #25444. So your observation on Hopper with FA3 is expected: unless you set cudagraph_mode to FULL, it defaults to FULL_AND_PIECEWISE.
Regarding FA2: in FULL_AND_PIECEWISE mode, vLLM creates a full cudagraph for uniform decode batches (graph A), and for mixed prefill-decode batches, it uses piecewise cudagraphs (not a full graph B). Only in FULL mode (if supported by the backend) would vLLM attempt to create two full cudagraphs: one for decode and one for mixed prefill-decode. However, for FA2, FULL mode typically falls back to FULL_AND_PIECEWISE due to backend limitations, so in practice, you get graph A (full decode) and piecewise graphs for prefill/mixed, not two full graphs. This is described in the CUDA Graphs design doc and PR #20059.
Would you like more detail on how the dispatcher chooses which graphs to use at runtime?
Okay, regarding FA2, here’s the code snippet I found related to GQA optimization:
```cpp
// Faster to transpose q from (b, 1, (nheads_kv * ngroups), d) to (b, ngroups, nheads_kv, d) in this case
// H/t Daniel Haziza
const int seqlenq_ngroups_swapped = max_seqlen_q == 1 && num_heads > num_heads_k
    && window_size_left < 0 && window_size_right < 0 && p_dropout == 0.f
    && head_size_og % 8 == 0 && !alibi_slopes_.has_value();
const int ngroups = num_heads / num_heads_k;
if (seqlenq_ngroups_swapped) {
    q = q.reshape({batch_size, num_heads_k, ngroups, head_size_og})
         .transpose(1, 2)
         .reshape({batch_size * ngroups, num_heads_k, head_size_og});
    max_seqlen_q = ngroups;
    num_heads = num_heads_k;
    cu_seqlens_q_d = nullptr;
}
```
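The reshape/transpose trick in that snippet can be checked with a small NumPy shape experiment (illustrative only; the real kernel operates on CUDA tensors, and the sizes below are made up):

```python
import numpy as np

# Decode-style varlen input: one query token per sequence, so after
# flattening, q is (batch_size, num_heads, head_size).
batch_size, num_heads, num_heads_k, head_size = 4, 32, 8, 64
ngroups = num_heads // num_heads_k  # 4

q = np.random.default_rng(0).standard_normal((batch_size, num_heads, head_size))

# Mirror the C++ snippet: fold the query-head groups into the sequence
# dimension, so the kernel sees seqlen_q == ngroups with nheads_k heads.
q_swapped = (
    q.reshape(batch_size, num_heads_k, ngroups, head_size)
     .transpose(0, 2, 1, 3)  # (b, ngroups, nheads_k, d)
     .reshape(batch_size * ngroups, num_heads_k, head_size)
)
assert q_swapped.shape == (batch_size * ngroups, num_heads_k, head_size)
```

Note the swap only triggers when max_seqlen_q == 1 (among the other conditions in the snippet), which is why it is a decode-only optimization.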
My understanding is that in mixed prefill/decode scenarios, where max_seqlen_q != 1, the code does not take the GQA-optimized routine. If that’s the case, why is FlashAttention still separated out (as PIECEWISE graphs) and unable to be included in a full graph? Thanks.
Your understanding is correct: in mixed prefill/decode scenarios (where max_seqlen_q != 1), FA2 does not use the GQA-optimized routine shown in your snippet. However, FlashAttention is still separated out as PIECEWISE graphs and not included in a full cudagraph because, in these mixed scenarios, the attention kernel’s execution path and input shapes can vary dynamically at runtime (e.g., due to different query lengths and batch compositions). This dynamic behavior prevents safe and efficient CUDA graph capture for the entire model forward, as CUDA graphs require static execution paths and shapes. As a result, vLLM falls back to PIECEWISE cudagraphs for attention in mixed batches, even if the GQA optimization is not triggered, to ensure correctness and compatibility. This is discussed in the vLLM CUDA Graphs design doc and PR #20059.
Would you like more detail on how vLLM determines when to use full vs. piecewise cudagraphs?
@RunLLM Then what about FA3? If cudagraph_mode is not set to FULL explicitly, FA3 also defaults to FULL_AND_PIECEWISE, and when I profiled it, I saw that FlashAttention is still separated from the graph. So can I understand it this way: as long as the mode is FULL_AND_PIECEWISE, FlashAttention will be split out by default, even though in this case FlashAttention has only one routine (FA3 supports FULL, so it must have just one routine, right?)
Yes, your understanding is correct: when cudagraph_mode is set to FULL_AND_PIECEWISE (the default), FlashAttention (even FA3) will be split out as a separate operation in piecewise graphs for mixed prefill-decode batches. FA3 does support a unified routine and can run in FULL mode (where the entire model, including FlashAttention, is captured as a single full cudagraph), but unless you explicitly set cudagraph_mode to FULL, vLLM will default to FULL_AND_PIECEWISE and split out FlashAttention for non-uniform batches. This behavior is described in the vLLM CUDA Graphs design doc and confirmed in PR #20059.
Would you like more detail on the performance or trade-offs between these modes?