Questions about CUDA graph compatibility with attention backends in vLLM

Hi everyone,
I’ve learned a lot from vLLM, and I would like to thank all the developers for their hard work!

I came across a PR ([Core] Support full cuda graph in v1 by chanh · Pull Request #16072 · vllm-project/vllm · GitHub) that mentions: “Full CUDA graph only works with V1 + FA3, since FA2 currently is not graphable due to an optimization for GQA.”

  1. What is the reason FA2 is incompatible with CUDA graph capture? GQA by itself doesn’t seem to require anything particularly special.

  2. I also came across the following documentation (vLLM’s torch.compile integration — vLLM), and my understanding is that full CUDA graph is only compatible with FA3. Why is it incompatible with FlashInfer and the other attention backends?

Thank you in advance for your answers! I wish you all a pleasant day!

Full CUDA graph capture in vLLM V1 is currently only supported with FlashAttention 3 (FA3). FA2 relies on dynamic scheduling tied to its Grouped Query Attention (GQA) optimization, which prevents the kernel from being safely and efficiently captured and replayed as part of a CUDA graph, i.e. it is not “graphable”. This limitation is stated explicitly in PR #16072 and the official documentation.
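If you want to try it yourself, here is a minimal sketch of enabling full CUDA graph capture on a build that includes PR #16072. The `full_cuda_graph` flag and the `VLLM_FLASH_ATTN_VERSION` environment variable are the ones introduced around that PR; names may differ in newer releases, and the model ID is only an example.

```python
# Minimal sketch, assuming a vLLM build that includes PR #16072 (V1 engine).
# Flag and env-var names may have changed in later releases.
import os

# Force the FlashAttention 3 backend; with FA2 the attention op would stay
# outside the captured graph (piecewise capture only).
os.environ["VLLM_FLASH_ATTN_VERSION"] = "3"

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model
    compilation_config={"full_cuda_graph": True},
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```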

Similarly, FlashInfer and the other attention backends are not currently compatible with full CUDA graph capture: their kernels either lack the static execution properties that graph capture requires or have not yet implemented the necessary support. Only FA3, with its updated kernel design, is currently supported for full CUDA graph in vLLM V1, and only when cascade attention is disabled. See the docs and PR #16072 for details.
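To make the “graphable” requirement concrete, here is a conceptual sketch in plain PyTorch (not vLLM’s actual capture code). A captured CUDA graph replays exactly the kernels and launch parameters recorded at capture time; only the contents of the static tensors may change. Any per-step decision made outside the captured kernels, such as the dynamic scheduling mentioned above, would be frozen at capture time, which is why such kernels cannot be captured correctly.

```python
# Conceptual sketch with plain PyTorch, not vLLM code: a CUDA graph replays
# the exact kernels and launch configurations recorded at capture time.
import torch

assert torch.cuda.is_available()

static_x = torch.randn(8, 1024, device="cuda")
static_y = torch.empty_like(static_x)

# Warm up on a side stream before capture (required by PyTorch's capture API).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        static_y.copy_(static_x * 2)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    # Everything recorded here runs with fixed shapes and launch parameters.
    static_y.copy_(static_x * 2)

# At replay time only the *contents* of static_x can change; new shapes or a
# different kernel schedule would require re-capture.
static_x.copy_(torch.randn(8, 1024, device="cuda"))
g.replay()
torch.cuda.synchronize()
print(static_y[0, :4])  # reflects the updated static_x
```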

Would you like a deeper technical explanation of the kernel-level reasons for these incompatibilities?
