Questions about CUDA graph compatibility with attention backends in vLLM

Hi everyone,
I’ve learned a lot from vLLM, and I would like to thank all the developers for their hard work!

I came across a PR ([Core] Support full cuda graph in v1 by chanh · Pull Request #16072 · vllm-project/vllm · GitHub) that mentions: “Full CUDA graph only works with V1 + FA3, since FA2 currently is not graphable due to an optimization for GQA.”

  1. What is the reason FA2 is incompatible with CUDA graph capture? GQA by itself doesn’t seem to require anything particularly special.

  2. I also came across the following documentation (vLLM’s torch.compile integration — vLLM), and my understanding is that full CUDA graph is only compatible with FA3. Why is it incompatible with FlashInfer and the other attention backends?

Thank you in advance for your answers! I wish you all a pleasant day!

Full CUDA graph capture in vLLM V1 is currently only supported with FlashAttention 3 (FA3). FA2 relies on dynamic scheduling tied to its Grouped Query Attention (GQA) optimization, which prevents the kernel from being safely and efficiently captured and replayed as part of a CUDA graph, i.e. it is not “graphable”. This limitation is stated explicitly in PR #16072 and the official documentation.
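If you want to try it yourself, here is a minimal sketch of enabling full CUDA graph capture on a build that includes PR #16072. The `full_cuda_graph` flag and the `VLLM_FLASH_ATTN_VERSION` environment variable are the ones introduced around that PR; names may differ in newer releases, and the model ID is only an example.

```python
# Minimal sketch, assuming a vLLM build that includes PR #16072 (V1 engine).
# Flag and env-var names may have changed in later releases.
import os

# Force the FlashAttention 3 backend; with FA2 the attention op would stay
# outside the captured graph (piecewise capture only).
os.environ["VLLM_FLASH_ATTN_VERSION"] = "3"

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model
    compilation_config={"full_cuda_graph": True},
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```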

Similarly, FlashInfer and the other attention backends are not currently compatible with full CUDA graph capture: their kernels either lack the static execution properties that graph capture requires or have not yet implemented the necessary support. Only FA3, with its updated kernel design, is currently supported for full CUDA graph in vLLM V1, and only when cascade attention is disabled. See the docs and PR #16072 for details.
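To make the “graphable” requirement concrete, here is a conceptual sketch in plain PyTorch (not vLLM’s actual capture code). A captured CUDA graph replays exactly the kernels and launch parameters recorded at capture time; only the contents of the static tensors may change. Any per-step decision made outside the captured kernels, such as the dynamic scheduling mentioned above, would be frozen at capture time, which is why such kernels cannot be captured correctly.

```python
# Conceptual sketch with plain PyTorch, not vLLM code: a CUDA graph replays
# the exact kernels and launch configurations recorded at capture time.
import torch

assert torch.cuda.is_available()

static_x = torch.randn(8, 1024, device="cuda")
static_y = torch.empty_like(static_x)

# Warm up on a side stream before capture (required by PyTorch's capture API).
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        static_y.copy_(static_x * 2)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    # Everything recorded here runs with fixed shapes and launch parameters.
    static_y.copy_(static_x * 2)

# At replay time only the *contents* of static_x can change; new shapes or a
# different kernel schedule would require re-capture.
static_x.copy_(torch.randn(8, 1024, device="cuda"))
g.replay()
torch.cuda.synchronize()
print(static_y[0, :4])  # reflects the updated static_x
```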

Would you like a deeper technical explanation of the kernel-level reasons for these incompatibilities?
