I want to know, for the v1 version of piecewise CUDA Graph, when it will be used and when it won't. Can CUDA Graph be used for prefill?
In vLLM v1, piecewise CUDA Graphs are used by default for the computation between attention layers, while the attention operation itself runs in eager mode. This captures most of the performance benefit of CUDA Graphs while keeping attention flexible for optimizations and compatibility. By default, CUDA Graphs are only applied to decode-only batches, not to prefill or mixed (chunked prefill + decode) batches, because prefill attention is not CUDA Graph compatible in v1; enabling it for prefill is still under discussion and development. For more details, see docs/design/v1/torch_compile.md, the Cudagraph Capture page on docs.vllm.ai, and GitHub Issue #11436.
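As a rough illustration (a minimal sketch; the model name is only a placeholder), the default offline-inference setup below gets piecewise CUDA Graphs automatically, and enforce_eager=True turns CUDA Graph capture off entirely:

```python
from vllm import LLM, SamplingParams

# Default v1 behavior: piecewise CUDA Graphs are captured for the
# non-attention pieces of the model; attention itself runs eagerly.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# Setting enforce_eager=True disables CUDA Graph capture altogether
# (pure eager execution), which can help when debugging capture issues.
# llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enforce_eager=True)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```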
Would you like a step-by-step breakdown of how vLLM decides when to use CUDA Graphs?
To correct the response from RunLLM: it is correct that by default piecewise CUDA Graphs skip attention. However, there is an option to run attention inside CUDA Graphs if the full_cuda_graph flag is set. That is only supported for certain attention backends (FA3, Triton, MLA), and for MLA it only works for pure-decode batches. Improved support for full CUDA Graphs is coming in #20059.
You are correct: by default, vLLM v1 piecewise CUDA Graphs skip attention, but if you set full_cuda_graph=True in the compilation config, attention can be included in the CUDA Graph, provided the attention backend supports it. Currently, only FlashAttention 3 (FA3), Triton, and FlashMLA support this, and for FlashMLA, full CUDA Graphs are only used for pure decode batches (not prefill or mixed). Broader and more flexible support for full CUDA Graphs, including for FA2 and FlashInfer, is being developed in PR #20059 (see also docs).
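As a hedged sketch (the exact config field may vary across vLLM versions, and the model name is a placeholder), full CUDA Graphs are requested through the compilation config:

```python
from vllm import LLM

# Sketch: ask for full CUDA Graphs (attention included) via the
# compilation config. This only takes effect if the selected attention
# backend (e.g. FA3, Triton, FlashMLA) supports it; otherwise vLLM
# falls back to the piecewise behavior described above.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    compilation_config={"full_cuda_graph": True},
)
```

For online serving, the rough CLI equivalent is passing the same JSON via --compilation-config to vllm serve, assuming your vLLM version exposes that flag.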
Would you like more detail on how to enable or use full CUDA Graphs for supported backends?