I want to know, for the v1 version of piecewise CUDA Graph, when it will be used and when it won't. Can CUDA Graph be used for prefill?
In vLLM v1, piecewise CUDA Graphs are used by default for the computation between attention layers, while the attention operation itself runs in eager mode. This captures most of the performance benefit of CUDA Graphs while keeping attention flexible for optimizations and compatibility. By default, CUDA Graphs are only applied to decode-only batches, not to prefill or mixed (chunked prefill + decode) batches, because prefill attention is not CUDA Graph compatible in v1; enabling it for prefill is still under discussion and development. For more details, see docs/design/v1/torch_compile.md, the Cudagraph Capture page on docs.vllm.ai, and GitHub Issue #11436.
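As a rough illustration (a minimal sketch; the model name is only a placeholder), the default offline-inference setup below gets piecewise CUDA Graphs automatically, and enforce_eager=True turns CUDA Graph capture off entirely:

```python
from vllm import LLM, SamplingParams

# Default v1 behavior: piecewise CUDA Graphs are captured for the
# non-attention pieces of the model; attention itself runs eagerly.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

# Setting enforce_eager=True disables CUDA Graph capture altogether
# (pure eager execution), which can help when debugging capture issues.
# llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enforce_eager=True)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```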
Would you like a step-by-step breakdown of how vLLM decides when to use CUDA Graphs?
To correct the response from RunLLM: it is correct that by default piecewise CUDA Graphs skip attention. However, there is an option to run attention inside CUDA Graphs if the full_cuda_graph flag is set. That is only supported for certain attention backends (FA3, Triton, MLA), and for MLA it only works for pure-decode batches. Improved support for full CUDA Graphs is coming in #20059.
You are correct: by default, vLLM v1 piecewise CUDA Graphs skip attention, but if you set full_cuda_graph=True in the compilation config, attention can be included in the CUDA Graph, provided the attention backend supports it. Currently, only FlashAttention 3 (FA3), Triton, and FlashMLA support this, and for FlashMLA, full CUDA Graphs are only used for pure decode batches (not prefill or mixed). Broader and more flexible support for full CUDA Graphs, including for FA2 and FlashInfer, is being developed in PR #20059 (see also docs).
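As a hedged sketch (the exact config field may vary across vLLM versions, and the model name is a placeholder), full CUDA Graphs are requested through the compilation config:

```python
from vllm import LLM

# Sketch: ask for full CUDA Graphs (attention included) via the
# compilation config. This only takes effect if the selected attention
# backend (e.g. FA3, Triton, FlashMLA) supports it; otherwise vLLM
# falls back to the piecewise behavior described above.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    compilation_config={"full_cuda_graph": True},
)
```

For online serving, the rough CLI equivalent is passing the same JSON via --compilation-config to vllm serve, assuming your vLLM version exposes that flag.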
Would you like more detail on how to enable or use full CUDA Graphs for supported backends?