Data Parallel Attention CUDA Graph

Does vLLM capture CUDA graphs when doing inference on a single node in a data-parallel way, or is this topic still open?

Yes. vLLM captures CUDA graphs by default for the decode phase, regardless of whether you use data parallelism, as long as you do not set `enforce_eager=True`. CUDA graph capture is supported in both single-node and distributed (including data-parallel) settings; the optimization targets decoding rather than prefill. Nothing in the documentation or issue tracker indicates that CUDA graph capture is disabled or unsupported with data parallelism on a single node: it stays enabled unless you opt out or use a model/backend that is incompatible with CUDA graphs. For more, see Reducing CUDA Graphs and the related issue.
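As a concrete illustration, here is a minimal offline-inference sketch: CUDA graphs are captured during engine initialization unless you opt out with `enforce_eager=True`. The data-parallel launch itself is omitted here (in recent vLLM versions it is typically set up per replica, e.g. via `vllm serve --data-parallel-size N` or an external launcher); the model name below is just a placeholder, and exact defaults may differ across versions, so treat this as a sketch rather than a definitive configuration.

```python
from vllm import LLM, SamplingParams

# Build an engine with CUDA graph capture left at its default (enabled).
# Setting enforce_eager=True instead would skip capture and run eagerly.
llm = LLM(
    model="facebook/opt-125m",  # placeholder; any supported model works
    enforce_eager=False,        # default; CUDA graphs captured for decode
)

# Generation proceeds normally; the captured graphs are replayed
# internally during the decode steps.
params = SamplingParams(temperature=0.8, max_tokens=32)
outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```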

Would you like more detail on how CUDA graph capture interacts with other parallelism strategies, or on specific configuration examples?
