Data Parallel Attention CUDA Graph

Does vLLM capture CUDA graphs when doing inference on a single node in a data-parallel way, or is this topic still open?

Yes. vLLM captures CUDA graphs by default for the decode phase, regardless of whether you use data parallelism, as long as you do not set `enforce_eager=True`. CUDA graph capture is supported in both single-node and distributed (including data-parallel) settings; the optimization targets decoding rather than prefill. Nothing in the documentation or issue tracker indicates that CUDA graph capture is disabled or unsupported with data parallelism on a single node: it stays enabled unless you opt out or use a model/backend that is incompatible with CUDA graphs. For more, see Reducing CUDA Graphs and the related issue.
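As a concrete illustration, here is a minimal offline-inference sketch: CUDA graphs are captured during engine initialization unless you opt out with `enforce_eager=True`. The data-parallel launch itself is omitted here (in recent vLLM versions it is typically set up per replica, e.g. via `vllm serve --data-parallel-size N` or an external launcher); the model name below is just a placeholder, and exact defaults may differ across versions, so treat this as a sketch rather than a definitive configuration.

```python
from vllm import LLM, SamplingParams

# Build an engine with CUDA graph capture left at its default (enabled).
# Setting enforce_eager=True instead would skip capture and run eagerly.
llm = LLM(
    model="facebook/opt-125m",  # placeholder; any supported model works
    enforce_eager=False,        # default; CUDA graphs captured for decode
)

# Generation proceeds normally; the captured graphs are replayed
# internally during the decode steps.
params = SamplingParams(temperature=0.8, max_tokens=32)
outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```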

Would you like more detail on how CUDA graph capture interacts with other parallelism strategies, or on specific configuration examples?
