I am thinking that we could capture only ONE transformer layer as a CUDA graph and replay it for the following layers, since the forward step is identical across layers in most LLMs. Compared to piecewise CUDA graphs, this would reduce the number of CUDA graphs from (num_bs * num_layer) to num_bs, reducing the GPU memory spent on CUDA graphs, so perhaps more batch sizes could be captured! Thanks for any comments.
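To illustrate the idea, here is a minimal sketch (not vLLM code; it only uses PyTorch's public `torch.cuda.CUDAGraph` API with a hypothetical one-matmul "layer"): capture one layer once, then replay the same graph for every layer by copying each layer's weights into static buffers before each replay.

```python
# Minimal sketch (not vLLM code): capture ONE toy "layer" once, then replay
# that same graph for every layer by routing each layer's weights and
# activations through static buffers whose addresses the graph was captured with.
import torch

device = "cuda"
hidden, batch, num_layers = 1024, 8, 4

# Hypothetical per-layer weights for a toy one-matmul "layer".
layer_weights = [torch.randn(hidden, hidden, device=device) for _ in range(num_layers)]

# Static buffers; their addresses get baked into the captured graph.
static_w = torch.empty(hidden, hidden, device=device)
static_x = torch.empty(batch, hidden, device=device)
static_y = torch.empty(batch, hidden, device=device)

def one_layer():
    # The captured work only ever touches these fixed addresses.
    torch.matmul(static_x, static_w, out=static_y)

# Warm up on a side stream (recommended before capture), then capture once.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    static_x.normal_()
    static_w.copy_(layer_weights[0])
    one_layer()
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    one_layer()

# One "forward pass": replay the single graph num_layers times, copying each
# layer's weights into the static buffer before the replay.
static_x.copy_(torch.randn(batch, hidden, device=device))
for w in layer_weights:
    static_w.copy_(w)          # extra weight copy every layer: the main added cost
    g.replay()
    static_x.copy_(static_y)   # feed this layer's output into the next layer
print(static_y.shape)
```

The per-layer `static_w.copy_(w)` is the key trade-off: it is what lets one graph serve all layers, but it adds weight traffic on every replay.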
Capturing a single transformer layer as a CUDA graph and replaying it for all layers is not currently supported in vLLM. Each layer may have different parameters, states, or input/output buffer addresses, so vLLM captures piecewise CUDA graphs per layer and batch size to ensure correctness and performance. This results in (num_layers × num_batch_sizes) graphs, but vLLM uses memory optimizations like shared output buffers to reduce the memory footprint per graph, as described in PR #5074 and cuda_graph.py.
Your idea could theoretically reduce the number of graphs, but vLLM’s current design does not support sharing a single graph across all layers due to the need for layer-specific compilation and memory management. For now, you can reduce memory usage by limiting the number of batch sizes captured (see docs). Would you like a deeper technical breakdown of why per-layer capture is required?
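To make the "memory management" part concrete: a captured graph keeps reading the exact tensor addresses that were live at capture time. The small PyTorch-only demo below (toy shapes, not vLLM internals) shows that rebinding a weight tensor has no effect on replay unless the data is copied in-place into the captured buffer.

```python
# Demo (assumed toy example, not vLLM code): a CUDA graph replays against the
# buffers it was captured with, so a "new" weight tensor at a different address
# is invisible to the graph unless its data is copied into the captured buffer.
import torch

device = "cuda"
w0 = torch.full((4, 4), 1.0, device=device)  # captured weight buffer (fixed address)
x  = torch.ones(4, 4, device=device)
y  = torch.empty(4, 4, device=device)

# Warm up on a side stream, then capture a single matmul reading x/w0, writing y.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    torch.matmul(x, w0, out=y)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    torch.matmul(x, w0, out=y)

w1 = torch.full((4, 4), 2.0, device=device)  # "next layer's" weights, different address
g.replay()
print(y[0, 0].item())   # 4.0: replay still reads w0, not w1

w0.copy_(w1)            # only an in-place copy into the captured buffer changes the result
g.replay()
print(y[0, 0].item())   # 8.0
```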
Regarding "but vLLM's current design does not support sharing a single graph across all layers due to the need for layer-specific compilation and memory management": could you say more about this "layer-specific compilation and memory management"?
What if I have already solved the problem of handling the per-layer model weights (GEMM weights, norm gamma/beta, etc.)? How large do you think the benefit of a single shared per-layer CUDA graph could be?
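For a rough sense of scale, a back-of-the-envelope calculation (all numbers below are hypothetical placeholders, not measured vLLM figures):

```python
# Back-of-the-envelope estimate: graph counts under the two schemes.
# All numbers are hypothetical placeholders; measure per-graph overhead on your setup.
num_layers = 32          # e.g., a 7B-class decoder
num_batch_sizes = 16     # captured batch-size buckets
mem_per_graph_mb = 2.0   # placeholder per-graph overhead

per_layer_graphs = num_layers * num_batch_sizes   # current per-layer scheme
shared_graphs = num_batch_sizes                   # one shared layer graph per batch size

print(f"per-layer: {per_layer_graphs} graphs, ~{per_layer_graphs * mem_per_graph_mb:.0f} MB")
print(f"shared:    {shared_graphs} graphs, ~{shared_graphs * mem_per_graph_mb:.0f} MB")
print(f"reduction: {num_layers}x fewer graphs")
# Note: this ignores the per-replay cost of copying each layer's weights
# (GEMM weights, norm gamma/beta) into the captured static buffers, which can
# dominate if the weights are large relative to the activations.
```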