I am thinking that we could capture only ONE transformer layer as a CUDA graph and replay it for the following layers, since the forward step is identical across layers in most LLMs. Compared to piecewise CUDA graphs, this would reduce the number of CUDA graphs from (num_bs * num_layer) to num_bs, reducing the GPU memory spent on CUDA graphs, so perhaps more batch sizes could be captured! Thanks for any comments.
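To illustrate the idea, here is a minimal sketch (not vLLM code; it only uses PyTorch's public `torch.cuda.CUDAGraph` API with a hypothetical one-matmul "layer"): capture one layer once, then replay the same graph for every layer by copying each layer's weights into static buffers before each replay.

```python
# Minimal sketch (not vLLM code): capture ONE toy "layer" once, then replay
# that same graph for every layer by routing each layer's weights and
# activations through static buffers whose addresses the graph was captured with.
import torch

device = "cuda"
hidden, batch, num_layers = 1024, 8, 4

# Hypothetical per-layer weights for a toy one-matmul "layer".
layer_weights = [torch.randn(hidden, hidden, device=device) for _ in range(num_layers)]

# Static buffers; their addresses get baked into the captured graph.
static_w = torch.empty(hidden, hidden, device=device)
static_x = torch.empty(batch, hidden, device=device)
static_y = torch.empty(batch, hidden, device=device)

def one_layer():
    # The captured work only ever touches these fixed addresses.
    torch.matmul(static_x, static_w, out=static_y)

# Warm up on a side stream (recommended before capture), then capture once.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    static_x.normal_()
    static_w.copy_(layer_weights[0])
    one_layer()
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    one_layer()

# One "forward pass": replay the single graph num_layers times, copying each
# layer's weights into the static buffer before the replay.
static_x.copy_(torch.randn(batch, hidden, device=device))
for w in layer_weights:
    static_w.copy_(w)          # extra weight copy every layer: the main added cost
    g.replay()
    static_x.copy_(static_y)   # feed this layer's output into the next layer
print(static_y.shape)
```

The per-layer `static_w.copy_(w)` is the key trade-off: it is what lets one graph serve all layers, but it adds weight traffic on every replay.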
Capturing a single transformer layer as a CUDA graph and replaying it for all layers is not currently supported in vLLM. Each layer may have different parameters, states, or input/output buffer addresses, so vLLM captures piecewise CUDA graphs per layer and batch size to ensure correctness and performance. This results in (num_layers × num_batch_sizes) graphs, but vLLM uses memory optimizations like shared output buffers to reduce the memory footprint per graph, as described in PR #5074 and cuda_graph.py.
Your idea could theoretically reduce the number of graphs, but vLLM’s current design does not support sharing a single graph across all layers due to the need for layer-specific compilation and memory management. For now, you can reduce memory usage by limiting the number of batch sizes captured (see docs). Would you like a deeper technical breakdown of why per-layer capture is required?
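To make the "memory management" part concrete: a captured graph keeps reading the exact tensor addresses that were live at capture time. The small PyTorch-only demo below (toy shapes, not vLLM internals) shows that rebinding a weight tensor has no effect on replay unless the data is copied in-place into the captured buffer.

```python
# Demo (assumed toy example, not vLLM code): a CUDA graph replays against the
# buffers it was captured with, so a "new" weight tensor at a different address
# is invisible to the graph unless its data is copied into the captured buffer.
import torch

device = "cuda"
w0 = torch.full((4, 4), 1.0, device=device)  # captured weight buffer (fixed address)
x  = torch.ones(4, 4, device=device)
y  = torch.empty(4, 4, device=device)

# Warm up on a side stream, then capture a single matmul reading x/w0, writing y.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    torch.matmul(x, w0, out=y)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    torch.matmul(x, w0, out=y)

w1 = torch.full((4, 4), 2.0, device=device)  # "next layer's" weights, different address
g.replay()
print(y[0, 0].item())   # 4.0: replay still reads w0, not w1

w0.copy_(w1)            # only an in-place copy into the captured buffer changes the result
g.replay()
print(y[0, 0].item())   # 8.0
```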
Regarding "but vLLM's current design does not support sharing a single graph across all layers due to the need for layer-specific compilation and memory management": could you say more about this "layer-specific compilation and memory management"?
What if I have already solved the problem of handling the per-layer model weights (GEMM weights, norm gamma/beta, etc.)? How large do you think the benefit of a single shared per-layer CUDA graph could be?
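For a rough sense of scale, a back-of-the-envelope calculation (all numbers below are hypothetical placeholders, not measured vLLM figures):

```python
# Back-of-the-envelope estimate: graph counts under the two schemes.
# All numbers are hypothetical placeholders; measure per-graph overhead on your setup.
num_layers = 32          # e.g., a 7B-class decoder
num_batch_sizes = 16     # captured batch-size buckets
mem_per_graph_mb = 2.0   # placeholder per-graph overhead

per_layer_graphs = num_layers * num_batch_sizes   # current per-layer scheme
shared_graphs = num_batch_sizes                   # one shared layer graph per batch size

print(f"per-layer: {per_layer_graphs} graphs, ~{per_layer_graphs * mem_per_graph_mb:.0f} MB")
print(f"shared:    {shared_graphs} graphs, ~{shared_graphs * mem_per_graph_mb:.0f} MB")
print(f"reduction: {num_layers}x fewer graphs")
# Note: this ignores the per-replay cost of copying each layer's weights
# (GEMM weights, norm gamma/beta) into the captured static buffers, which can
# dominate if the weights are large relative to the activations.
```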