Who ensures that inputs for the middle piecewise graphs are static buffers?

Hi vLLM team, I'm curious about the piecewise graph mechanism.

I just ran into a question: CUDAGraphWrapper (or the previous CudaPiecewiseBackend) needs the caller to ensure that input buffers stay at the same addresses across capture and replay, but I couldn't figure out who does this.
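
For context, this is the contract I mean, shown with plain torch.cuda.CUDAGraph (a minimal standalone sketch, not vLLM code): capture records buffer addresses, so a replay only sees new data if it is copied into the very same tensors.

```python
import torch

def piece(x):
    # stand-in for one compiled piece of the model
    return x * 2 + 1

static_x = torch.zeros(8, device="cuda")   # persistent ("static") input buffer

# warmup before capture (real code warms up on a side stream)
piece(static_x)
torch.cuda.synchronize()

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_out = piece(static_x)           # capture bakes in static_x's address

def replay(new_x: torch.Tensor) -> torch.Tensor:
    static_x.copy_(new_x)                  # refresh contents, keep the address
    graph.replay()
    return static_out                      # output also lives at a fixed address
```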

copy_and_call in VllmBackend seems related, but it only handles the inputs of the whole module (which are also the inputs of the first piece). What about the graphs in the middle of the module (the 2nd, 3rd, … pieces), especially the successors of splitting_ops?

In vLLM's piecewise CUDA graph mechanism, input buffer address consistency for each subgraph (piece) is managed by the compiler backend. For each piece produced by the split (e.g., around the attention ops), vLLM pre-allocates persistent buffers for the symbolic-shaped inputs (input_ids, positions, etc.) and ensures that the same memory addresses are used during both capture and replay. The copy-and-call logic in the backend copies the runtime inputs into these pre-allocated buffers before invoking the subgraph, so every piece, including those after splitting_ops, sees its inputs at the same addresses on every run.

This is handled internally by the backend and is not limited to the first piece; it applies to all submodules created by the graph split, which is what guarantees address consistency for every piecewise CUDA graph capture and replay (see vllm/compilation/backends.py, the CUDAPiecewiseBackend docs, and design/torch_compile).
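
To make the description above concrete, here is a simplified sketch of the copy-and-call idea (assumed names and shapes, not the actual vLLM implementation): runtime inputs are copied into pre-allocated buffers whose addresses never change, and each captured piece only ever sees those buffers.

```python
import torch

class StaticBufferPiece:
    """Wraps one compiled piece behind persistent input buffers (illustrative only)."""

    def __init__(self, piece, example_inputs):
        self.piece = piece
        # allocate persistent input buffers once; their addresses never change
        self.static_inputs = [torch.empty_like(t) for t in example_inputs]

        # warmup before capture (real code warms up on a side stream)
        self.piece(*self.static_inputs)
        torch.cuda.synchronize()

        self.graph = torch.cuda.CUDAGraph()
        with torch.cuda.graph(self.graph):
            # capture records the fixed addresses of self.static_inputs
            self.static_output = self.piece(*self.static_inputs)

    def __call__(self, *runtime_inputs):
        # copy-and-call: refresh buffer contents, addresses stay the same
        for dst, src in zip(self.static_inputs, runtime_inputs):
            dst.copy_(src)
        self.graph.replay()
        return self.static_output
```

In a chain of such wrappers, the output of piece N is itself a fixed-address tensor, which is one way stable input addresses can propagate to the pieces in the middle.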

