Who ensures that inputs for the middle piecewise graphs are static buffers?

Hi vLLM team, I'm curious about the piecewise graph mechanism.

I just ran into a question: CUDAGraphWrapper (or the previous CudaPiecewiseBackend) needs the caller to ensure that input buffers stay at the same addresses across capture and replay, but I couldn't figure out who does this.
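
For context, this is the contract I mean, shown with plain torch.cuda.CUDAGraph (a minimal standalone sketch, not vLLM code): capture records buffer addresses, so a replay only sees new data if it is copied into the very same tensors.

```python
import torch

def piece(x):
    # stand-in for one compiled piece of the model
    return x * 2 + 1

static_x = torch.zeros(8, device="cuda")   # persistent ("static") input buffer

# warmup before capture (real code warms up on a side stream)
piece(static_x)
torch.cuda.synchronize()

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_out = piece(static_x)           # capture bakes in static_x's address

def replay(new_x: torch.Tensor) -> torch.Tensor:
    static_x.copy_(new_x)                  # refresh contents, keep the address
    graph.replay()
    return static_out                      # output also lives at a fixed address
```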

copy_and_call in VllmBackend seems related, but it only handles the inputs of the whole module (which are also the inputs of the first piece). What about the graphs in the middle of the module (the 2nd, 3rd, … pieces), especially the successors of splitting_ops?

In vLLM's piecewise CUDA graph mechanism, input buffer address consistency for each subgraph (piece) is managed by the compiler backend. For each piece produced by the split (e.g., around the attention ops), vLLM pre-allocates persistent buffers for the symbolic-shaped inputs (input_ids, positions, etc.) and ensures that the same memory addresses are used during both capture and replay. The copy-and-call logic in the backend copies the runtime inputs into these pre-allocated buffers before invoking the subgraph, so every piece, including those after splitting_ops, sees its inputs at the same addresses on every run.

This is handled internally by the backend and is not limited to the first piece; it applies to all submodules created by the graph split, which is what guarantees address consistency for every piecewise CUDA graph capture and replay (see vllm/compilation/backends.py, the CUDAPiecewiseBackend docs, and design/torch_compile).
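
To make the description above concrete, here is a simplified sketch of the copy-and-call idea (assumed names and shapes, not the actual vLLM implementation): runtime inputs are copied into pre-allocated buffers whose addresses never change, and each captured piece only ever sees those buffers.

```python
import torch

class StaticBufferPiece:
    """Wraps one compiled piece behind persistent input buffers (illustrative only)."""

    def __init__(self, piece, example_inputs):
        self.piece = piece
        # allocate persistent input buffers once; their addresses never change
        self.static_inputs = [torch.empty_like(t) for t in example_inputs]

        # warmup before capture (real code warms up on a side stream)
        self.piece(*self.static_inputs)
        torch.cuda.synchronize()

        self.graph = torch.cuda.CUDAGraph()
        with torch.cuda.graph(self.graph):
            # capture records the fixed addresses of self.static_inputs
            self.static_output = self.piece(*self.static_inputs)

    def __call__(self, *runtime_inputs):
        # copy-and-call: refresh buffer contents, addresses stay the same
        for dst, src in zip(self.static_inputs, runtime_inputs):
            dst.copy_(src)
        self.graph.replay()
        return self.static_output
```

In a chain of such wrappers, the output of piece N is itself a fixed-address tensor, which is one way stable input addresses can propagate to the pieces in the middle.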

