When capturing a CUDA graph, are the intermediate buffers allocated from the CUDA graph memory pool or the PyTorch memory pool?
Intermediate buffers allocated during CUDA graph capture in vLLM come from the CUDA graph memory pool, not the default PyTorch caching-allocator pool. This is achieved by entering the torch.cuda.graph context with the pool parameter set to a shared memory pool handle, so every allocation made inside the capture is served from that pool and can be reused efficiently during graph replay. The capture logic and documentation show this explicitly: vLLM passes the same pool handle to every capture so that graphs for different batch shapes share buffers, reducing the overall memory footprint during inference and serving[^1^][^2^][^3^].
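As a minimal sketch of this pattern (not vLLM's actual capture code), the snippet below uses the standard PyTorch APIs torch.cuda.graph_pool_handle() and torch.cuda.graph(..., pool=...); the function forward and the tensor shapes are illustrative only:

```python
import torch

# Opaque handle to a shared capture pool; passing the same handle to
# multiple captures lets their graphs share intermediate buffers.
pool = torch.cuda.graph_pool_handle()

static_input = torch.randn(8, 16, device="cuda")

def forward(x):
    # Intermediate tensors created here during capture are allocated
    # from `pool`, not the default PyTorch caching-allocator pool.
    return (x @ x.t()).relu()

# Warm up on a side stream before capture, as PyTorch recommends.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        forward(static_input)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g, pool=pool):
    static_output = forward(static_input)

# Replay reuses the pool's buffers; copy fresh data into the static
# input tensor first, then read results from static_output.
static_input.copy_(torch.randn(8, 16, device="cuda"))
g.replay()
```

Capturing a second graph (e.g., for a different batch size) with the same pool handle would let the allocator reuse the first graph's buffers where possible, which is the memory-sharing behavior described above.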
Would you like more detail on how this is implemented or how to customize the memory pool?
Sources: