When capturing a CUDA graph, are the intermediate buffers allocated from the CUDA graph memory pool or the PyTorch memory pool?
Intermediate buffers allocated during CUDA graph capture in vLLM come from the CUDA graph memory pool, not the default PyTorch caching-allocator pool. This is achieved by entering the torch.cuda.graph context with the pool parameter set to a shared memory pool handle, so every allocation made inside the capture is served from that pool and can be reused efficiently during graph replay. The capture logic and documentation show this explicitly: vLLM passes the same pool handle to every capture so that graphs for different batch shapes share buffers, reducing the overall memory footprint during inference and serving[^1^][^2^][^3^].
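As a minimal sketch of this pattern (not vLLM's actual capture code), the snippet below uses the standard PyTorch APIs torch.cuda.graph_pool_handle() and torch.cuda.graph(..., pool=...); the function forward and the tensor shapes are illustrative only:

```python
import torch

# Opaque handle to a shared capture pool; passing the same handle to
# multiple captures lets their graphs share intermediate buffers.
pool = torch.cuda.graph_pool_handle()

static_input = torch.randn(8, 16, device="cuda")

def forward(x):
    # Intermediate tensors created here during capture are allocated
    # from `pool`, not the default PyTorch caching-allocator pool.
    return (x @ x.t()).relu()

# Warm up on a side stream before capture, as PyTorch recommends.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        forward(static_input)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g, pool=pool):
    static_output = forward(static_input)

# Replay reuses the pool's buffers; copy fresh data into the static
# input tensor first, then read results from static_output.
static_input.copy_(torch.randn(8, 16, device="cuda"))
g.replay()
```

Capturing a second graph (e.g., for a different batch size) with the same pool handle would let the allocator reuse the first graph's buffers where possible, which is the memory-sharing behavior described above.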
Would you like more detail on how this is implemented or how to customize the memory pool?
Sources: