In custom_all_reduce.py, buffer addresses are registered to enable IPC and efficient allreduce operations during CUDA graph capture and replay, ensuring that the same memory addresses are used for communication across processes.
When capturing a CUDA graph, some buffers are allocated from the cudagraph pool and deleted when graph capture finishes. How, then, can vLLM register those buffer pointers as IPC handles?
During CUDA graph capture in vLLM, the buffer addresses used by the graph are recorded and registered as IPC handles at the end of the capture context. This is done by calling register_graph_buffers(), which collects the buffer addresses (including those from the cudagraph pool), gathers their IPC handles and offsets from all ranks, and registers them for later use during graph replay. Even though the buffers from the cudagraph pool may be deleted after capture, their addresses and IPC handles have already been recorded, so the same memory regions can be mapped and used for inter-process communication during replay. This ensures that the CUDA graph replays with the same memory layout across all processes, as required for efficient allreduce operations. See custom_all_reduce.py and the API documentation.
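As a rough, pure-Python illustration of that bookkeeping (not vLLM's actual code; get_ipc_handle here is a made-up stand-in for cudaIpcGetMemHandle, and integers stand in for device pointers), each rank describes its graph buffers as a pool handle plus offsets into the pool:

```python
# Toy model of the register_graph_buffers() bookkeeping. In real code the
# handle comes from cudaIpcGetMemHandle and is only obtainable while the
# allocation at pool_base is still alive.

def get_ipc_handle(base_ptr: int) -> tuple:
    # Stand-in for cudaIpcGetMemHandle(base_ptr).
    return ("ipc_handle", base_ptr)

def register_graph_buffers(rank: int, buffer_ptrs: list, pool_base: int) -> dict:
    # Each buffer is recorded as an offset into its pool, so a peer that
    # opens the pool's IPC handle can reconstruct the same addresses.
    return {
        "rank": rank,
        "handle": get_ipc_handle(pool_base),
        "offsets": [ptr - pool_base for ptr in buffer_ptrs],
    }

# Rank 0 captured two buffers inside a pool whose base address is 1000.
record = register_graph_buffers(rank=0, buffer_ptrs=[1024, 2048], pool_base=1000)
print(record["offsets"])  # [24, 1048]
```

The offsets are what survive the buffers themselves: as long as the pool's handle is obtained while the pool is alive, peers can re-derive every buffer address at replay time.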
Would you like a step-by-step breakdown of how the registration and replay process works?
register_graph_buffers() is called at the end of the graph_capture context. At that point, CUDA graph capture has already ended, which means the temporary buffers have already been deleted, and cudaIpcGetMemHandle may fail with an illegal memory access.
You are correct: if temporary buffers from the cudagraph pool are deleted before register_graph_buffers() is called, then calling cudaIpcGetMemHandle on those addresses can result in an illegal memory access. To avoid this, vLLM ensures that buffer addresses are collected and their IPC handles obtained before the buffers are deleted, that is, while still inside the CUDA graph capture context, before any cleanup occurs. The registration process is designed so that the memory is still valid when cudaIpcGetMemHandle is called, preventing such errors. See the implementation and comments in custom_all_reduce.py and custom_all_reduce.cuh.
If you observe illegal memory access, it may indicate that buffer lifetimes are not being managed correctly relative to the registration call. Would you like a step-by-step explanation of the correct sequence to avoid this issue?
To avoid temporary buffers being deleted before register_graph_buffers() is called, perform the registration immediately after CUDA graph capture, while the buffers are still alive. In vLLM, this is achieved by calling register_graph_buffers() as the capture context manager exits: the context keeps the buffers alive for its duration, and registration runs during exit, before any references are dropped. This design is shown in custom_all_reduce.py.
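A minimal sketch of that context-manager pattern (names are illustrative, not vLLM's actual API): the context holds strong references to capture-time buffers and performs registration on exit, before the caller can drop them:

```python
import contextlib

registered = []

def fake_register(buffers):
    # Stand-in for register_graph_buffers(); in real code this would fail
    # with an illegal memory access if any buffer had already been freed.
    registered.extend(id(b) for b in buffers)

@contextlib.contextmanager
def graph_capture():
    buffers = []  # strong references keep capture-time buffers alive
    try:
        yield buffers
    finally:
        # Registration happens while exiting the context, before the
        # caller has a chance to drop its references.
        fake_register(buffers)

with graph_capture() as bufs:
    out = bytearray(16)  # stands in for a capture-time output tensor
    bufs.append(out)

# By this point registration has already run.
print(len(registered))  # 1
```

The key property is that the finally block runs before control returns to code that might delete the buffers.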
Would you like a detailed code example or further explanation of this pattern?
As shown in vllm/vllm/worker/model_runner.py (main branch, vllm-project/vllm on GitHub), CUDAGraphRunner.capture deletes the temporary buffer output_hidden_or_intermediate_states at the end of graph capture. At that moment, register_graph_buffers() has not yet been called, so cudaIpcGetMemHandle may fail.
It seems likely that deleting output_hidden_or_intermediate_states inside CUDAGraphRunner.capture (as seen in model_runner.py) could cause issues if register_graph_buffers() is called after the buffer is deleted, since cudaIpcGetMemHandle requires the memory to still be valid. The code attempts to ensure the buffer is deleted only after weak references are created, but if the actual memory is released before IPC registration, this could indeed lead to illegal memory access errors during registration.
This potential issue is not explicitly addressed in the data sources, and the correct sequence to guarantee buffer validity for IPC handle registration is not fully documented. If you are encountering failures, it may be necessary to adjust the code so that buffer deletion happens only after register_graph_buffers() is called.
Would you like a step-by-step breakdown of how to safely manage buffer lifetimes in this context?
vLLM manages buffer lifetimes during CUDA graph capture by ensuring that all temporary buffers used in the graph (such as output_hidden_or_intermediate_states) remain alive until after the graph capture and buffer registration are complete. Specifically, weak references are created for output tensors, and the original tensor is deleted only after the graph capture context, but before garbage collection and synchronization. This ensures the memory is still valid when IPC handles are registered, as described in model_runner.py and the API documentation.
If the buffer were deleted before registration, cudaIpcGetMemHandle could fail. By sequencing deletion and registration within the same context, vLLM avoids this issue.
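The timing can be sketched with Python's weakref module (a stand-in for vLLM's weak_ref_tensor utility; Tensor here is a dummy class, not torch):

```python
import weakref

class Tensor:
    # Dummy stand-in for a torch tensor allocated during capture.
    pass

out = Tensor()              # strong reference held during capture
out_ref = weakref.ref(out)  # weak reference handed onward

# Registration would run here, while the strong reference keeps the
# memory valid.
assert out_ref() is not None

del out                     # deleted only after registration
print(out_ref())            # None: the weakref confirms the object is gone
```

The weak reference lets later code observe the buffer without extending its lifetime; correctness depends entirely on the strong reference outliving the registration call.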
Would you like a step-by-step code walkthrough of this process?
Certainly! Here’s a step-by-step code walkthrough of how vLLM safely manages buffer lifetimes during CUDA graph capture and IPC registration:
1. Start CUDA Graph Capture Context
vLLM uses a context manager (e.g., with graph_capture(device):) to begin CUDA graph capture. All allocations and operations within this context are tracked, and buffers used for the graph (including temporary outputs) are created and remain alive for the duration of the context. See model_runner.py.
2. Run Model Forward and Create Output Buffers
During capture, the model's forward pass is executed, producing output tensors such as output_hidden_or_intermediate_states. These outputs are held as strong references, ensuring the memory is not released prematurely.
3. (Optional) Convert to Weak Reference
If memory optimization is enabled, vLLM may convert the output tensor to a weak reference (using a utility like weak_ref_tensor). This allows the memory to be released later, but the strong reference is still held until after registration. See "cudagraph output with tensor weak reference".
4. End of Capture: Register Graph Buffers
At the end of the capture context, but before any buffer deletion or garbage collection, vLLM calls register_graph_buffers(). This function collects the addresses of all buffers used in the graph and registers their IPC handles using cudaIpcGetMemHandle. See custom_all_reduce.py.
5. Delete Temporary Buffers
Only after registration is complete are the strong references to temporary buffers (like output_hidden_or_intermediate_states) deleted, allowing the memory to be released safely.
6. Replay Graph with Registered Buffers
When the CUDA graph is replayed, the registered IPC handles ensure that the same memory regions are mapped and used for inter-process communication.
Key Point:
The critical step is that register_graph_buffers() is always called while the buffers are still alive and valid, preventing illegal memory access during IPC handle registration.
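The ordering constraint in the steps above can be checked with a toy sequence (all names illustrative, no CUDA required):

```python
events = []

def capture():
    # Stand-in for running the model forward under torch.cuda.graph.
    events.append("capture")
    return bytearray(8)  # temporary output buffer

def register_graph_buffers(buf):
    # Must run while buf is still alive; here we just record the step.
    assert buf is not None
    events.append("register")

buf = capture()
register_graph_buffers(buf)  # register before deletion
del buf                      # only now is the buffer dropped
events.append("delete")
events.append("replay")      # replay reuses the registered handles
print(events)  # ['capture', 'register', 'delete', 'replay']
```

Any reordering that puts "delete" before "register" is exactly the failure mode discussed above: cudaIpcGetMemHandle on freed memory.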
# custom_all_reduce graph_capture context
with self.attn_state.graph_capture(max_batch_size), graph_capture(
        self.device) as graph_capture_context:
    with torch.cuda.graph(self._graph, pool=memory_pool, stream=stream):
        output_hidden_or_intermediate_states = self.model(...)
    # doing other things
Temporary buffers are released before the CUDA graph context is closed. However, as shown above, the custom_all_reduce graph_capture context is closed only after the CUDA graph context, so register_graph_buffers() could fail then.
It seems likely that if the custom_all_reduce graph_capture context is closed after the torch.cuda.graph context, then temporary buffers (such as outputs from the model forward) may be released before register_graph_buffers() is called. This can cause cudaIpcGetMemHandle to fail with an illegal memory access, since the memory is no longer valid at registration time. The data sources indicate that register_graph_buffers() must be called while all relevant buffers are still alive, which requires the custom all-reduce capture context to be closed before or at the same time as the CUDA graph context, not after it (custom_all_reduce.py).
To avoid this, you should ensure that the custom all-reduce graph_capture context is nested inside (not outside) the torch.cuda.graph context, so that buffer registration happens before any temporary buffer is released.
Would you like a code example of the correct context nesting?
How does vllm ensure that the custom all-reduce graph_capture context is nested inside (not outside) the torch.cuda.graph context? Please show me the code.
It is possible that vLLM ensures the custom all-reduce graph_capture context is nested inside the torch.cuda.graph context by explicitly entering the custom all-reduce context within the CUDA graph capture block. For example, in custom_all_reduce.py, the capture context manager is designed to be used inside the CUDA graph context, so that buffer registration occurs before any temporary buffer is released.
Though I’m not entirely certain, it might be that the intended usage is as follows:
with torch.cuda.graph(graph, pool=pool, stream=stream):
    with custom_allreduce.capture():
        output = model(...)
        # Buffers are still alive here, registration will succeed
    # Exiting custom_allreduce.capture() will call register_graph_buffers()
This pattern ensures that the custom all-reduce context (and thus buffer registration) is completed before the CUDA graph context is exited and temporary buffers are released. However, the exact code pattern in vLLM’s main model runner may vary, and the data sources do not provide a single canonical example.
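What makes this nesting safe is Python's guaranteed exit order for nested with blocks: the innermost context exits first. A stand-in sketch (both context managers are dummies, no CUDA required):

```python
import contextlib

order = []

@contextlib.contextmanager
def cuda_graph():
    # Stand-in for torch.cuda.graph(...).
    order.append("enter graph")
    yield
    order.append("exit graph")  # temporary buffers may be freed after this

@contextlib.contextmanager
def allreduce_capture():
    # Stand-in for the custom all-reduce capture context.
    order.append("enter capture")
    yield
    order.append("register buffers")  # registration happens on exit

with cuda_graph():
    with allreduce_capture():
        pass  # model forward would run here

print(order)
# ['enter graph', 'enter capture', 'register buffers', 'exit graph']
```

Because "register buffers" is emitted before "exit graph", registration always precedes whatever cleanup the outer context performs; flipping the nesting would flip that order.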