Question: CUDA Graph Input Address Guarantees

Hello vLLM developers/community,

I am currently investigating the graph partitioning logic within vLLM, specifically focusing on how it leverages CUDA Graphs for performance optimization. I’ve encountered a point of confusion regarding memory address guarantees and would appreciate your expertise.

Context:
As I understand it, vLLM’s “graph mode” partitions the overall computation graph into multiple subgraphs, using the attention operator as the designated split op. The subgraphs composed of non-attention operators are then captured and replayed using CUDA Graphs to minimize kernel launch overhead.
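
(To make sure I am describing the right thing, here is a toy illustration of that partitioning idea, using torch.fx's generic `split_module` pass with softmax standing in for the attention split op. This is just the general technique as I understand it, not vLLM's actual splitting code.)

```python
# Toy illustration of graph partitioning at a designated split op, NOT vLLM's
# code: each non-"attention" piece would then be a CUDA Graph candidate.
import torch
import torch.fx as fx
from torch.fx.passes.split_module import split_module

class TinyModel(torch.nn.Module):
    def forward(self, x):
        x = torch.relu(x)              # -> submod_0 (graph-capturable)
        x = torch.softmax(x, dim=-1)   # stand-in for the attention split op
        return torch.sigmoid(x)        # -> submod_2 (graph-capturable)

model = TinyModel()
traced = fx.symbolic_trace(model)

partition = 0
def split_callback(node: fx.Node) -> int:
    global partition
    if node.op == "call_function" and getattr(node.target, "__name__", "") == "softmax":
        # Put the "attention" node in its own partition and start a new one
        # for everything that follows it.
        partition += 1
        own_partition = partition
        partition += 1
        return own_partition
    return partition

split = split_module(traced, model, split_callback)
print(split.code)   # forward() calls submod_0, submod_1, submod_2 in sequence
```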

Core Question:
A fundamental prerequisite for CUDA Graph execution is that the virtual addresses of all input buffers (tensors) must remain constant between graph capture and every subsequent replay. My question is: what mechanism does vLLM employ to ensure 100% address stability for the inputs to these CUDA Graph-captured subgraphs?
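
For reference, this is the invariant I mean in plain PyTorch terms (nothing vLLM-specific): the captured kernels are bound to the device addresses that were live at capture time, so fresh data has to be copied into those same static buffers before every replay.

```python
import torch

# Plain-PyTorch illustration of the address invariant (not vLLM code):
# kernels recorded into the graph keep reading/writing the exact buffers
# that were live at capture time.
static_in = torch.zeros(8, device="cuda")
static_out = torch.zeros(8, device="cuda")

# Warm up on a side stream before capture, as recommended by PyTorch.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    static_out.copy_(static_in * 2 + 1)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out.copy_(static_in * 2 + 1)   # addresses are now baked in

# Correct usage: copy fresh data INTO the captured input buffer, then replay.
static_in.copy_(torch.arange(8, dtype=torch.float32, device="cuda"))
g.replay()
torch.cuda.synchronize()
print(static_out)   # tensor([ 1., 3., 5., ..., 15.]) on the GPU

# Passing a freshly allocated tensor instead would silently be ignored:
# the graph still reads from static_in's original address.
```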

Point of Confusion:
In an attempt to answer this, I consulted a general-purpose LLM. Its response was that this address consistency is not strictly guaranteed but rather that the memory allocator (in this case, vLLM’s custom allocator) will, with very high probability, provide the same address for a given allocation pattern.

This leads me to my specific inquiry: Is this probabilistic view correct? Or does vLLM implement a deterministic strategy (e.g., pre-allocating a persistent memory pool, using a specific allocation flag, etc.) that guarantees the input addresses for the CUDA Graphs will never change once the graph is captured?

Any insight into the implementation details that ensure this critical invariant would be extremely helpful for my understanding.

Thank you for your time and for developing this fantastic project.

┌────────────────────────────────────────────────────────────────┐
│  First Run (Capture)                                           │
│  ────────────────────────────────────────────────────────────  │
│                                                                │
│  1. Submod_0 CUDA Graph Capture                                │
│     ├─ PyTorch allocator allocates tensor → address 0x1000     │
│     └─ Recorded into the CUDA Graph                            │
│                                                                │
│  2. Attention (Eager, not in the CUDA Graph)                   │
│     ├─ output = torch.zeros(...) → allocator returns 0x2000    │
│     └─ Execute attention kernel                                │
│                                                                │
│  3. Submod_1 CUDA Graph Capture                                │
│     ├─ Reads input (at address 0x2000)                         │
│     ├─ Records `input_addresses = [0x2000]` ← Key point!       │
│     └─ Captured into the CUDA Graph                            │
└────────────────────────────────────────────────────────────────┘

┌────────────────────────────────────────────────────────────────┐
│  Second Run (Replay)                                           │
│  ────────────────────────────────────────────────────────────  │
│                                                                │
│  1. Submod_0 CUDA Graph Replay                                 │
│     └─ tensor is still at 0x1000 (guaranteed by CUDA Graph)    │
│                                                                │
│  2. Attention (Eager, executed again)                          │
│     ├─ output = torch.zeros(...) → allocator *very likely*     │
│     │   returns 0x2000                                         │
│     │   WHY? Because:                                          │
│     │   • Same size request                                    │
│     │   • 0x2000 was just freed back to the pool               │
│     │   • No other interfering allocations                     │
│     │   • The caching allocator prioritizes reusing            │
│     │     recently freed blocks                                │
│     └─ Execute attention kernel                                │
│                                                                │
│  3. Submod_1 CUDA Graph Replay                                 │
│     ├─ Checks input address: expects 0x2000                    │
│     ├─ Actual input address: 0x2000 ✓                          │
│     │   (If not, an assert will fail in debug mode)            │
│     └─ Replay CUDA Graph                                       │
└────────────────────────────────────────────────────────────────┘
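
The “very likely returns 0x2000” step above can be reproduced outside vLLM with a small, purely illustrative experiment against the PyTorch caching allocator: with an identical request size and no interleaving allocations, the freed block is typically handed back at the same address, although nothing in the allocator's contract promises this.

```python
import torch

# Purely illustrative: the caching-allocator reuse that step 2 relies on.
# This is typical behaviour under a stable allocation pattern, not a guarantee.
def attention_like_step():
    return torch.zeros(1024, 1024, device="cuda")   # same size every call

seen = set()
for _ in range(5):
    out = attention_like_step()
    seen.add(out.data_ptr())
    del out   # the block goes back to the caching allocator's pool
print(f"distinct addresses over 5 steps: {len(seen)}")   # typically 1
```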

vLLM guarantees input address stability for CUDA Graph-captured subgraphs by explicitly managing and reusing persistent input buffers, not by relying on probabilistic allocator behavior. During graph capture, vLLM records the data pointers (addresses) of all input tensors, and on every replay, it asserts that the addresses match the recorded ones—if not, it will fail in debug mode. This is achieved by pre-allocating and reusing the same input buffers for each batch shape, ensuring deterministic address stability required by CUDA Graphs. The allocator is not relied upon for probabilistic reuse; instead, vLLM’s runtime and compilation logic enforce this invariant by design, as shown in the CUDAGraphWrapper implementation and design documentation.
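
A heavily simplified sketch of that record-and-assert idea (not the actual CUDAGraphWrapper source; warmup, multiple batch shapes, and output handling are omitted) might look like this:

```python
import torch

# Heavily simplified sketch of the record-and-assert idea described above.
# Warmup before capture and support for multiple batch shapes are omitted.
class GraphedCallable:
    def __init__(self, fn):
        self.fn = fn
        self.graph = None
        self.input_addresses = None   # data_ptr()s recorded at capture time
        self.static_output = None

    def __call__(self, *inputs):
        if self.graph is None:
            # Capture path: remember the exact addresses the graph is built on.
            self.input_addresses = [t.data_ptr() for t in inputs]
            self.graph = torch.cuda.CUDAGraph()
            with torch.cuda.graph(self.graph):
                self.static_output = self.fn(*inputs)
            self.graph.replay()   # capture only records the kernels; run once
            return self.static_output

        # Replay path: inputs must live at exactly the captured addresses.
        current = [t.data_ptr() for t in inputs]
        assert current == self.input_addresses, (
            "CUDA Graph replay saw different input addresses than capture")
        self.graph.replay()
        return self.static_output
```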

If the input addresses do not match during replay, vLLM will raise an assertion error in debug mode, preventing silent errors. This deterministic strategy is critical for correctness and is not left to chance or allocator heuristics. For more details, see the design doc and source code.

From Sonnet 4.5:

The more accurate summary is:

vLLM uses a hybrid strategy for CUDA Graph input address stability:

  1. Model top-level inputs: can optionally be copied into explicitly managed persistent buffers (cudagraph_copy_inputs, disabled by default); a sketch of this copy-into-persistent-buffer pattern appears at the end of this post

  2. Intermediate tensors between subgraphs: Relies on the PyTorch caching allocator’s “pseudo-deterministic” behavior under stable workload patterns

  3. Verification mechanism: Debug mode checks address consistency, but this is detection rather than prevention

In practice, this is highly reliable because LLM inference execution patterns are extremely stable, but it’s not theoretically 100% guaranteed. This is a pragmatic engineering trade-off.
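
For point 1, the general pattern behind a “copy inputs into persistent buffers” option looks roughly like the sketch below. The class and method names are invented for illustration; this shows the technique, not vLLM's implementation.

```python
import torch

# Illustration of the technique behind option 1 (copying caller inputs into
# persistently owned buffers before replay). Names are invented for the
# example; this is not vLLM's implementation. Warmup before capture omitted.
class PersistentInputRunner:
    def __init__(self, fn, example_inputs):
        # Buffers owned by the runner; their addresses never change afterwards.
        self.static_inputs = [t.clone() for t in example_inputs]
        self.graph = torch.cuda.CUDAGraph()
        with torch.cuda.graph(self.graph):
            self.static_output = fn(*self.static_inputs)

    def run(self, *inputs):
        # Whatever addresses the caller's tensors happen to have, the data is
        # copied into the buffers the graph was captured against, then replayed.
        for dst, src in zip(self.static_inputs, inputs):
            dst.copy_(src)
        self.graph.replay()
        return self.static_output
```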