Hello vLLM developers/community,
I am currently investigating the graph partitioning logic within vLLM, specifically focusing on how it leverages CUDA Graphs for performance optimization. I’ve encountered a point of confusion regarding memory address guarantees and would appreciate your expertise.
Context:
As I understand it, vLLM’s “graph mode” (piecewise CUDA graph compilation) partitions the overall computation graph into multiple subgraphs, using the attention operator as the splitting op. The non-attention subgraphs are then captured and replayed as CUDA Graphs to minimize kernel launch overhead.
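To make sure I am picturing this correctly, here is the execution shape I have in mind, written as a toy sketch. None of this is vLLM’s actual code; `submod_0`, `attention`, and `submod_1` are just placeholder names for the split pieces:

```python
import torch

# Placeholder pieces standing in for the split submodules; this is only the
# execution shape I have in mind, not vLLM's implementation.
def submod_0(x):        # pre-attention ops -> would be captured as a CUDA Graph
    return x * 2.0

def attention(x):       # the splitting op -> stays eager
    return torch.softmax(x, dim=-1) @ x

def submod_1(x):        # post-attention ops -> would be captured as a CUDA Graph
    return x + 1.0

def piecewise_forward(x):
    h = submod_0(x)     # CUDA Graph replay in "graph mode"
    h = attention(h)    # eager: its output tensor comes from the allocator each step
    return submod_1(h)  # CUDA Graph replay whose input address must match capture time

x = torch.randn(4, 4, device="cuda")
print(piecewise_forward(x).shape)
```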
Core Question:
A fundamental prerequisite for CUDA Graph execution is that the virtual addresses of all input buffers (tensors) must remain the same between graph capture and every subsequent replay. My question is: what mechanism does vLLM use to guarantee 100% address stability for the inputs of these CUDA-Graph-captured subgraphs?
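For reference, the only pattern I know of that gives a hard guarantee is the one from the PyTorch CUDA Graphs documentation: allocate static buffers once, capture with them, and on every step `copy_` fresh data into the same buffers before `replay()`. A minimal sketch in plain PyTorch (the `static_x`/`static_y` names are mine), just to frame what “guaranteed” would look like:

```python
import torch

# Static buffer allocated once: its address never changes for the life of the process.
static_x = torch.zeros(8, 8, device="cuda")

# Warm-up on a side stream, as the PyTorch CUDA Graphs docs recommend before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    _ = static_x * 2.0 + 1.0
torch.cuda.current_stream().wait_stream(s)

# Capture: the launched kernels bake in the address of static_x (and of static_y,
# which is allocated from the graph's private memory pool).
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_y = static_x * 2.0 + 1.0

# Replay: new data is copied *into* the same buffer, so the baked-in address stays valid.
for _ in range(3):
    static_x.copy_(torch.randn(8, 8, device="cuda"))
    g.replay()
    print(static_y.sum().item())
```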
Point of Confusion:
In an attempt to answer this, I consulted a general-purpose LLM. Its answer was that this address consistency is not strictly guaranteed; rather, the memory allocator (in this case, vLLM’s custom allocator) will, with very high probability, return the same address for a given allocation pattern.
This leads me to my specific question: is this probabilistic view correct, or does vLLM implement a deterministic strategy (e.g., pre-allocating a persistent memory pool, using a specific allocation flag, etc.) that guarantees the input addresses for the CUDA Graphs will never change once the graph has been captured?
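To make the “very high probability” part concrete, this is the caching-allocator behaviour I believe that answer was describing: a just-freed block is usually reused for an identical request, but I am not aware of any contractual guarantee. A small sketch (the helper name is mine):

```python
import torch

def alloc_and_record():
    t = torch.zeros(1024, 1024, device="cuda")  # identical size request each time
    addr = t.data_ptr()
    del t                                       # block returns to the caching allocator's pool
    return addr

a = alloc_and_record()
b = alloc_and_record()
# Usually prints True: the just-freed block is reused for the identical request.
# But as far as I know this is allocator *policy*, not a guarantee - which is
# exactly the gap I am asking about.
print(hex(a), hex(b), a == b)
```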
Any insight into the implementation details that enforce this critical invariant would be extremely helpful for my understanding.
Thank you for your time and for developing this fantastic project.
┌────────────────────────────────────────────────────────────────┐
│ First Run (Capture) │
│ ──────────────────────────────────────────────────────────── │
│ │
│ 1. Submod_0 CUDA Graph Capture │
│ ├─ PyTorch allocator allocates tensor → address 0x1000 │
│ └─ Recorded into the CUDA Graph │
│ │
│ 2. Attention (Eager, not in the CUDA Graph) │
│ ├─ output = torch.zeros(...) → allocator returns 0x2000 │
│ └─ Execute attention kernel │
│ │
│ 3. Submod_1 CUDA Graph Capture │
│ ├─ Reads input (at address 0x2000) │
│ ├─ Records `input_addresses = [0x2000]` ← Key point! │
│ └─ Captured into the CUDA Graph │
└────────────────────────────────────────────────────────────────┘
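My reading of step 3 (“Key point!”) above, as a simplified stand-alone sketch; `input_addresses` is just the name used in the diagram, and `attn_out` stands in for the eager attention output, not anything in vLLM’s code:

```python
import torch

attn_out = torch.zeros(16, 16, device="cuda")   # stands in for the eager attention output ("0x2000")

# The "key point": remember the virtual addresses the inputs had at capture time.
input_addresses = [attn_out.data_ptr()]

# Warm-up before capture, then capture Submod_1 reading attn_out.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    _ = attn_out + 1.0
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):                        # the kernels record attn_out's address, not its identity
    submod_1_out = attn_out + 1.0
```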
┌────────────────────────────────────────────────────────────────┐
│ Second Run (Replay) │
│ ──────────────────────────────────────────────────────────── │
│ │
│ 1. Submod_0 CUDA Graph Replay │
│ └─ tensor is still at 0x1000 (guaranteed by CUDA Graph) │
│ │
│ 2. Attention (Eager, executed again) │
│ ├─ output = torch.zeros(...) → allocator *very likely* returns 0x2000 │
│ │ WHY? Because: │
│ │ • Same size request │
│ │ • 0x2000 was just freed back to the pool │
│ │ • No other interfering allocations │
│ │ • The caching allocator prioritizes reusing recently freed blocks │
│ └─ Execute attention kernel │
│ │
│ 3. Submod_1 CUDA Graph Replay │
│ ├─ Checks input address: expects 0x2000 │
│ ├─ Actual input address: 0x2000 ✓ │
│ │ (If not, an assert will fail in debug mode) │
│ └─ Replay CUDA Graph │
└────────────────────────────────────────────────────────────────┘
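And the replay-side counterpart: the check behind “an assert will fail in debug mode”, plus the deterministic alternative I am asking about (copying the eager attention output into a persistent buffer so the comparison can never fail). Again purely a hedged sketch of my understanding; `check_and_replay`, `persistent_attn_out`, and `feed_attention_output` are hypothetical names, not vLLM APIs:

```python
import torch

def check_and_replay(graph, inputs, expected_addresses):
    # The invariant replay relies on: every input must still live at the
    # address that was baked into the graph at capture time.
    actual = [t.data_ptr() for t in inputs]
    assert actual == expected_addresses, (
        f"CUDA Graph input moved: expected {[hex(a) for a in expected_addresses]}, "
        f"got {[hex(a) for a in actual]}"
    )
    graph.replay()

# The deterministic alternative: keep one persistent buffer and copy each step's
# eager attention output into it, so its address can never change.
persistent_attn_out = torch.zeros(16, 16, device="cuda")

def feed_attention_output(fresh_output: torch.Tensor) -> None:
    persistent_attn_out.copy_(fresh_output)   # contents change every step, address does not
```

If the actual mechanism is closer to one of these sketches than the other, a pointer to the relevant code would already answer my question.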