Hi, vLLM Team, I'm a student interested in v1's features, especially the piecewise torch.compile mechanism. I've been reviewing the code for a few days to improve my understanding, but I'm still puzzled by some details. For example, you implemented the custom outplace_all_reduce used with the piecewise CUDA graph on a single node, and I'm confused about the addresses registered in custom_all_reduce.py versus the CUDA graph's address management in backends.py: are they related or not? Also, during the prefill stage it seems a CUDA graph is applied only if the number of uncached tokens is less than a certain batch size, so which shape is this: the max captured decoding batch size, or the general shape (e.g., 8192 in chunked prefill)?
The addresses registered in custom_all_reduce.py (for the custom out-of-place allreduce) and the CUDA graph's address management in backends.py are related but serve different purposes. In custom_all_reduce.py, buffer addresses are registered to enable IPC and efficient allreduce operations during CUDA graph capture and replay, ensuring that the same memory addresses are used for communication across processes. This is crucial because CUDA graphs require all memory addresses used during capture to be identical during replay, and register_graph_buffers() is called at the end of the capture context to record these addresses (custom_all_reduce.py, API docs).

In backends.py, CUDA graph address management ensures that the input buffers for each captured graph (for each batch size) are reused and consistent, which is necessary for correct CUDA graph replay (backends.py). During the prefill stage, a CUDA graph is applied only if the batch size matches one of the captured sizes (e.g., [1, 2, 4, 8] or a user-specified list). The "shape" here refers to the batch size for which the CUDA graph was captured, not the general shape (like 8192 in chunked prefill). You can override which batch sizes are captured using the cudagraph_capture_sizes config (design doc).
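For example, here is a minimal sketch of overriding the captured sizes through the Python API (the model name is illustrative, and the exact config fields may differ across vLLM versions):

```python
# Sketch: restrict CUDA graph capture to a few small batch sizes.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",   # illustrative model
    compilation_config={
        # Only these batch sizes get a dedicated captured CUDA graph;
        # other shapes run through the general-shape compiled graph instead.
        "cudagraph_capture_sizes": [1, 2, 4, 8],
    },
)
```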
Would you like a step-by-step breakdown of how these address registrations interact during graph capture and replay?
Thanks for your warm explanation! As you mentioned, they are related but serve different purposes. However, I've noticed this comment in custom_all_reduce.py and haven't figured it out yet; can you elaborate on it? Also, the compiled function self.compiled_graph_for_general_shape shouldn't incur any IPC, or does it just mimic the behaviour? Please correct me if I've misunderstood.
It seems likely that the comment in custom_all_reduce.py about not deduplicating addresses during graph buffer registration refers to intentionally registering all buffer addresses used during CUDA graph capture, even if some addresses are reused. This is to ensure correctness across ranks, since different processes might have different allocation patterns, and CUDA IPC handles have internal reference counting to manage any overhead (custom_all_reduce.cuh).

Regarding self.compiled_graph_for_general_shape in backends.py, my understanding is that this function compiles the computation graph for a symbolic (general) shape and does not perform any IPC or CUDA graph capture. It only mimics the computation for generalization and warmup purposes; actual IPC and CUDA graph address registration only occur during concrete-shape graph capture and replay (backends.py).
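If it helps, here is a deliberately simplified, hypothetical sketch of that dispatch idea (illustrative only, not vLLM's actual code in backends.py):

```python
# Hypothetical dispatcher: general-shape compiled graph vs. captured CUDA graphs.
from typing import Callable, Dict
import torch

class PiecewiseDispatcher:
    def __init__(self, general_graph: Callable,
                 captured: Dict[int, torch.cuda.CUDAGraph]):
        self.general_graph = general_graph  # torch.compile'd symbolic-shape graph
        self.captured = captured            # batch_size -> captured CUDA graph

    def run(self, batch_size: int, *args):
        graph = self.captured.get(batch_size)
        if graph is None:
            # General shape: plain compiled execution; no CUDA graph replay and
            # no buffer-address registration happens on this path.
            return self.general_graph(*args)
        # Concrete shape: replay reuses the registered static buffers
        # (inputs, outputs, and the custom-allreduce IPC buffers).
        # Outputs are read back from the graph's static output buffers.
        graph.replay()
```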
Would you like a more detailed step-by-step explanation of the registration and compilation process?
Yeah, a more detailed step-by-step explanation of the registration and compilation process would be welcome.
My understanding is that the registration and compilation process in vLLM V1 with piecewise CUDA graphs and custom allreduce works as follows:
- Graph Compilation (General Shape):
  self.compiled_graph_for_general_shape compiles the model's computation graph for a symbolic (general) shape using torch.compile. This step does not involve CUDA graph capture or IPC buffer registration; it simply prepares a generic, reusable computation graph (backends.py, design doc).
- Concrete Shape Compilation and CUDA Graph Capture:
  - For each batch size specified in cudagraph_capture_sizes, vLLM compiles the graph for that concrete shape and then captures a CUDA graph.
  - During CUDA graph capture, all input/output buffer addresses (including those used by custom allreduce) are registered. This is done so that, during replay, the same memory addresses are used, which is a requirement for CUDA graphs (custom_all_reduce.py).
  - The registration is handled by calling register_graph_buffers at the end of the capture context, ensuring all relevant addresses are tracked for IPC and CUDA graph replay.
- Replay (see the sketch after this list):
  - When a batch of the same size is processed, vLLM replays the captured CUDA graph, reusing the registered buffer addresses for both computation and custom allreduce.
  - This ensures efficient execution and correct inter-process communication, as the CUDA graph and custom allreduce both rely on the same memory layout and addresses.
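As a rough illustration of the capture/replay pattern in plain PyTorch (not vLLM's code; a single linear layer stands in for one piecewise segment, and vLLM's buffer registration happens inside its own capture context):

```python
# Minimal capture/replay pattern with static buffers (illustrative only).
import torch

model = torch.nn.Linear(4096, 4096).cuda().eval()
static_input = torch.zeros(8, 4096, device="cuda")   # fixed address, reused at replay

# Warmup on a side stream so lazy allocations happen before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    model(static_input)
torch.cuda.current_stream().wait_stream(s)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = model(static_input)

# Replay for a new batch of the captured size: write into the same address.
static_input.copy_(torch.randn(8, 4096, device="cuda"))
graph.replay()
result = static_output.clone()   # read results out of the static output buffer
```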
Let me know if you want code snippets or further breakdown of any specific step.
Why does piecewise compilation here need the out-of-place allreduce rather than the in-place one?
Piecewise compilation in vLLM requires out-of-place allreduce because the PyTorch custom ops used in compiled graphs cannot both mutate an input in place and return a tensor from the same op. This restriction means the allreduce operation must always be out-of-place, producing a new tensor as output rather than modifying the input tensor directly (parallel_state.py, custom_all_reduce.py).
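For illustration only (this is not vLLM's actual registration in parallel_state.py; it assumes PyTorch 2.4+ and an initialized process group), an out-of-place allreduce custom op looks roughly like this:

```python
# Illustrative out-of-place allreduce custom op.
import torch
import torch.distributed as dist

@torch.library.custom_op("demo::outplace_all_reduce", mutates_args=())
def outplace_all_reduce(x: torch.Tensor) -> torch.Tensor:
    out = x.clone()        # produce a new output tensor instead of mutating the input
    dist.all_reduce(out)   # reduce into the fresh buffer
    return out

@outplace_all_reduce.register_fake
def _(x: torch.Tensor) -> torch.Tensor:
    # Shape/dtype metadata so torch.compile can trace through the op.
    return torch.empty_like(x)
```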
Would you like more detail on how this is implemented or why PyTorch imposes this limitation?
why PyTorch imposes this limitation
PyTorch imposes this limitation because custom ops in torch.compile (Dynamo/Inductor) cannot safely mutate an input in place while also returning a tensor from the same op. This is due to the way PyTorch's graph capture and optimization work: in-place mutations can break graph assumptions, aliasing analysis, and memory safety, making it difficult for the compiler to reason about tensor lifetimes and optimize execution (parallel_state.py, PR 9061).
As a result, vLLM’s custom allreduce is implemented as an out-of-place operation to ensure compatibility with PyTorch’s compilation and graph capture mechanisms.
How does the performance of custom_allreduce compare with pynccl or the torch.distributed package?
Custom allreduce in vLLM generally outperforms both PyTorch's torch.distributed and the pynccl wrapper, especially for small batch sizes and on NVLink/NVSwitch-connected GPUs. Benchmarks show up to 5–10% throughput and latency improvement over NCCL-based allreduce, and it can also reduce memory usage (PR 2192, PR 2760). For large batch sizes or PCIe-only multi-GPU setups, the performance gain is smaller, and NCCL may be preferable in some cases.
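If you want to measure this on your own hardware, one simple approach is to run the same workload twice, once with the custom allreduce disabled (the model name below is illustrative):

```python
# A/B sketch: force the NCCL-based allreduce path and compare throughput
# against a run without the flag.
from vllm import LLM

llm = LLM(
    "meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=2,
    disable_custom_all_reduce=True,   # fall back to NCCL/pynccl allreduce
)
```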
Would you like more detailed benchmark data or configuration examples?
Where is self.kv_cache updated in the v1 engine?
In vLLM v1, self.kv_cache is updated within the attention layer's forward method, where the KV cache is accessed and written to during each forward pass. The actual update occurs in the attention implementation, using the self.kv_cache attribute, which is managed per layer (layer.py).
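Here is a deliberately simplified, hypothetical view of what that per-layer update amounts to (the real code lives in the attention backend's forward and uses fused cache-write kernels rather than Python indexing; the cache layout below is illustrative):

```python
# Hypothetical sketch of the per-layer KV-cache write (not vLLM's actual code).
import torch

def write_kv(kv_cache: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
             slot_mapping: torch.Tensor) -> None:
    key_cache, value_cache = kv_cache.unbind(0)   # assumed [2, num_slots, ...] layout
    # Scatter this step's keys/values into the slots assigned by the scheduler.
    key_cache[slot_mapping] = k
    value_cache[slot_mapping] = v
```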
Would you like a step-by-step trace of how and where this update happens in the code?
I find it slower than v0; what do you think is the root cause?
vLLM v1 can be slower than v0 in some scenarios, especially for single requests or when multi-step streaming is used, due to architectural changes like unified scheduling and chunked prefill always being enabled. Some users report higher first-token latency (TTFT) and slower throughput for certain workloads (issue 11649, issue 17540). The root causes include less aggressive multi-step streaming, different batching policies, and overhead from new features.
Would you like a detailed breakdown of these causes and possible mitigations?
How do I use full_cuda_graph, and what is the difference?
To use full_cuda_graph in vLLM, set compilation_config={"full_cuda_graph": True} when launching the server or creating the LLM instance. This enables capturing the entire forward pass, including attention, in a single CUDA graph, rather than splitting at attention layers. Currently, only FlashAttention 3 is compatible, and cascade attention must be disabled (design doc, PR 16072).

The main difference is that full_cuda_graph can improve performance for smaller models by reducing kernel launch overhead, but it may disable some attention optimizations and is less flexible than piecewise capture. Use it only when your attention backend supports it.
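For instance, via the Python API (the model name is illustrative; this only works if your attention backend is FlashAttention 3):

```python
# Sketch: enable full CUDA graph capture.
from vllm import LLM

llm = LLM(
    "meta-llama/Llama-3.1-8B-Instruct",
    compilation_config={"full_cuda_graph": True},
)
```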
Would you like a code example or more technical details?
more technical details
Certainly! Here’s a step-by-step technical breakdown of full_cuda_graph in vLLM v1:
- Piecewise vs. Full CUDA Graph:
  - By default, vLLM v1 uses "piecewise" CUDA graph capture: it splits the computation graph at attention layers, capturing CUDA graphs for the segments before, between, and after attention. Attention runs in eager mode for flexibility (design doc).
  - With full_cuda_graph, the entire forward pass, including attention, is captured in a single CUDA graph. This is only possible if the attention backend is CUDA graph-compatible (currently, only FlashAttention 3 with cascade attention disabled).
- How to Enable:
  - Pass --compilation-config "{'full_cuda_graph': True}" to vllm serve, or set compilation_config={"full_cuda_graph": True} in the Python API.
  - Internally, vLLM checks that the attention backend is FlashAttention 3 and disables cascade attention. If not, it raises an error (PR 16072).
- Technical Implications:
  - Performance: Full CUDA graph can reduce kernel launch overhead, especially for small models or batch sizes, by replaying the entire computation in one go.
  - Limitations: Disables some attention optimizations (like cascade attention) and may not benefit large models or complex attention patterns, as FlashAttention's internal heuristics are bypassed.
  - Memory: Full CUDA graph may use more memory, as all intermediate buffers for the entire forward pass are retained for replay.
- Fallback:
  - If the attention backend is not compatible, vLLM falls back to piecewise CUDA graph capture (a simplified sketch of this decision follows below).
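As a simplified, hypothetical sketch of that decision (the function and backend name are illustrative, not vLLM's actual API; the real validation lives in vLLM's config and attention-backend checks):

```python
# Illustrative only: the full-vs-piecewise decision described above.
def resolve_cudagraph_mode(full_cuda_graph: bool, attn_backend: str,
                           cascade_attention: bool) -> str:
    supported = (attn_backend == "FLASH_ATTN_V3") and not cascade_attention
    if full_cuda_graph and supported:
        return "full"
    return "piecewise"   # incompatible setups keep the piecewise split at attention
```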
Let me know if you want a code snippet or further details on graph capture internals.