Questions on piecewise torch compile design

why can't lm_head be piecewise compiled?

It seems likely that lm_head cannot be piecewise compiled because it is not a custom op or a natural graph boundary like attention, and thus is not included in the list of splitting ops for piecewise compilation. Piecewise compilation in vLLM is designed to split the graph at specific operations (e.g., attention) that benefit from eager execution or have special requirements, while lm_head is typically a simple linear layer that does not require such treatment (design doc).
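
As a hedged sketch (the CompilationConfig fields and attention op names are taken from vLLM's compilation config and may differ between versions; the model name is only an example), the split points look roughly like this. Note that lm_head is a plain linear layer inside the final piece, not one of the boundary ops:

from vllm import LLM
from vllm.config import CompilationConfig

# Piecewise compilation splits the fx graph only at the listed ops; lm_head
# never appears here, so it stays inside the last compiled piece.
llm = LLM(
    model="Qwen/Qwen3-8B",
    compilation_config=CompilationConfig(
        level=3,  # piecewise compilation
        splitting_ops=[
            "vllm.unified_attention",
            "vllm.unified_attention_with_output",
        ],
    ),
)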

Would you like more detail on how splitting ops are chosen or how to customize them?

so will CUDA graph capture the lm_head region?

No, CUDA graph capture in vLLM does not include the lm_head region. The computation graph for CUDA graph capture only covers the model from input IDs/embeddings up to the final hidden states; the lm_head projection and sampling operations are explicitly excluded from the captured graph (design doc).
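
A minimal toy sketch of that boundary (plain PyTorch, not vLLM code; compute_logits is just a stand-in name for the lm_head call):

import torch
import torch.nn as nn

class ToyModel(nn.Module):
    # Toy stand-in: the "model forward" ends at hidden states; lm_head is separate.
    def __init__(self, hidden=64, vocab=1000):
        super().__init__()
        self.backbone = nn.Linear(hidden, hidden)
        self.lm_head = nn.Linear(hidden, vocab, bias=False)

    def forward(self, x):
        return self.backbone(x)          # this part is captured / compiled

    def compute_logits(self, hidden):
        return self.lm_head(hidden)      # runs eagerly, outside the graph

model = ToyModel()
hidden = model(torch.randn(2, 64))         # inside the captured region
logits = model.compute_logits(hidden)      # lm_head projection, not captured
next_token = torch.argmax(logits, dim=-1)  # sampling also happens outside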

Would you like more detail on what is included in the captured graph?

and will the v0 engine capture lm_head and the logits processor?

In vLLM v0, CUDA graph capture includes the lm_head region but not the logits processor. The lm_head (final projection layer) is part of the model's forward pass and is included in the captured CUDA graph, while logits processing (such as temperature scaling or penalties) and sampling are performed outside the captured graph in Python for flexibility (design doc, issue 1019, custom_all_reduce.py).

Would you like more detail on the v0 graph capture process?

what would be the benefit of capturing lm_head in the CUDA graph in future development?

Capturing lm_head in the CUDA graph can reduce kernel launch overhead and improve end-to-end latency, especially for small batch sizes or low-latency scenarios. This is because lm_head is a simple linear projection and including it in the graph allows for more efficient, fused execution with the rest of the model (issue 1019).

However, the benefit is modest for large batch sizes or throughput-bound workloads, since most compute time is spent in the main transformer layers. The main advantage is in latency-sensitive use cases.
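
As a rough standalone illustration of where the saving would come from (plain PyTorch CUDA graph APIs, not vLLM code), capturing the projection once and replaying it removes the per-step launch overhead for that kernel:

import torch

hidden_size, vocab_size = 4096, 151936
lm_head = torch.nn.Linear(hidden_size, vocab_size, bias=False,
                          device="cuda", dtype=torch.float16)
static_hidden = torch.zeros(1, hidden_size, device="cuda", dtype=torch.float16)

# Warm up on a side stream so lazy initialization is not captured.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        lm_head(static_hidden)
torch.cuda.current_stream().wait_stream(s)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_logits = lm_head(static_hidden)

# Each decode step: copy the new hidden state into the static buffer and replay.
static_hidden.copy_(torch.randn_like(static_hidden))
graph.replay()
next_token = static_logits.argmax(dim=-1)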

Would you like more detail on when this optimization is most impactful?

why does piecewise compile disable custom ops?

Piecewise compilation disables most custom ops (except attention) because custom ops can interfere with graph splitting and optimization in torch.compile. Inductor and CUDA graph capture require clear graph boundaries, and custom ops may introduce side effects or in-place mutations that break assumptions needed for correct piecewise execution (PR 9058, custom_op.py).
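
A hedged example of re-enabling a specific custom op under piecewise compilation (the custom_ops syntax with "none"/"all" and "+"/"-" prefixes follows vLLM's CompilationConfig and may vary between versions):

from vllm import LLM
from vllm.config import CompilationConfig

llm = LLM(
    model="Qwen/Qwen3-8B",
    compilation_config=CompilationConfig(
        level=3,
        # Piecewise compilation defaults to disabling custom ops so Inductor
        # sees plain PyTorch ops it can fuse; "+rms_norm" re-enables vLLM's
        # fused RMSNorm kernel as an example.
        custom_ops=["none", "+rms_norm"],
    ),
)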

Would you like more detail on how to control which custom ops are enabled?

when compiling with enable_sequence_parallelism, why does the computation graph seem unchanged apart from moving rmsnorm's implementation to a CUDA kernel? The allreduce was not split into reduce-scatter and all-gather on the Qwen3 model.

When compiling with enable_sequence_parallelism, the computation graph may appear unchanged except for moving RMSNorm to a CUDA kernel because the sequence parallelism pass only rewrites certain patterns (e.g., matmul → allreduce → rmsnorm) if the model and kernel patterns match exactly. For Qwen3, if the model’s graph does not match the expected pattern or does not use the custom RMSNorm op, the allreduce will not be split into reduce-scatter and all-gather (PR 16155, sequence_parallelism.py).

This means sequence parallelism is only applied if the model uses the custom RMSNorm op and matches the pass’s pattern. Otherwise, the graph remains unchanged.
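
An illustrative before/after of that rewrite (the collective helpers and the fused norm are passed in as placeholders here, not vLLM's actual pattern-matcher code):

def before_pass(x, w, residual, rms_weight, all_reduce, fused_add_rms_norm):
    out = x @ w                    # row-parallel linear: partial sums per rank
    out = all_reduce(out)          # every rank materializes the full tensor
    return fused_add_rms_norm(out, residual, rms_weight)

def after_pass(x, w, residual, rms_weight, reduce_scatter, all_gather, fused_add_rms_norm):
    out = x @ w
    out = reduce_scatter(out, dim=0)   # each rank keeps num_tokens / tp_size rows
    out, residual = fused_add_rms_norm(out, residual, rms_weight)  # norm runs on the shard
    out = all_gather(out, dim=0)       # restore the full token dimension
    return out, residual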

Would you like more detail on how to enable or debug sequence parallelism for your model?

how to understand this message:
"sequence parallelism is not enabled for because symbolic shape can't be split"

Sequence parallelism is not enabled when the symbolic shape (the number of tokens in the batch) cannot be guaranteed to divide evenly by the tensor parallel size. The sequence parallelism pass checks divisibility of the token dimension; for a dynamic, symbolic shape this cannot be proven for every possible value, so the pass skips the rewrite and reduce-scatter/all-gather are not inserted (sequence_parallelism.py).
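
A simplified illustration of the check (not vLLM's actual code): the pass must be able to prove that the token dimension splits evenly across tensor-parallel ranks.

def can_split_sequence(num_tokens, tp_size):
    # With a concrete (static) shape this is a plain modulo check; with a
    # symbolic shape it cannot be proven for all values, so the pass
    # conservatively skips the rewrite.
    return isinstance(num_tokens, int) and num_tokens % tp_size == 0

can_split_sequence(4096, 4)   # True  -> reduce-scatter/all-gather inserted
can_split_sequence("s0", 4)   # symbolic placeholder -> False, graph unchanged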

Would you like more detail on how this check is performed or how to debug it?

how does the fused_add_rms_norm pattern work for a weight that is a 2-dim tensor [4, 4], given that the CUDA kernel interface only accepts a 1-dim rms_norm weight parameter?

It seems likely that the fused_add_rms_norm CUDA kernel in vLLM expects the weight parameter to be a 1-dimensional tensor matching the hidden size, not a 2D tensor. If you pass a 2D weight tensor like [4, 4], it will raise a shape mismatch error, since the kernel and the Python wrapper both check that weight’s shape matches the last dimension of the input (layernorm_kernels.cu, layernorm.py).
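
A hedged shape sketch (the RMSNorm import path is assumed from vLLM's layout and may move between versions):

import torch
from vllm.model_executor.layers.layernorm import RMSNorm

hidden_size = 4
x = torch.randn(8, hidden_size, device="cuda", dtype=torch.float16)   # [num_tokens, hidden]
residual = torch.randn_like(x)

norm = RMSNorm(hidden_size, eps=1e-6).to(device="cuda", dtype=torch.float16)
print(norm.weight.shape)        # torch.Size([4]): 1-D, matching x.shape[-1]

out, new_residual = norm(x, residual)   # dispatches to the fused_add_rms_norm kernel

# A [4, 4] weight is not accepted directly: the kernel reads weight as a
# contiguous 1-D buffer of length hidden_size, so a 2-D tensor must be
# reduced or reshaped to shape [hidden_size] before it can be used.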

Would you like more detail on how to adapt your weight tensor or the expected input shapes?

how to force the prefill stage to run without CUDA graph?

To force the prefill stage to run without CUDA graph in vLLM, set enforce_eager=True when creating the LLM instance, or use the --enforce-eager flag in the CLI. This disables CUDA graph compilation for all stages, including prefill (optimization.md, issue 3137).

Example:

from vllm import LLM
llm = LLM(model="your-model", enforce_eager=True)
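
On the command line, the same effect comes from the --enforce-eager flag, e.g. vllm serve your-model --enforce-eager.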

Would you like more detail on the effects or alternatives?

why we need _set_default_torch_dtype when loading weights?