Questions on piecewise torch compile design

why can't lm_head be piecewise compiled?

It seems likely that lm_head cannot be piecewise compiled because it is not a custom op or a natural graph boundary like attention, and thus is not included in the list of splitting ops for piecewise compilation. Piecewise compilation in vLLM is designed to split the graph at specific operations (e.g., attention) that benefit from eager execution or have special requirements, while lm_head is typically a simple linear layer that does not require such treatment (design doc).
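
As a hedged sketch (the CompilationConfig fields and attention op names are taken from vLLM's compilation config and may differ between versions; the model name is only an example), the split points look roughly like this. Note that lm_head is a plain linear layer inside the final piece, not one of the boundary ops:

from vllm import LLM
from vllm.config import CompilationConfig

# Piecewise compilation splits the fx graph only at the listed ops; lm_head
# never appears here, so it stays inside the last compiled piece.
llm = LLM(
    model="Qwen/Qwen3-8B",
    compilation_config=CompilationConfig(
        level=3,  # piecewise compilation
        splitting_ops=[
            "vllm.unified_attention",
            "vllm.unified_attention_with_output",
        ],
    ),
)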

Would you like more detail on how splitting ops are chosen or how to customize them?

so will CUDA graph capture the lm_head region?

No, CUDA graph capture in vLLM does not include the lm_head region. The computation graph for CUDA graph capture only covers the model from input IDs/embeddings up to the final hidden states; the lm_head projection and sampling operations are explicitly excluded from the captured graph (design doc).
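
A minimal toy sketch of that boundary (plain PyTorch, not vLLM code; compute_logits is just a stand-in name for the lm_head call):

import torch
import torch.nn as nn

class ToyModel(nn.Module):
    # Toy stand-in: the "model forward" ends at hidden states; lm_head is separate.
    def __init__(self, hidden=64, vocab=1000):
        super().__init__()
        self.backbone = nn.Linear(hidden, hidden)
        self.lm_head = nn.Linear(hidden, vocab, bias=False)

    def forward(self, x):
        return self.backbone(x)          # this part is captured / compiled

    def compute_logits(self, hidden):
        return self.lm_head(hidden)      # runs eagerly, outside the graph

model = ToyModel()
hidden = model(torch.randn(2, 64))         # inside the captured region
logits = model.compute_logits(hidden)      # lm_head projection, not captured
next_token = torch.argmax(logits, dim=-1)  # sampling also happens outside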

Would you like more detail on what is included in the captured graph?

and will the v0 engine capture lm_head and the logits processor?

In vLLM v0, CUDA graph capture includes the lm_head region but not the logits processor. The lm_head (final projection layer) is part of the model's forward pass and is included in the captured CUDA graph, while logits processing (such as temperature scaling or penalties) and sampling are performed outside the captured graph in Python for flexibility (design doc, issue 1019, custom_all_reduce.py).

Would you like more detail on the v0 graph capture process?

what would be the benefit of capturing lm_head in the CUDA graph in future development?

Capturing lm_head in the CUDA graph can reduce kernel launch overhead and improve end-to-end latency, especially for small batch sizes or low-latency scenarios. This is because lm_head is a simple linear projection and including it in the graph allows for more efficient, fused execution with the rest of the model (issue 1019).

However, the benefit is modest for large batch sizes or throughput-bound workloads, since most compute time is spent in the main transformer layers. The main advantage is in latency-sensitive use cases.
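
As a rough standalone illustration of where the saving would come from (plain PyTorch CUDA graph APIs, not vLLM code), capturing the projection once and replaying it removes the per-step launch overhead for that kernel:

import torch

hidden_size, vocab_size = 4096, 151936
lm_head = torch.nn.Linear(hidden_size, vocab_size, bias=False,
                          device="cuda", dtype=torch.float16)
static_hidden = torch.zeros(1, hidden_size, device="cuda", dtype=torch.float16)

# Warm up on a side stream so lazy initialization is not captured.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        lm_head(static_hidden)
torch.cuda.current_stream().wait_stream(s)

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_logits = lm_head(static_hidden)

# Each decode step: copy the new hidden state into the static buffer and replay.
static_hidden.copy_(torch.randn_like(static_hidden))
graph.replay()
next_token = static_logits.argmax(dim=-1)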

Would you like more detail on when this optimization is most impactful?

why does piecewise compile disable custom ops?

Piecewise compilation disables most custom ops (except attention) because custom ops can interfere with graph splitting and optimization in torch.compile. Inductor and CUDA graph capture require clear graph boundaries, and custom ops may introduce side effects or in-place mutations that break assumptions needed for correct piecewise execution (PR 9058, custom_op.py).
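
A hedged example of re-enabling a specific custom op under piecewise compilation (the custom_ops syntax with "none"/"all" and "+"/"-" prefixes follows vLLM's CompilationConfig and may vary between versions):

from vllm import LLM
from vllm.config import CompilationConfig

llm = LLM(
    model="Qwen/Qwen3-8B",
    compilation_config=CompilationConfig(
        level=3,
        # Piecewise compilation defaults to disabling custom ops so Inductor
        # sees plain PyTorch ops it can fuse; "+rms_norm" re-enables vLLM's
        # fused RMSNorm kernel as an example.
        custom_ops=["none", "+rms_norm"],
    ),
)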

Would you like more detail on how to control which custom ops are enabled?

when compiling with enable_sequence_parallelism, why does the computation graph seem unchanged apart from moving rmsnorm's implementation to a CUDA kernel? The allreduce was not split into reduce-scatter and all-gather on the Qwen3 model.

When compiling with enable_sequence_parallelism, the computation graph may appear unchanged except for moving RMSNorm to a CUDA kernel because the sequence parallelism pass only rewrites certain patterns (e.g., matmul → allreduce → rmsnorm) if the model and kernel patterns match exactly. For Qwen3, if the model’s graph does not match the expected pattern or does not use the custom RMSNorm op, the allreduce will not be split into reduce-scatter and all-gather (PR 16155, sequence_parallelism.py).

This means sequence parallelism is only applied if the model uses the custom RMSNorm op and matches the pass’s pattern. Otherwise, the graph remains unchanged.
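
An illustrative before/after of that rewrite (the collective helpers and the fused norm are passed in as placeholders here, not vLLM's actual pattern-matcher code):

def before_pass(x, w, residual, rms_weight, all_reduce, fused_add_rms_norm):
    out = x @ w                    # row-parallel linear: partial sums per rank
    out = all_reduce(out)          # every rank materializes the full tensor
    return fused_add_rms_norm(out, residual, rms_weight)

def after_pass(x, w, residual, rms_weight, reduce_scatter, all_gather, fused_add_rms_norm):
    out = x @ w
    out = reduce_scatter(out, dim=0)   # each rank keeps num_tokens / tp_size rows
    out, residual = fused_add_rms_norm(out, residual, rms_weight)  # norm runs on the shard
    out = all_gather(out, dim=0)       # restore the full token dimension
    return out, residual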

Would you like more detail on how to enable or debug sequence parallelism for your model?

how to understand this message:
"sequence parallelism is not enabled for because symbolic shape can't be split"

Sequence parallelism is not enabled when the symbolic shape (the number of tokens in the batch) cannot be guaranteed to divide evenly by the tensor parallel size. The sequence parallelism pass checks divisibility of the token dimension; for a dynamic, symbolic shape this cannot be proven for every possible value, so the pass skips the rewrite and reduce-scatter/all-gather are not inserted (sequence_parallelism.py).
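
A simplified illustration of the check (not vLLM's actual code): the pass must be able to prove that the token dimension splits evenly across tensor-parallel ranks.

def can_split_sequence(num_tokens, tp_size):
    # With a concrete (static) shape this is a plain modulo check; with a
    # symbolic shape it cannot be proven for all values, so the pass
    # conservatively skips the rewrite.
    return isinstance(num_tokens, int) and num_tokens % tp_size == 0

can_split_sequence(4096, 4)   # True  -> reduce-scatter/all-gather inserted
can_split_sequence("s0", 4)   # symbolic placeholder -> False, graph unchanged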

Would you like more detail on how this check is performed or how to debug it?

how does the fused_add_rms_norm pattern work for a weight that is a 2-dim tensor [4, 4], given that the CUDA kernel interface only accepts a 1-dim rms_norm weight parameter?

It seems likely that the fused_add_rms_norm CUDA kernel in vLLM expects the weight parameter to be a 1-dimensional tensor matching the hidden size, not a 2D tensor. If you pass a 2D weight tensor like [4, 4], it will raise a shape mismatch error, since the kernel and the Python wrapper both check that weight’s shape matches the last dimension of the input (layernorm_kernels.cu, layernorm.py).
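
A hedged shape sketch (the RMSNorm import path is assumed from vLLM's layout and may move between versions):

import torch
from vllm.model_executor.layers.layernorm import RMSNorm

hidden_size = 4
x = torch.randn(8, hidden_size, device="cuda", dtype=torch.float16)   # [num_tokens, hidden]
residual = torch.randn_like(x)

norm = RMSNorm(hidden_size, eps=1e-6).to(device="cuda", dtype=torch.float16)
print(norm.weight.shape)        # torch.Size([4]): 1-D, matching x.shape[-1]

out, new_residual = norm(x, residual)   # dispatches to the fused_add_rms_norm kernel

# A [4, 4] weight is not accepted directly: the kernel reads weight as a
# contiguous 1-D buffer of length hidden_size, so a 2-D tensor must be
# reduced or reshaped to shape [hidden_size] before it can be used.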

Would you like more detail on how to adapt your weight tensor or the expected input shapes?

how to force the prefill stage to run without CUDA graph?

To force the prefill stage to run without CUDA graph in vLLM, set enforce_eager=True when creating the LLM instance, or use the --enforce-eager flag in the CLI. This disables CUDA graph compilation for all stages, including prefill (optimization.md, issue 3137).

Example:

from vllm import LLM
llm = LLM(model="your-model", enforce_eager=True)
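
On the command line, the same effect comes from the --enforce-eager flag, e.g. vllm serve your-model --enforce-eager.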

Would you like more detail on the effects or alternatives?

why we need _set_default_torch_dtype when loading weights?