_set_default_torch_dtype is used when loading weights to ensure that newly created tensors (such as empty or meta tensors) have the correct dtype matching the model’s expected precision (e.g., float16, bfloat16, or float32). This prevents dtype mismatches and potential numerical errors during weight initialization and loading (utils.py, docs).
Would you like more detail on when and how this context manager is applied?
Certainly! Here’s a step-by-step explanation of why and how _set_default_torch_dtype is used when loading weights in vLLM:
Purpose:
When loading a model, vLLM may need to create new tensors (e.g., initializing model parameters as empty or meta tensors before loading actual weights). By default, PyTorch uses float32 for new tensors unless the default dtype is changed. If your model expects float16 or bfloat16 (common for efficient inference), creating tensors with the wrong dtype can cause mismatches or errors during weight assignment (utils.py).
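To illustrate the point, here is a minimal, standalone PyTorch sketch (not vLLM code; behavior shown for recent PyTorch versions that accept float16 as a default dtype):

import torch

# New floating-point tensors default to float32
print(torch.empty(2).dtype)   # torch.float32

# Changing the global default changes the dtype of tensors created afterwards
torch.set_default_dtype(torch.float16)
print(torch.empty(2).dtype)   # torch.float16

# Restore the original default
torch.set_default_dtype(torch.float32)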
How It Works:
vLLM wraps the model initialization and weight loading code in a context manager called set_default_torch_dtype. This temporarily sets the global default dtype (e.g., to torch.float16) so that all new tensors created within this block have the correct dtype. After exiting the context, the default dtype is restored to its previous value. This ensures that all model parameters and buffers are initialized with the intended precision (docs).
Example usage:
import torch
from vllm.model_executor.model_loader.utils import set_default_torch_dtype

with set_default_torch_dtype(torch.float16):
    # All new tensors created here will be float16
    model = MyModel()  # placeholder for the actual model class
    # ... load weights, etc.

# After this block, the default dtype is restored
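For reference, such a context manager can be written in a few lines. The following is a simplified sketch of the pattern (an approximation, not vLLM's actual source), assuming it only needs to save and restore the global default dtype:

import contextlib
import torch

@contextlib.contextmanager
def _default_dtype_sketch(dtype: torch.dtype):
    # Remember the current global default dtype
    old_dtype = torch.get_default_dtype()
    # Switch to the requested dtype for everything created inside the block
    torch.set_default_dtype(dtype)
    try:
        yield
    finally:
        # Restore the previous default even if an exception was raised
        torch.set_default_dtype(old_dtype)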
This approach prevents subtle bugs and ensures numerical consistency across the model.
No, vLLM v1 does not use torch.distributed.all_reduce in eager mode. Instead, it uses a custom allreduce implementation (such as pynccl or its own custom CUDA kernel) in both eager and CUDA graph modes; support for torch.distributed.all_reduce was removed for compatibility and performance reasons (PR 10121).
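For context, here is a rough sketch of the difference (the vllm.distributed import path and function name are assumptions about the current module layout and may change between versions):

import torch
import torch.distributed as dist
# Assumed import path for vLLM's own all-reduce wrapper:
from vllm.distributed import tensor_model_parallel_all_reduce

def naive_all_reduce(partial_out: torch.Tensor) -> torch.Tensor:
    # The plain PyTorch collective that vLLM v1 no longer uses on the hot path
    dist.all_reduce(partial_out, op=dist.ReduceOp.SUM)
    return partial_out

def vllm_all_reduce(partial_out: torch.Tensor) -> torch.Tensor:
    # vLLM's wrapper picks the custom CUDA kernel or the pynccl backend under
    # the hood, and works in both eager and CUDA graph modes
    return tensor_model_parallel_all_reduce(partial_out)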
Would you like more detail on the custom allreduce implementation or its advantages?
It seems likely that vLLM supports custom torch.compile backends, and you can provide deterministic embeddings as input using the prompt embedding feature. You can pass prompt embeddings directly via the prompt_embeds field in offline or online inference (prompt_embeds docs). For custom torch.compile backends, you can specify the backend string in the compilation config (compilation config docs).
However, the data sources do not explicitly cover combining a custom torch.compile backend with prompt embeddings for deterministic input; the two features are documented independently.
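As an illustrative sketch only (the enable_prompt_embeds flag, the CompilationConfig backend field, and the prompt_embeds input format are taken from the linked docs but should be treated as assumptions that may differ by version), using the two features would look roughly like this:

import torch
from vllm import LLM
from vllm.config import CompilationConfig

llm = LLM(
    model="facebook/opt-125m",
    enable_prompt_embeds=True,  # assumed flag that enables prompt-embedding inputs
    compilation_config=CompilationConfig(backend="eager"),  # assumed field; any torch.compile backend string
)

# A fixed (seq_len, hidden_size) tensor gives deterministic, unchanging inputs
prompt_embeds = torch.zeros(8, 768, dtype=torch.float16)

outputs = llm.generate({"prompt_embeds": prompt_embeds})
print(outputs[0].outputs[0].text)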
Would you like more detail on configuring custom backends or using prompt embeddings?
Correct: prompt embeddings are currently supported only in the v0 engine, so you cannot use them with the v1 engine, even if the embeddings are deterministic or unchanged (prompt_embeds docs, v1 user guide). Work is ongoing to add prompt embedding support to v1, but it is not yet available (RFC #22124, PR #24278).
Would you like more detail on the status or roadmap for prompt embeddings in v1?