_set_default_torch_dtype is used when loading weights to ensure that newly created tensors (such as empty or meta tensors) have the correct dtype matching the model’s expected precision (e.g., float16, bfloat16, or float32). This prevents dtype mismatches and potential numerical errors during weight initialization and loading (utils.py, docs).
Would you like more detail on when and how this context manager is applied?
Certainly! Here’s a step-by-step explanation of why and how _set_default_torch_dtype is used when loading weights in vLLM:
Purpose:
When loading a model, vLLM may need to create new tensors (e.g., initializing model parameters as empty or meta tensors before loading actual weights). By default, PyTorch uses float32 for new tensors unless the default dtype is changed. If your model expects float16 or bfloat16 (common for efficient inference), creating tensors with the wrong dtype can cause mismatches or errors during weight assignment (utils.py).
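To illustrate the point, here is a minimal, standalone PyTorch sketch (not vLLM code; behavior shown for recent PyTorch versions that accept float16 as a default dtype):

import torch

# New floating-point tensors default to float32
print(torch.empty(2).dtype)   # torch.float32

# Changing the global default changes the dtype of tensors created afterwards
torch.set_default_dtype(torch.float16)
print(torch.empty(2).dtype)   # torch.float16

# Restore the original default
torch.set_default_dtype(torch.float32)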
How It Works:
vLLM wraps the model initialization and weight loading code in a context manager called set_default_torch_dtype. This temporarily sets the global default dtype (e.g., to torch.float16) so that all new tensors created within this block have the correct dtype. After exiting the context, the default dtype is restored to its previous value. This ensures that all model parameters and buffers are initialized with the intended precision (docs).
Example usage:
import torch
from vllm.model_executor.model_loader.utils import set_default_torch_dtype

with set_default_torch_dtype(torch.float16):
    # All new tensors created here will be float16
    model = MyModel()  # placeholder for the actual model class
    # ... load weights, etc.

# After this block, the default dtype is restored
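For reference, such a context manager can be written in a few lines. The following is a simplified sketch of the pattern (an approximation, not vLLM's actual source), assuming it only needs to save and restore the global default dtype:

import contextlib
import torch

@contextlib.contextmanager
def _default_dtype_sketch(dtype: torch.dtype):
    # Remember the current global default dtype
    old_dtype = torch.get_default_dtype()
    # Switch to the requested dtype for everything created inside the block
    torch.set_default_dtype(dtype)
    try:
        yield
    finally:
        # Restore the previous default even if an exception was raised
        torch.set_default_dtype(old_dtype)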
This approach prevents subtle bugs and ensures numerical consistency across the model.
No, vLLM v1 does not use torch.distributed.all_reduce in eager mode. Instead, it uses a custom allreduce implementation (such as pynccl or its own custom CUDA kernel) in both eager and CUDA graph modes; support for torch.distributed.all_reduce was removed for compatibility and performance reasons (PR 10121).
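For context, here is a rough sketch of the difference (the vllm.distributed import path and function name are assumptions about the current module layout and may change between versions):

import torch
import torch.distributed as dist
# Assumed import path for vLLM's own all-reduce wrapper:
from vllm.distributed import tensor_model_parallel_all_reduce

def naive_all_reduce(partial_out: torch.Tensor) -> torch.Tensor:
    # The plain PyTorch collective that vLLM v1 no longer uses on the hot path
    dist.all_reduce(partial_out, op=dist.ReduceOp.SUM)
    return partial_out

def vllm_all_reduce(partial_out: torch.Tensor) -> torch.Tensor:
    # vLLM's wrapper picks the custom CUDA kernel or the pynccl backend under
    # the hood, and works in both eager and CUDA graph modes
    return tensor_model_parallel_all_reduce(partial_out)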
Would you like more detail on the custom allreduce implementation or its advantages?
It seems likely that vLLM supports custom torch.compile backends, and you can provide deterministic embeddings as input using the prompt embedding feature. You can pass prompt embeddings directly via the prompt_embeds field in offline or online inference (prompt_embeds docs). For custom torch.compile backends, you can specify the backend string in the compilation config (compilation config docs).
However, the data sources do not explicitly cover combining a custom torch.compile backend with prompt embeddings for deterministic input; the two features are documented independently.
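As an illustrative sketch only (the enable_prompt_embeds flag, the CompilationConfig backend field, and the prompt_embeds input format are taken from the linked docs but should be treated as assumptions that may differ by version), using the two features would look roughly like this:

import torch
from vllm import LLM
from vllm.config import CompilationConfig

llm = LLM(
    model="facebook/opt-125m",
    enable_prompt_embeds=True,  # assumed flag that enables prompt-embedding inputs
    compilation_config=CompilationConfig(backend="eager"),  # assumed field; any torch.compile backend string
)

# A fixed (seq_len, hidden_size) tensor gives deterministic, unchanging inputs
prompt_embeds = torch.zeros(8, 768, dtype=torch.float16)

outputs = llm.generate({"prompt_embeds": prompt_embeds})
print(outputs[0].outputs[0].text)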
Would you like more detail on configuring custom backends or using prompt embeddings?
Correct: prompt embeddings are currently supported only in the v0 engine, so you cannot use them with the v1 engine, even if the embeddings are deterministic or unchanged (prompt_embeds docs, v1 user guide). Work is ongoing to add prompt embedding support to v1, but it is not yet available (RFC #22124, PR #24278).
Would you like more detail on the status or roadmap for prompt embeddings in v1?