Hello. I am using a custom network and I wanted to use multiple GPUs, but it does not look straightforward for a custom model, and the error message I got is not helping. Can someone help with that? Thank you.
vLLM supports multi-GPU inference using tensor parallelism, but your custom model must be compatible with vLLM's parallelization requirements. Specifically, the sharded dimensions (e.g., the number of attention heads and the vocabulary size) must be divisible by the tensor parallel size, and the model must be implemented in a way that supports vLLM's distributed execution. If your custom model does not meet these requirements, you may encounter errors or unsupported behavior. The error message you received likely indicates such an incompatibility or a missing implementation detail for parallelism in your custom model code. See Parallelism and Scaling and Distributed Inference and Serving for details.
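For reference, this is how tensor parallelism is enabled for an already-supported model through the offline LLM API; the model name and GPU count below are only examples, and a custom model works the same way once it meets the requirements above:

```python
# Example only: tensor-parallel inference with a supported model on 2 GPUs.
# The same tensor_parallel_size argument applies to a custom model once it
# satisfies vLLM's implementation requirements.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=2)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```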
Could you provide the specific error message and clarify whether your custom model is based on a supported architecture (e.g., GPT, Llama, etc.)? This will help determine if the issue is due to model incompatibility or a configuration problem.
The link you shared does not provide any information about how to deal with a custom model.
You are correct—the linked pages focus on parallelism strategies for supported models, not custom model integration. For custom models, you must ensure your implementation follows vLLM’s requirements: all modules must accept a prefix argument, and you should adapt your model’s initialization and forward methods as described in the Implementing a Basic Model guide. If your model does not follow these conventions, multi-GPU (tensor parallel) support will not work and may produce unclear errors.
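As a brief illustration of that convention, here is a minimal sketch (the module names are hypothetical, not from vLLM or your model): every constructor accepts a `prefix` and hands an extended prefix to its children.

```python
from torch import nn


class CustomMLP(nn.Module):
    def __init__(self, hidden_size: int, prefix: str = ""):
        super().__init__()
        self.prefix = prefix  # e.g. "model.layers.0.mlp"
        self.proj = nn.Linear(hidden_size, hidden_size)


class CustomBlock(nn.Module):
    def __init__(self, hidden_size: int, prefix: str = ""):
        super().__init__()
        # Children receive the parent's prefix plus their attribute name, so
        # vLLM can match checkpoint weights and per-layer quantization configs.
        self.mlp = CustomMLP(hidden_size, prefix=f"{prefix}.mlp")
```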
Would you like a step-by-step breakdown of how to adapt a custom model for vLLM multi-GPU support?
Yes please, I need that.
To adapt your custom model for vLLM multi-GPU (tensor parallel) support, follow these steps:
- Model Structure: Ensure your model is implemented as a PyTorch `nn.Module` and that all submodules (especially attention and linear layers) accept a `prefix` argument in their constructors. This is required for vLLM's runtime and quantization support. Example code is provided in the Implementing a Basic Model guide.
- Initialization: In your model's `__init__`, pass the `prefix` down to all submodules. This allows vLLM to uniquely identify and manage each layer for parallelism and quantization.
- Computation Code: Implement a `get_input_embeddings` method in your model to return embeddings for given `input_ids`. Rewrite the `forward()` method to accept flattened `input_ids` and `positions` tensors, and remove any training-specific code.
- Parallel Layers: Replace standard linear and embedding layers with vLLM's tensor-parallel versions (e.g., `VocabParallelEmbedding`, `RowParallelLinear`, `ColumnParallelLinear`). This is necessary for sharding the model across GPUs. See the basic model guide for details.
- Weight Loading: Implement a `load_weights` method to load weights from HuggingFace checkpoints and assign them to the correct (possibly sharded) layers.
- Register Model: Register your model with vLLM so it can be used for inference and serving. A combined sketch of these steps follows this list.
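To make the steps concrete, here is a minimal sketch, not a drop-in implementation: the class and layer names `MyMLP` and `MyModelForCausalLM` are hypothetical; attention, the LM head, and `compute_logits` are omitted for brevity; and exact constructor and `forward()` signatures can differ between vLLM versions, so compare against the Implementing a Basic Model guide for your installed version.

```python
# Minimal sketch of a custom model adapted for vLLM tensor parallelism.
# Assumes a recent vLLM version with the VllmConfig-based constructor.
from typing import Iterable, Optional, Tuple

import torch
from torch import nn

from vllm import ModelRegistry
from vllm.config import VllmConfig
from vllm.model_executor.layers.linear import (ColumnParallelLinear,
                                               RowParallelLinear)
from vllm.model_executor.layers.vocab_parallel_embedding import (
    VocabParallelEmbedding)
from vllm.model_executor.model_loader.weight_utils import default_weight_loader


class MyMLP(nn.Module):
    """Every submodule takes `prefix` and forwards an extended prefix down."""

    def __init__(self, hidden_size: int, intermediate_size: int,
                 prefix: str = ""):
        super().__init__()
        # Column-parallel: the output dimension is sharded across GPUs.
        self.up_proj = ColumnParallelLinear(hidden_size, intermediate_size,
                                            bias=False,
                                            prefix=f"{prefix}.up_proj")
        # Row-parallel: the input dimension is sharded; outputs are all-reduced.
        self.down_proj = RowParallelLinear(intermediate_size, hidden_size,
                                           bias=False,
                                           prefix=f"{prefix}.down_proj")
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x, _ = self.up_proj(x)    # vLLM parallel linears return (output, bias)
        x = self.act(x)
        x, _ = self.down_proj(x)
        return x


class MyModelForCausalLM(nn.Module):

    def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
        super().__init__()
        config = vllm_config.model_config.hf_config
        # Vocab-parallel embedding: the vocabulary is sharded across GPUs,
        # so vocab_size must be divisible by the tensor parallel size.
        self.embed_tokens = VocabParallelEmbedding(
            config.vocab_size, config.hidden_size,
            prefix=f"{prefix}.embed_tokens")
        self.layers = nn.ModuleList([
            MyMLP(config.hidden_size, config.intermediate_size,
                  prefix=f"{prefix}.layers.{i}")
            for i in range(config.num_hidden_layers)
        ])

    def get_input_embeddings(self, input_ids: torch.Tensor) -> torch.Tensor:
        return self.embed_tokens(input_ids)

    def forward(self, input_ids: torch.Tensor, positions: torch.Tensor,
                inputs_embeds: Optional[torch.Tensor] = None) -> torch.Tensor:
        # vLLM passes flattened (1-D) input_ids/positions; no training logic.
        # Depending on the vLLM version, forward() may take extra arguments
        # (e.g. intermediate_tensors); follow the guide for your version.
        hidden_states = (inputs_embeds if inputs_embeds is not None
                         else self.get_input_embeddings(input_ids))
        for layer in self.layers:
            hidden_states = layer(hidden_states)
        return hidden_states

    def load_weights(self, weights: Iterable[Tuple[str, torch.Tensor]]):
        # Map HuggingFace checkpoint names onto (possibly sharded) parameters.
        params = dict(self.named_parameters())
        for name, loaded_weight in weights:
            if name not in params:
                continue  # weights for modules this sketch does not define
            param = params[name]
            weight_loader = getattr(param, "weight_loader",
                                    default_weight_loader)
            weight_loader(param, loaded_weight)


# Step 6: register the model so vLLM can resolve the architecture name found
# in the checkpoint's config.json (out-of-tree registration).
ModelRegistry.register_model("MyModelForCausalLM", MyModelForCausalLM)
```

Once registered, the model can be loaded like a built-in one, including with `tensor_parallel_size` set to the number of GPUs.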
For a full example and code snippets, see the Implementing a Basic Model documentation.