Hello. I am using a custom network and I wanted to use multiple GPUs, but it does not look straightforward for a custom model, and the error message I got is not helping. Can someone help with that? Thank you.
vLLM supports multi-GPU inference using tensor parallelism, but your custom model must be compatible with vLLM's parallelization requirements. Specifically, the sharded dimensions (e.g., the number of attention heads and the vocabulary size) must be divisible by the tensor parallel size, and the model must be implemented in a way that supports vLLM's distributed execution. If your custom model does not meet these requirements, you may encounter errors or unsupported behavior. The error message you received likely indicates such an incompatibility or a missing implementation detail for parallelism in your custom model code. See Parallelism and Scaling and Distributed Inference and Serving for details.
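For reference, this is how tensor parallelism is enabled for an already-supported model through the offline LLM API; the model name and GPU count below are only examples, and a custom model works the same way once it meets the requirements above:

```python
# Example only: tensor-parallel inference with a supported model on 2 GPUs.
# The same tensor_parallel_size argument applies to a custom model once it
# satisfies vLLM's implementation requirements.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=2)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```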
Could you provide the specific error message and clarify whether your custom model is based on a supported architecture (e.g., GPT, Llama, etc.)? This will help determine if the issue is due to model incompatibility or a configuration problem.
The link you shared does not provide any information about how to deal with a custom model.
You are correct—the linked pages focus on parallelism strategies for supported models, not custom model integration. For custom models, you must ensure your implementation follows vLLM’s requirements: all modules must accept a prefix argument, and you should adapt your model’s initialization and forward methods as described in the Implementing a Basic Model guide. If your model does not follow these conventions, multi-GPU (tensor parallel) support will not work and may produce unclear errors.
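As a brief illustration of that convention, here is a minimal sketch (the module names are hypothetical, not from vLLM or your model): every constructor accepts a `prefix` and hands an extended prefix to its children.

```python
from torch import nn


class CustomMLP(nn.Module):
    def __init__(self, hidden_size: int, prefix: str = ""):
        super().__init__()
        self.prefix = prefix  # e.g. "model.layers.0.mlp"
        self.proj = nn.Linear(hidden_size, hidden_size)


class CustomBlock(nn.Module):
    def __init__(self, hidden_size: int, prefix: str = ""):
        super().__init__()
        # Children receive the parent's prefix plus their attribute name, so
        # vLLM can match checkpoint weights and per-layer quantization configs.
        self.mlp = CustomMLP(hidden_size, prefix=f"{prefix}.mlp")
```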
Would you like a step-by-step breakdown of how to adapt a custom model for vLLM multi-GPU support?
Yes please, I need that.
To adapt your custom model for vLLM multi-GPU (tensor parallel) support, follow these steps:
- Model Structure: Ensure your model is implemented as a PyTorch `nn.Module` and that all submodules (especially attention and linear layers) accept a `prefix` argument in their constructors. This is required for vLLM's runtime and quantization support. Example code is provided in the Implementing a Basic Model guide.
- Initialization: In your model's `__init__`, pass the `prefix` down to all submodules. This allows vLLM to uniquely identify and manage each layer for parallelism and quantization.
- Computation Code: Implement a `get_input_embeddings` method in your model to return embeddings for given `input_ids`. Rewrite the `forward()` method to accept flattened `input_ids` and `positions` tensors, and remove any training-specific code.
- Parallel Layers: Replace standard linear and embedding layers with vLLM's tensor-parallel versions (e.g., `VocabParallelEmbedding`, `RowParallelLinear`, `ColumnParallelLinear`). This is necessary for sharding the model across GPUs. See the basic model guide for details.
- Weight Loading: Implement a `load_weights` method to load weights from HuggingFace checkpoints and assign them to the correct (possibly sharded) layers.
- Register Model: Register your model with vLLM so it can be used for inference and serving. A combined sketch of these steps follows this list.
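To make the steps concrete, here is a minimal sketch, not a drop-in implementation: the class and layer names `MyMLP` and `MyModelForCausalLM` are hypothetical; attention, the LM head, and `compute_logits` are omitted for brevity; and exact constructor and `forward()` signatures can differ between vLLM versions, so compare against the Implementing a Basic Model guide for your installed version.

```python
# Minimal sketch of a custom model adapted for vLLM tensor parallelism.
# Assumes a recent vLLM version with the VllmConfig-based constructor.
from typing import Iterable, Optional, Tuple

import torch
from torch import nn

from vllm import ModelRegistry
from vllm.config import VllmConfig
from vllm.model_executor.layers.linear import (ColumnParallelLinear,
                                               RowParallelLinear)
from vllm.model_executor.layers.vocab_parallel_embedding import (
    VocabParallelEmbedding)
from vllm.model_executor.model_loader.weight_utils import default_weight_loader


class MyMLP(nn.Module):
    """Every submodule takes `prefix` and forwards an extended prefix down."""

    def __init__(self, hidden_size: int, intermediate_size: int,
                 prefix: str = ""):
        super().__init__()
        # Column-parallel: the output dimension is sharded across GPUs.
        self.up_proj = ColumnParallelLinear(hidden_size, intermediate_size,
                                            bias=False,
                                            prefix=f"{prefix}.up_proj")
        # Row-parallel: the input dimension is sharded; outputs are all-reduced.
        self.down_proj = RowParallelLinear(intermediate_size, hidden_size,
                                           bias=False,
                                           prefix=f"{prefix}.down_proj")
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x, _ = self.up_proj(x)    # vLLM parallel linears return (output, bias)
        x = self.act(x)
        x, _ = self.down_proj(x)
        return x


class MyModelForCausalLM(nn.Module):

    def __init__(self, *, vllm_config: VllmConfig, prefix: str = ""):
        super().__init__()
        config = vllm_config.model_config.hf_config
        # Vocab-parallel embedding: the vocabulary is sharded across GPUs,
        # so vocab_size must be divisible by the tensor parallel size.
        self.embed_tokens = VocabParallelEmbedding(
            config.vocab_size, config.hidden_size,
            prefix=f"{prefix}.embed_tokens")
        self.layers = nn.ModuleList([
            MyMLP(config.hidden_size, config.intermediate_size,
                  prefix=f"{prefix}.layers.{i}")
            for i in range(config.num_hidden_layers)
        ])

    def get_input_embeddings(self, input_ids: torch.Tensor) -> torch.Tensor:
        return self.embed_tokens(input_ids)

    def forward(self, input_ids: torch.Tensor, positions: torch.Tensor,
                inputs_embeds: Optional[torch.Tensor] = None) -> torch.Tensor:
        # vLLM passes flattened (1-D) input_ids/positions; no training logic.
        # Depending on the vLLM version, forward() may take extra arguments
        # (e.g. intermediate_tensors); follow the guide for your version.
        hidden_states = (inputs_embeds if inputs_embeds is not None
                         else self.get_input_embeddings(input_ids))
        for layer in self.layers:
            hidden_states = layer(hidden_states)
        return hidden_states

    def load_weights(self, weights: Iterable[Tuple[str, torch.Tensor]]):
        # Map HuggingFace checkpoint names onto (possibly sharded) parameters.
        params = dict(self.named_parameters())
        for name, loaded_weight in weights:
            if name not in params:
                continue  # weights for modules this sketch does not define
            param = params[name]
            weight_loader = getattr(param, "weight_loader",
                                    default_weight_loader)
            weight_loader(param, loaded_weight)


# Step 6: register the model so vLLM can resolve the architecture name found
# in the checkpoint's config.json (out-of-tree registration).
ModelRegistry.register_model("MyModelForCausalLM", MyModelForCausalLM)
```

Once registered, the model can be loaded like a built-in one, including with `tensor_parallel_size` set to the number of GPUs.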
For a full example and code snippets, see the Implementing a Basic Model documentation.