@RunLLM Could you explain multi-server model serving to me?
vLLM does not natively support serving multiple models from a single server instance. To serve multiple models, you must run separate vLLM server instances (each on its own port and GPU(s)), then use an external load balancer or proxy to route requests to the appropriate server based on the model requested. This approach is recommended in the official FAQ and GitHub issues, and is currently the only supported method for multi-model serving with vLLM[^1^][^2^][^3^].
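For illustration, a rough sketch of that pattern on a single host with two GPUs (the model names, GPU indices, and ports below are placeholders, not taken from the sources):

```bash
# Sketch only: one vLLM OpenAI-compatible server per model, each pinned to its
# own GPU and port (model names, GPU indices, and ports are examples).
CUDA_VISIBLE_DEVICES=0 vllm serve Qwen/Qwen2.5-7B-Instruct \
  --served-model-name model-a --port 8001 &
CUDA_VISIBLE_DEVICES=1 vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --served-model-name model-b --port 8002 &
# A router in front (nginx, LiteLLM, etc.) forwards each request to the backend
# whose served model name matches the "model" field of the request body.
```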
If you need to serve multiple replicas of the same model for higher concurrency, you can also launch multiple vLLM servers and use a load balancer (e.g., nginx, LiteLLM, or the vLLM production stack) to distribute requests among them[^4^][^5^]. Would you like more detail on how to set up such a system?
Sources:
- vLLM FAQ
- GitHub Issue: Does vLLM support co-hosting multiple models on single server?
- GitHub Issue: How to deploy multiple models in openai api server and specify different gpu for each model?
- GitHub Issue: Can vllm serving clients by using multiple model instances?
- GitHub Issue: How to use vllm serve in ddp mode? (single node multiple gpus)
Hint: Mention @RunLLM in the post for followups.
@RunLLM My question is about serving a single model across multiple servers.
To serve a single model across multiple servers (multi-replica serving), run multiple vLLM instances—each on a different GPU or node—and use a load balancer (like nginx or LiteLLM) to distribute requests among them. vLLM does not natively coordinate state between these replicas; each instance loads the same model independently and exposes its own API endpoint[^1^][^2^][^3^][^4^].
You can launch each vLLM server on a different port/GPU, then configure a frontend (e.g., nginx) to route user requests to the available servers for higher concurrency and throughput[^1^][^2^]. Would you like more detail on the setup or example configurations?
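As a minimal sketch of that setup, assuming nginx as the load balancer (the model name, ports, and file paths are illustrative only):

```bash
# Two replicas of the same model, one per GPU, each on its own port.
MODEL=Qwen/Qwen2.5-7B-Instruct   # example model; substitute your own
CUDA_VISIBLE_DEVICES=0 vllm serve "$MODEL" --served-model-name my-model --port 8001 &
CUDA_VISIBLE_DEVICES=1 vllm serve "$MODEL" --served-model-name my-model --port 8002 &

# Minimal nginx config that spreads requests across the two replicas.
cat > vllm_lb.conf <<'EOF'
upstream vllm_backend {
    least_conn;
    server 127.0.0.1:8001;
    server 127.0.0.1:8002;
}
server {
    listen 8000;
    location / {
        proxy_pass http://vllm_backend;
    }
}
EOF
# Load vllm_lb.conf into nginx (e.g., mount it under /etc/nginx/conf.d/ in the
# official nginx image); clients then use port 8000 as if it were one server.
```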
Sources:
- GitHub Issue: Can vllm serving clients by using multiple model instances?
- GitHub Issue: How to deploy vllm model with multi-replicas
- GitHub Issue: How to use vllm serve in ddp mode? (single node multiple gpus)
- vLLM Nginx Deployment Guide
Hint: Mention @RunLLM in the post for followups.
@RunLLM, I want to serve a large model using multiple servers, each with its own GPUs.
To serve a large model that does not fit on a single server, vLLM supports distributed inference using tensor parallelism and pipeline parallelism across multiple nodes (servers), each with GPUs. You must ensure all nodes have the same environment and model files. For multi-node deployments you typically use Ray as the distributed backend (multiprocessing only works within a single node), and set --tensor-parallel-size to the number of GPUs per node and --pipeline-parallel-size to the number of nodes. For example, with 2 nodes (8 GPUs each), use --tensor-parallel-size 8 --pipeline-parallel-size 2[^1^][^2^][^3^][^4^].
You must start a Ray cluster or use the provided helper scripts to coordinate the nodes, then launch vLLM with the appropriate parallelism flags. Efficient networking (e.g., InfiniBand) is recommended for performance[^2^][^3^]. Would you like a step-by-step setup guide?
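To make the sizing rule concrete (a sketch only; it assumes the product of the two flags must equal the total GPU count, and /path/to/model is a placeholder):

```bash
# 2 nodes x 8 GPUs = 16 GPUs total, so tensor_parallel_size * pipeline_parallel_size
# should come out to 16.
# Option A: tensor parallel within each node, pipeline parallel across the nodes.
vllm serve /path/to/model --tensor-parallel-size 8 --pipeline-parallel-size 2
# Option B: tensor parallel across all 16 GPUs with no pipeline parallelism;
# which option is faster depends largely on the inter-node interconnect.
vllm serve /path/to/model --tensor-parallel-size 16
```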
Sources:
- Parallelism and Scaling - vLLM Docs
- Running vLLM on Multiple Nodes
- Parallelism and Scaling (GitHub)
- Distributed Inference and Serving
Hint: Mention @RunLLM in the post for followups.
@RunLLM Could you guide me through the setup step by step? Currently I am using the Docker Compose file below to serve a model on a single server with a single GPU.
services:
  qwen3-instruct:
    image: vllm/vllm-openai:v0.15.0
    container_name: qwen3-instruct
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    volumes:
      - /home/ranjith/.cache/huggingface:/root/.cache/huggingface
      - /dev/null:/etc/ld.so.conf.d/00-cuda-compat.conf
    ports:
      - "9122:8000"
    ipc: host
    command: >
      /root/.cache/huggingface/hub/models--Qwen--Qwen3-30B-A3B-Instruct-2507-FP8/snapshots/5a5a776300a41aaa681dd7ff0106608ef2bc90db
      --served-model-name ranjith-model
      --gpu-memory-utilization 0.8
      --max_model_len 16384
      --max_num_batched_tokens 16384
      --enable-auto-tool-choice
      --tool-call-parser hermes
      --enforce-eager
To serve a large model across multiple servers (multi-node, multi-GPU) with vLLM, follow these steps:
1. Prepare the environment on all nodes:
- Ensure each node has the same Docker image, Python environment, and the model files at the same path (use a shared filesystem or copy the model to each node; a file-sync sketch is included below).
- Recommended: use Docker for consistency[^1^][^2^].
2. Start a Ray cluster (recommended backend):
- On the head node, run:
bash run_cluster.sh \
vllm/vllm-openai \
<HEAD_NODE_IP> \
--head \
/path/to/huggingface/home/on/this/node \
-e VLLM_HOST_IP=<HEAD_NODE_IP>
- On each worker node, run:
bash run_cluster.sh \
vllm/vllm-openai \
<HEAD_NODE_IP> \
--worker \
/path/to/huggingface/home/on/this/node \
-e VLLM_HOST_IP=<WORKER_NODE_IP>
- Keep these shells open; closing them will terminate the cluster[^1^][^2^][^3^].
3. Launch vLLM distributed serving:
- Enter a container on any node (e.g., docker exec -it <container_name> /bin/bash).
- Run vLLM with parallelism flags. For example, for 2 nodes with 8 GPUs each:
vllm serve /path/to/model \
  --tensor-parallel-size 8 \
  --pipeline-parallel-size 2 \
  --distributed-executor-backend ray
- Adjust --tensor-parallel-size and --pipeline-parallel-size to match your cluster[^1^][^2^][^3^]; a quick sanity-check sketch is included below.
Would you like a Docker Compose example or more details on networking and troubleshooting?
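For step 1, one way to get the model files onto every node (a sketch reusing the Hugging Face cache path from your compose file; the worker hostnames are placeholders):

```bash
# Copy the local Hugging Face cache to each worker node so the model snapshot
# exists at the same path everywhere (a shared filesystem works just as well).
rsync -a /home/ranjith/.cache/huggingface/ worker1:/home/ranjith/.cache/huggingface/
rsync -a /home/ranjith/.cache/huggingface/ worker2:/home/ranjith/.cache/huggingface/
```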
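And once the Ray cluster and the vLLM server are up, a quick sanity check could look like this (the head-node address and model path are placeholders; port 8000 is vLLM's default):

```bash
# Inside the container on the head node: confirm Ray sees every node and GPU.
ray status

# From any machine that can reach the head node: send a test request to the
# OpenAI-compatible API.
curl http://<HEAD_NODE_IP>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/path/to/model",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32
      }'
```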
Sources:
Hint: Mention @RunLLM in the post for followups.