To serve a large model across multiple servers (multi-node, multi-GPU) with vLLM, follow these steps:
1. Prepare the environment on all nodes:
- Ensure each node has the same Docker image, the same Python environment, and the model files at the same path (use a shared filesystem or copy the model to each node).
- Recommended: use Docker for consistency[^1^][^2^]
2. Start a Ray cluster (recommended backend):
- On the head node, run:
      bash run_cluster.sh \
        vllm/vllm-openai \
        <HEAD_NODE_IP> \
        --head \
        /path/to/huggingface/home/on/this/node \
        -e VLLM_HOST_IP=<HEAD_NODE_IP>
- On each worker node, run:
      bash run_cluster.sh \
        vllm/vllm-openai \
        <HEAD_NODE_IP> \
        --worker \
        /path/to/huggingface/home/on/this/node \
        -e VLLM_HOST_IP=<WORKER_NODE_IP>
- Keep these shells open; closing them will terminate the cluster[^1^][^2^][^3^].
3. Launch vLLM distributed serving:
- Enter a container on any node (e.g., docker exec -it <container_name> /bin/bash).
- Run vLLM with parallelism flags. For example, for 2 nodes with 8 GPUs each:
      vllm serve /path/to/model \
        --tensor-parallel-size 8 \
        --pipeline-parallel-size 2 \
        --distributed-executor-backend ray
- Adjust --tensor-parallel-size and --pipeline-parallel-size to match your cluster[^1^][^2^][^3^].
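
To make the flag adjustment concrete, here is a minimal sketch that derives the two sizes from the cluster shape. It assumes every node has the same number of GPUs and follows the pattern in the example above (tensor parallel size = GPUs per node, pipeline parallel size = number of nodes). The helper names `is_pos_int` and `build_vllm_cmd` are hypothetical, not part of vLLM; the function only prints the command so you can review it before running it inside the container.

```shell
#!/bin/sh
# Hypothetical helpers (not part of vLLM): build a `vllm serve` command
# from the cluster shape. Assumes all nodes have the same GPU count.

# Succeeds only if $1 is a positive integer without leading zeros.
is_pos_int() {
  case "$1" in
    ''|*[!0-9]*|0*) return 1 ;;
    *) return 0 ;;
  esac
}

# Usage: build_vllm_cmd <model_path> <num_nodes> <gpus_per_node>
build_vllm_cmd() {
  if [ "$#" -ne 3 ] || ! is_pos_int "$2" || ! is_pos_int "$3"; then
    echo "usage: build_vllm_cmd <model_path> <num_nodes> <gpus_per_node>" >&2
    return 1
  fi
  # Tensor parallelism covers the GPUs inside one node;
  # pipeline parallelism covers the nodes.
  printf 'vllm serve %s --tensor-parallel-size %s --pipeline-parallel-size %s --distributed-executor-backend ray\n' \
    "$1" "$3" "$2"
}

# Example: 2 nodes with 8 GPUs each.
build_vllm_cmd /path/to/model 2 8
```

For the 2-node, 8-GPU case this prints the same command as in step 3. Other splits can also be valid for a given model; treat the output as a starting point, not a rule.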
Would you like a Docker Compose example or more details on networking and troubleshooting?