Understanding Multi-Node Parallelization

To serve a large model across multiple servers (multi-node, multi-GPU) with vLLM, follow these steps:

1. Prepare the environment on all nodes:

  • Ensure every node uses the same Docker image and Python environment, and has the model files at the same path (use a shared filesystem or copy the model to each node).
  • Recommended: use Docker for consistency[^1^][^2^]
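A quick way to confirm the "same path on every node" requirement is to run a small check on each node before starting the cluster. The following is a minimal sketch, not part of vLLM: `check_model_dir` is a hypothetical helper name, and it assumes the model is a Hugging Face style directory that contains a `config.json`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a vLLM command): succeed only if the model
# directory exists and contains config.json, which Hugging Face model
# directories are assumed to ship.
check_model_dir() {
    local model_dir="$1"
    if [ ! -d "$model_dir" ]; then
        echo "missing directory: $model_dir" >&2
        return 1
    fi
    if [ ! -f "$model_dir/config.json" ]; then
        echo "no config.json in: $model_dir" >&2
        return 1
    fi
    echo "ok: $model_dir"
}

# Example (run on every node with the same path):
#   check_model_dir /path/to/model
```

If the check fails on any node, fix the path or copy the model there before moving on to the cluster setup.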

2. Start a Ray cluster (recommended backend):

  • On the head node, run:
      bash run_cluster.sh \
          vllm/vllm-openai \
          <HEAD_NODE_IP> \
          --head \
          /path/to/huggingface/home/on/this/node \
          -e VLLM_HOST_IP=<HEAD_NODE_IP>
  • On each worker node, run:
      bash run_cluster.sh \
          vllm/vllm-openai \
          <HEAD_NODE_IP> \
          --worker \
          /path/to/huggingface/home/on/this/node \
          -e VLLM_HOST_IP=<WORKER_NODE_IP>
  • Keep these shells open; closing them will terminate the cluster[^1^][^2^][^3^].
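The head and worker commands above differ only in the role flag and in the value of VLLM_HOST_IP, which is always the IP of the node the command runs on. As a rough sketch of that pattern, the hypothetical helper below (`cluster_cmd` is an illustrative name, not part of vLLM) prints the command for a given node so it can be reviewed before running. The IP addresses and the cache path in the example are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (do not run) the run_cluster.sh invocation
# for one node. Arguments: role, head node IP, this node's IP, and the
# Hugging Face home directory on this node.
cluster_cmd() {
    local role="$1" head_ip="$2" this_ip="$3" hf_home="$4"
    case "$role" in
        head|worker) ;;
        *)
            echo "role must be 'head' or 'worker', got: '$role'" >&2
            return 1
            ;;
    esac
    echo "bash run_cluster.sh vllm/vllm-openai ${head_ip} --${role} ${hf_home} -e VLLM_HOST_IP=${this_ip}"
}

# Example with placeholder addresses:
#   cluster_cmd head   10.0.0.1 10.0.0.1 /data/hf
#   cluster_cmd worker 10.0.0.1 10.0.0.2 /data/hf
```

Printing the command instead of executing it keeps the sketch safe to try anywhere; paste the output into a shell on the matching node once it looks right.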

3. Launch vLLM distributed serving:

  • Enter a container on any node (e.g., docker exec -it <container_name> /bin/bash)
  • Run vLLM with parallelism flags. For example, for 2 nodes with 8 GPUs each:
      vllm serve /path/to/model \
          --tensor-parallel-size 8 \
          --pipeline-parallel-size 2 \
          --distributed-executor-backend ray
  • Adjust both sizes to match your cluster: --tensor-parallel-size is typically the number of GPUs per node, and --pipeline-parallel-size is the number of nodes[^1^][^2^][^3^].
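The 2-node, 8-GPU example follows a common convention: tensor parallelism spans the GPUs inside one node, and pipeline parallelism spans the nodes. A small sketch of that mapping, assuming a homogeneous cluster where every node has the same number of GPUs (`parallel_flags` is a hypothetical helper name, not a vLLM command):

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive the two parallelism flags from the cluster
# shape. Assumes every node has the same number of GPUs.
# Arguments: number of nodes, GPUs per node.
parallel_flags() {
    local num_nodes="$1" gpus_per_node="$2" value
    for value in "$num_nodes" "$gpus_per_node"; do
        case "$value" in
            ''|0|*[!0-9]*)
                echo "expected positive integers, got: '$value'" >&2
                return 1
                ;;
        esac
    done
    # Tensor parallel within a node, pipeline parallel across nodes.
    echo "--tensor-parallel-size ${gpus_per_node} --pipeline-parallel-size ${num_nodes}"
}

# Example for 2 nodes with 8 GPUs each:
#   vllm serve /path/to/model $(parallel_flags 2 8) --distributed-executor-backend ray
```

The product of the two sizes should equal the total number of GPUs in the cluster (16 in this example).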

Would you like a Docker Compose example or more details on networking and troubleshooting?
