Running vLLM multi-node data parallel with SLURM

Hello :slight_smile:

I understand the preferred method for multi-node inference is using a Ray server. I’ve been trying to manage a training run with vLLM as an inference server on a SLURM cluster.

I’m aiming for something like your Data Parallel Deployment tutorial, modifying this setup script into an sbatch script and using something like this for my vLLM server.

I haven’t had much luck yet and seem to be running into resource conflicts between the Ray server and the vLLM server, which are run as separate srun commands. I’m not quite sure how to avoid these conflicts whilst making sure they can still talk to each other.

Just wondering if there are any existing recipes/approaches for this setup?

Any pointers would be greatly appreciated :victory_hand:

The recommended approach for multi-node vLLM inference on SLURM is to launch a Ray cluster across your nodes and then start vLLM inside that Ray cluster. Resource conflicts typically occur when Ray and vLLM are started as separate jobs (or separate srun steps) and both try to claim the same GPUs. Best practice is to allocate all nodes/GPUs in a single SLURM job, bring up one Ray cluster across that allocation, and then launch vLLM from within it, so Ray manages all GPU resources and vLLM does not compete for them separately. See the official distributed serving guide and parallelism scaling guide for details.
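As a minimal sketch of that single-allocation pattern (node and GPU counts, the port, and any module/venv setup are assumptions to adapt to your cluster), the idea is that the Ray head and workers are started as overlapping srun steps inside one sbatch allocation, so nothing else has to compete with them for GPUs:

```bash
#!/bin/bash
#SBATCH --job-name=vllm-ray
#SBATCH --nodes=2                 # example values -- adjust to your cluster
#SBATCH --gres=gpu:8              # GPUs per node (assumption)
#SBATCH --exclusive

# All nodes/GPUs belong to this one job; everything below runs inside it.
nodes=($(scontrol show hostnames "$SLURM_JOB_NODELIST"))
head_node=${nodes[0]}
head_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
ray_port=6379                     # arbitrary choice

# Start the Ray head on the first node. --overlap lets later srun steps
# share the allocation instead of blocking on or stealing its resources.
srun --nodes=1 --ntasks=1 -w "$head_node" --overlap \
    ray start --head --node-ip-address="$head_ip" --port="$ray_port" \
    --num-gpus 8 --block &
sleep 10

# Start a Ray worker on every remaining node, still inside the same job.
for node in "${nodes[@]:1}"; do
    srun --nodes=1 --ntasks=1 -w "$node" --overlap \
        ray start --address="$head_ip:$ray_port" --num-gpus 8 --block &
done
sleep 10
```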

A typical recipe is:

1. Allocate all nodes/GPUs in one SLURM job.
2. Start the Ray head on one node and Ray workers on the others, all within that same job (as in the sketch above).
3. Launch vLLM from within the Ray cluster, not as a separate SLURM job (see the sketch below).

This avoids resource conflicts. Also, ensure all nodes have the same model files and environment. For a working example, see the run_cluster.sh script and related SLURM usage in this issue. Would you like a step-by-step example SLURM script?
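A rough sketch of that last step, continuing the script above (the model name, parallel sizes, and port are placeholders, and the exact flags should be checked against the distributed serving and data parallel docs for your vLLM version):

```bash
# Launch vLLM from the head node so it attaches to the existing Ray cluster
# instead of spawning its own workers and competing for GPUs.
srun --nodes=1 --ntasks=1 -w "$head_node" --overlap \
    vllm serve meta-llama/Llama-3.1-8B-Instruct \
        --distributed-executor-backend ray \
        --tensor-parallel-size 8 \
        --pipeline-parallel-size 2 \
        --port 8000
```

For data parallel specifically, the same pattern applies: the Data Parallel Deployment tutorial’s flags (e.g. `--data-parallel-size`) would replace the tensor/pipeline parallel flags above, but verify the flag names against your installed vLLM version.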
