Connecting to a Ray Cluster and Launching vLLM with TP=8 and PP=2 for DeepSeek R1 on Kubernetes

Hi vLLM community,

I’m deploying vLLM to serve DeepSeek R1 on a Kubernetes cluster with two nodes of NVIDIA H100 GPUs. I have two main questions:

  1. How can I connect vLLM to an existing Ray cluster deployed across these two nodes?
    The Ray cluster is already up and running with head and worker nodes, and I want to ensure my vLLM serving pods properly join and communicate with this cluster.
  2. What is the best practice to launch vLLM with tensor parallel size = 8 and pipeline parallel size = 2 in Kubernetes?
I’m aiming to configure multi-node parallelism for DeepSeek R1 workloads, possibly using KServe’s ServingRuntime or a custom deployment spec.

If anyone has example manifests, environment setups, or scripts for DeepSeek R1 with TP=8 and PP=2, I’d be very grateful!

To connect vLLM to an existing Ray cluster on Kubernetes, make sure your vLLM pods use the same Python environment, vLLM version, and model path as the Ray nodes, and set the VLLM_HOST_IP environment variable to each pod’s own IP so Ray peers can reach it. Join each pod to the cluster with `ray start --address=<head-ip>:6379` (or via the RAY_ADDRESS environment variable), then start vLLM with the --distributed-executor-backend ray flag; it will attach to the running Ray cluster it finds. Only one vllm serve process should be started (typically on the head node), and it will schedule workers across all GPUs in the Ray cluster for distributed inference. See the distributed serving docs and the Kubernetes LWS example.
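As a concrete sketch of those steps, assuming the head is reachable at `ray-head:6379` (a placeholder Service name, substitute your own) and that each pod already has vLLM and the model weights available:

```shell
# On every pod: advertise this pod's own IP so Ray peers can reach it
export VLLM_HOST_IP="$(hostname -i)"

# On worker pods only: join the existing Ray cluster at the head's address
ray start --address=ray-head:6379 --block

# On exactly one pod (typically the Ray head): start the single vLLM server.
# It attaches to the running Ray cluster and uses all registered GPUs.
vllm serve deepseek-ai/DeepSeek-R1 --distributed-executor-backend ray
```

The key point is that vLLM itself is launched once; the other pods contribute GPUs purely by being Ray workers.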

For TP=8 and PP=2 across two 8-GPU nodes, follow a deployment manifest similar to the LWS example and pass --tensor-parallel-size 8 --pipeline-parallel-size 2 to the vLLM command; tensor parallelism then stays within each node while pipeline parallelism spans the two nodes. Each pod should mount /dev/shm (a memory-backed emptyDir), add the IPC_LOCK capability, and have access to the model path. Example YAML and setup are provided in the LWS documentation; if using KServe, adapt the container spec of your ServingRuntime to match these requirements. Ensure the model is accessible on both nodes and that fast cross-node networking (preferably InfiniBand/RDMA) is configured, since pipeline parallelism exchanges activations between nodes on every step.
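A condensed LeaderWorkerSet sketch along those lines; the resource names and image tag are illustrative, and the startup ordering is simplified (the vLLM LWS docs ship a helper script that waits for workers to join, which is elided here):

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm-deepseek-r1
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2                          # one leader pod + one worker pod
    leaderTemplate:
      spec:
        containers:
        - name: vllm-leader
          image: vllm/vllm-openai:latest
          command: ["sh", "-c"]
          args:
          - ray start --head --port=6379 &&
            vllm serve deepseek-ai/DeepSeek-R1
            --distributed-executor-backend ray
            --tensor-parallel-size 8
            --pipeline-parallel-size 2
          resources:
            limits:
              nvidia.com/gpu: "8"
          securityContext:
            capabilities:
              add: ["IPC_LOCK"]
          volumeMounts:
          - name: dshm
            mountPath: /dev/shm
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory           # memory-backed /dev/shm for NCCL
    workerTemplate:
      spec:
        containers:
        - name: vllm-worker
          image: vllm/vllm-openai:latest
          command: ["sh", "-c"]
          args:
          - ray start --address=$(LWS_LEADER_ADDRESS):6379 --block
          resources:
            limits:
              nvidia.com/gpu: "8"
          securityContext:
            capabilities:
              add: ["IPC_LOCK"]
          volumeMounts:
          - name: dshm
            mountPath: /dev/shm
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
```

In practice the leader must wait until the worker has registered its GPUs before vllm serve starts, which is why the official example wraps the startup in a script; LWS injects LWS_LEADER_ADDRESS into worker pods automatically.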
