Connecting to a Ray Cluster and Launching vLLM with TP=8 and PP=2 for DeepSeek R1 on Kubernetes

Hi vLLM community,

I’m deploying vLLM to serve DeepSeek R1 on a Kubernetes cluster with two nodes of NVIDIA H100 GPUs. I have two main questions:

  1. How can I connect vLLM to an existing Ray cluster deployed across these two nodes?
    The Ray cluster is already up and running with head and worker nodes, and I want to ensure my vLLM serving pods properly join and communicate with this cluster.
  2. What is the best practice to launch vLLM with tensor parallel size = 8 and pipeline parallel size = 2 in Kubernetes?
I’m aiming to configure multi-node parallelism for DeepSeek R1 workloads, possibly using KServe’s ServingRuntime or a custom deployment spec.

If anyone has example manifests, environment setups, or scripts for DeepSeek R1 with TP=8 and PP=2, I’d be very grateful!

To connect vLLM to an existing Ray cluster on Kubernetes, make sure your vLLM pods use the same Python environment, vLLM version, and model path as the Ray nodes, and set the VLLM_HOST_IP environment variable to each pod’s own IP so Ray peers can reach it. Join each pod to the cluster with `ray start --address=<head-ip>:6379` (or via the RAY_ADDRESS environment variable), then start vLLM with the --distributed-executor-backend ray flag; it will attach to the running Ray cluster it finds. Only one vllm serve process should be started (typically on the head node), and it will schedule workers across all GPUs in the Ray cluster for distributed inference. See the distributed serving docs and the Kubernetes LWS example.
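As a concrete sketch of those steps, assuming the head is reachable at `ray-head:6379` (a placeholder Service name, substitute your own) and that each pod already has vLLM and the model weights available:

```shell
# On every pod: advertise this pod's own IP so Ray peers can reach it
export VLLM_HOST_IP="$(hostname -i)"

# On worker pods only: join the existing Ray cluster at the head's address
ray start --address=ray-head:6379 --block

# On exactly one pod (typically the Ray head): start the single vLLM server.
# It attaches to the running Ray cluster and uses all registered GPUs.
vllm serve deepseek-ai/DeepSeek-R1 --distributed-executor-backend ray
```

The key point is that vLLM itself is launched once; the other pods contribute GPUs purely by being Ray workers.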

For TP=8 and PP=2 across two 8-GPU nodes, follow a deployment manifest similar to the LWS example and pass --tensor-parallel-size 8 --pipeline-parallel-size 2 to the vLLM command; tensor parallelism then stays within each node while pipeline parallelism spans the two nodes. Each pod should mount /dev/shm (a memory-backed emptyDir), add the IPC_LOCK capability, and have access to the model path. Example YAML and setup are provided in the LWS documentation; if using KServe, adapt the container spec of your ServingRuntime to match these requirements. Ensure the model is accessible on both nodes and that fast cross-node networking (preferably InfiniBand/RDMA) is configured, since pipeline parallelism exchanges activations between nodes on every step.
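A condensed LeaderWorkerSet sketch along those lines; the resource names and image tag are illustrative, and the startup ordering is simplified (the vLLM LWS docs ship a helper script that waits for workers to join, which is elided here):

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm-deepseek-r1
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2                          # one leader pod + one worker pod
    leaderTemplate:
      spec:
        containers:
        - name: vllm-leader
          image: vllm/vllm-openai:latest
          command: ["sh", "-c"]
          args:
          - ray start --head --port=6379 &&
            vllm serve deepseek-ai/DeepSeek-R1
            --distributed-executor-backend ray
            --tensor-parallel-size 8
            --pipeline-parallel-size 2
          resources:
            limits:
              nvidia.com/gpu: "8"
          securityContext:
            capabilities:
              add: ["IPC_LOCK"]
          volumeMounts:
          - name: dshm
            mountPath: /dev/shm
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory           # memory-backed /dev/shm for NCCL
    workerTemplate:
      spec:
        containers:
        - name: vllm-worker
          image: vllm/vllm-openai:latest
          command: ["sh", "-c"]
          args:
          - ray start --address=$(LWS_LEADER_ADDRESS):6379 --block
          resources:
            limits:
              nvidia.com/gpu: "8"
          securityContext:
            capabilities:
              add: ["IPC_LOCK"]
          volumeMounts:
          - name: dshm
            mountPath: /dev/shm
        volumes:
        - name: dshm
          emptyDir:
            medium: Memory
```

In practice the leader must wait until the worker has registered its GPUs before vllm serve starts, which is why the official example wraps the startup in a script; LWS injects LWS_LEADER_ADDRESS into worker pods automatically.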
