I’m trying to set up a Kubernetes environment to serve a smaller ~7B model for inference. The issue is that my resources are very limited: I only have access to two nodes with 1 GPU each, and the model weights alone take up roughly a full node’s VRAM, so I need to pull in the second node’s GPU to get a working solution.
Could anyone provide some insight into what would need to change in this script to make offline inference work in K8s given my resource limitations, and what kind of deployment it would need to be paired with in order to function properly? Any help would be greatly appreciated, and if there are limitations to this approach that I’m not seeing, please let me know. Thanks!
To run a ~7B model across two single-GPU Kubernetes nodes, you can use vLLM’s distributed serving with Ray, as shown in the run_cluster.sh example. You’d need to adapt this script to launch Ray head and worker containers on your two nodes, ensuring both have the same model path and environment. For offline inference, after the Ray cluster is up, enter any container and use the vLLM Python API as if all GPUs are on one node—vLLM will leverage both GPUs transparently. The main limitation is that network speed between nodes (ideally InfiniBand) can significantly affect performance, and both nodes must have the model available at the same path (docs).
For Kubernetes deployment, you can use a custom YAML or the LWS framework to orchestrate multi-node distributed inference. The deployment should mount the model directory, set up /dev/shm for shared memory, and ensure Ray and vLLM are installed in the containers. After the cluster is running, you can run your offline inference script inside the head node’s container. Note: vLLM’s distributed mode is designed for serving, but offline inference works if you use the Python API inside the Ray cluster (docs). There are no official scripts for distributed offline inference, so you must manually ensure the Ray cluster is up before running your code.
Would you like a step-by-step example YAML or more detail on adapting the script for offline inference?
Certainly! Here’s a step-by-step outline for a minimal Kubernetes deployment using two nodes (each with 1 GPU) for distributed vLLM inference, plus guidance on adapting the run_cluster.sh script for offline inference.
1. Example Kubernetes YAML (using LWS for orchestration):
This YAML uses LWS (LeaderWorkerSet) to orchestrate the Ray and vLLM containers; both the leader and the worker mount the model directory and shared memory.
In the sketch below, replace /path/to/model with the actual path to your model on both nodes.
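Here is a minimal sketch of what such a manifest could look like. It assumes the LWS CRD is installed in the cluster, uses the public vllm/vllm-openai image as a stand-in for any image that has vLLM and Ray installed, and relies on LWS injecting the LWS_LEADER_ADDRESS environment variable; the names (vllm-offline, ray-head, ray-worker), ports, and paths are placeholders to verify against the current LWS and vLLM documentation, not a drop-in manifest.

```yaml
# Sketch only: assumes the LeaderWorkerSet (LWS) CRD is installed and that your
# LWS version injects LWS_LEADER_ADDRESS; adjust image, ports, and paths first.
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm-offline
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2                              # 1 leader + 1 worker = your two GPU nodes
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      spec:
        containers:
          - name: ray-head
            image: vllm/vllm-openai:latest   # any image with vLLM + Ray installed
            command: ["/bin/sh", "-c"]
            # Start the Ray head and block so the container stays alive;
            # you later `kubectl exec` into this pod to run offline inference.
            args: ["ray start --head --port=6379 --block"]
            resources:
              limits:
                nvidia.com/gpu: "1"
            volumeMounts:
              - name: model
                mountPath: /models       # same path the Python script uses
              - name: dshm
                mountPath: /dev/shm
        volumes:
          - name: model
            hostPath:
              path: /path/to/model       # model must exist here on the node
          - name: dshm
            emptyDir:
              medium: Memory             # shared memory for Ray/vLLM workers
    workerTemplate:
      spec:
        containers:
          - name: ray-worker
            image: vllm/vllm-openai:latest
            command: ["/bin/sh", "-c"]
            # Join the Ray cluster started by the leader pod.
            args: ["ray start --address=$(LWS_LEADER_ADDRESS):6379 --block"]
            resources:
              limits:
                nvidia.com/gpu: "1"
            volumeMounts:
              - name: model
                mountPath: /models
              - name: dshm
                mountPath: /dev/shm
        volumes:
          - name: model
            hostPath:
              path: /path/to/model
          - name: dshm
            emptyDir:
              medium: Memory
```

Once both pods are Running, you can `kubectl exec` into the leader container and run the script from step 3. Because the model is mounted via hostPath, it genuinely has to exist at the same path on both nodes, which is also why the Python example below points at /models (the in-container mount path).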
2. Adapting run_cluster.sh for Offline Inference:
The script’s main job is to start Ray head/worker nodes and set up the environment.
For offline inference, after the Ray cluster is up, exec into the leader pod/container and run your Python script using the vLLM API (e.g., python my_offline_inference.py).
Your script should use the LLM class with tensor_parallel_size=2 and point to the local model path.
Ensure the model is present at the same path in both containers.
3. Example Python for Offline Inference:
```python
from vllm import LLM, SamplingParams

# tensor_parallel_size=2 shards the model across the two GPUs (one per node);
# "/models" is the in-container mount path from the deployment above.
llm = LLM(model="/models", tensor_parallel_size=2)

prompts = ["Hello, world!", "What is AI?"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```
Limitations:
Network speed between nodes is critical for performance (docs).
There is no official vLLM script for distributed offline inference; you must manually run your script after the Ray cluster is up.
Let me know if you need a more detailed breakdown or help with a specific part!