I’m trying to set up a Kubernetes environment to serve a smaller ~7B model for inference. The issue is that my resources are very limited: I only have access to two nodes with 1 GPU each, and the model weights alone take up roughly a full node’s VRAM, so I need to pull in the second node’s GPU to get a working solution.
Could anyone provide some insight into what would need to change in this script to make offline inference work in K8s given my resource limitations, and what kind of deployment it would need to be paired with in order to function properly? Any help would be greatly appreciated, and if there are limitations to this approach that I’m not seeing, please let me know. Thanks!
To run a ~7B model across two single-GPU Kubernetes nodes, you can use vLLM’s distributed serving with Ray, as shown in the run_cluster.sh example. You’d need to adapt this script to launch Ray head and worker containers on your two nodes, ensuring both have the same model path and environment. For offline inference, after the Ray cluster is up, enter any container and use the vLLM Python API as if all GPUs are on one node—vLLM will leverage both GPUs transparently. The main limitation is that network speed between nodes (ideally InfiniBand) can significantly affect performance, and both nodes must have the model available at the same path (docs).
For Kubernetes deployment, you can use a custom YAML or the LWS framework to orchestrate multi-node distributed inference. The deployment should mount the model directory, set up /dev/shm for shared memory, and ensure Ray and vLLM are installed in the containers. After the cluster is running, you can run your offline inference script inside the head node’s container. Note: vLLM’s distributed mode is designed for serving, but offline inference works if you use the Python API inside the Ray cluster (docs). There are no official scripts for distributed offline inference, so you must manually ensure the Ray cluster is up before running your code.
Would you like a step-by-step example YAML or more detail on adapting the script for offline inference?
Certainly! Here’s a step-by-step outline for a minimal Kubernetes deployment using two nodes (each with 1 GPU) for distributed vLLM inference, plus guidance on adapting the run_cluster.sh script for offline inference.
1. Example Kubernetes YAML (using LWS for orchestration):
This YAML uses LWS (LeaderWorkerSet) to orchestrate the Ray and vLLM containers; both the leader and the worker mount the model directory and shared memory.
In the sketch below, replace /path/to/model with the actual path to your model on both nodes.
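Here is a minimal sketch of what such a manifest could look like. It assumes the LWS CRD is installed in the cluster, uses the public vllm/vllm-openai image as a stand-in for any image that has vLLM and Ray installed, and relies on LWS injecting the LWS_LEADER_ADDRESS environment variable; the names (vllm-offline, ray-head, ray-worker), ports, and paths are placeholders to verify against the current LWS and vLLM documentation, not a drop-in manifest.

```yaml
# Sketch only: assumes the LeaderWorkerSet (LWS) CRD is installed and that your
# LWS version injects LWS_LEADER_ADDRESS; adjust image, ports, and paths first.
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: vllm-offline
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2                              # 1 leader + 1 worker = your two GPU nodes
    restartPolicy: RecreateGroupOnPodRestart
    leaderTemplate:
      spec:
        containers:
          - name: ray-head
            image: vllm/vllm-openai:latest   # any image with vLLM + Ray installed
            command: ["/bin/sh", "-c"]
            # Start the Ray head and block so the container stays alive;
            # you later `kubectl exec` into this pod to run offline inference.
            args: ["ray start --head --port=6379 --block"]
            resources:
              limits:
                nvidia.com/gpu: "1"
            volumeMounts:
              - name: model
                mountPath: /models       # same path the Python script uses
              - name: dshm
                mountPath: /dev/shm
        volumes:
          - name: model
            hostPath:
              path: /path/to/model       # model must exist here on the node
          - name: dshm
            emptyDir:
              medium: Memory             # shared memory for Ray/vLLM workers
    workerTemplate:
      spec:
        containers:
          - name: ray-worker
            image: vllm/vllm-openai:latest
            command: ["/bin/sh", "-c"]
            # Join the Ray cluster started by the leader pod.
            args: ["ray start --address=$(LWS_LEADER_ADDRESS):6379 --block"]
            resources:
              limits:
                nvidia.com/gpu: "1"
            volumeMounts:
              - name: model
                mountPath: /models
              - name: dshm
                mountPath: /dev/shm
        volumes:
          - name: model
            hostPath:
              path: /path/to/model
          - name: dshm
            emptyDir:
              medium: Memory
```

Once both pods are Running, you can `kubectl exec` into the leader container and run the script from step 3. Because the model is mounted via hostPath, it genuinely has to exist at the same path on both nodes, which is also why the Python example below points at /models (the in-container mount path).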
2. Adapting run_cluster.sh for Offline Inference:
The script’s main job is to start Ray head/worker nodes and set up the environment.
For offline inference, after the Ray cluster is up, exec into the leader pod/container and run your Python script using the vLLM API (e.g., python my_offline_inference.py).
Your script should use the LLM class with tensor_parallel_size=2 and point to the local model path.
Ensure the model is present at the same path in both containers.
3. Example Python for Offline Inference:
```python
from vllm import LLM, SamplingParams

# tensor_parallel_size=2 shards the model across the two GPUs (one per node);
# "/models" is the in-container mount path from the deployment above.
llm = LLM(model="/models", tensor_parallel_size=2)

prompts = ["Hello, world!", "What is AI?"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```
Limitations:
Network speed between nodes is critical for performance (docs).
There is no official vLLM script for distributed offline inference; you must manually run your script after the Ray cluster is up.
Let me know if you need a more detailed breakdown or help with a specific part!