How can I prepare a Kubernetes manifest to accommodate embeddings with the vLLM image? We will use a Service and a persistent volume for the weights. What other arguments and environment variables will be needed? The weights will be downloaded from Hugging Face.
To prepare a Kubernetes manifest for vLLM with embeddings, you need a Deployment (using the vLLM image), a PersistentVolumeClaim (PVC) for the model weights, a Secret for your Hugging Face token (if required), and a Service to expose the API. For embeddings, pass the --task embed argument in your container's command or args. You must also mount the PVC at /root/.cache/huggingface and set the HUGGING_FACE_HUB_TOKEN environment variable if accessing gated models. Example command: vllm serve <model_name> --task embed.
A minimal manifest includes:
- PVC for the model cache
- Secret for the Hugging Face token
- Deployment with:
  - vLLM image
  - args: vllm serve <model_name> --task embed
  - env: HUGGING_FACE_HUB_TOKEN from the Secret
  - volumeMounts for the model cache
- Service exposing port 8000
See the code example below and refer to the Kubernetes deployment docs, the embedding task argument, and the model config.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-models
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 50Gi
---
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
type: Opaque
stringData:
  token: "<YOUR_HF_TOKEN>"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          command: ["/bin/sh", "-c"]
          args: ["vllm serve <model_name> --task embed"]
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token-secret
                  key: token
          ports:
            - containerPort: 8000
          volumeMounts:
            - name: llama-storage
              mountPath: /root/.cache/huggingface
      volumes:
        - name: llama-storage
          persistentVolumeClaim:
            claimName: vllm-models
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-server
spec:
  selector:
    app: vllm
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
  type: ClusterIP
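Once the pod is running, you can sanity-check the embeddings endpoint from your workstation. The sketch below is a minimal example, assuming you have port-forwarded the Service locally (kubectl port-forward svc/vllm-server 8000:8000) and replaced <model_name> with your actual model; the helper names build_embedding_request and embed are hypothetical, not part of vLLM.

```python
import json
import urllib.request

# Assumes: kubectl port-forward svc/vllm-server 8000:8000
BASE_URL = "http://localhost:8000"

def build_embedding_request(model: str, texts: list) -> bytes:
    """Encode the JSON body for the OpenAI-compatible /v1/embeddings endpoint."""
    return json.dumps({"model": model, "input": texts}).encode("utf-8")

def embed(texts: list, model: str = "<model_name>") -> list:
    """POST texts to the vLLM server and return the embedding vectors."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/embeddings",
        data=build_embedding_request(model, texts),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # OpenAI-style response: vectors live under data[i]["embedding"]
    return [item["embedding"] for item in body["data"]]
```

This mirrors what the OpenAI-compatible API expects, so any OpenAI SDK pointed at the same base URL should work equally well.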
Would you like a more detailed breakdown or explanation of each section?
Currently I use the manifest below. I want the vLLM pod to download the model from Hugging Face. Should I use –model in args?
spec:
  containers:
    - name: vllm
      image: vllm/vllm-openai:latest
      imagePullPolicy: IfNotPresent
      command: [“python3”, “-m”, “vllm.entrypoints.openai.api_server”]
      args:
        - “–model”
Yes, you should use the --model argument in args to specify the Hugging Face model name (e.g., --model meta-llama/Llama-3-8B). This tells vLLM to download the model from Hugging Face if it is not already present in the cache. Make sure to use standard double quotes (") and double dashes (--model) in your manifest, not the typographic quotes and en-dash (“ and –) that appear in your current args, as those will not be parsed as valid flags. Example:
args: ["--model", "meta-llama/Llama-3-8B"]
For more, see Kubernetes deployment docs and Hugging Face integration.
Would you like a full manifest example or more details on other arguments?
provide complete manifest for embeddings using args –model
Here is a complete Kubernetes manifest for running vLLM embeddings, using the --model argument to download from Hugging Face. It includes a PersistentVolumeClaim, a Secret for the Hugging Face token, a Deployment (with the correct args and env), and a Service. Replace <YOUR_HF_TOKEN> and <model_name> as needed.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-models
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
---
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
type: Opaque
stringData:
  token: "<YOUR_HF_TOKEN>"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
          args: ["--model", "<model_name>", "--task", "embed"]
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token-secret
                  key: token
          ports:
            - containerPort: 8000
          volumeMounts:
            - name: llama-storage
              mountPath: /root/.cache/huggingface
      volumes:
        - name: llama-storage
          persistentVolumeClaim:
            claimName: vllm-models
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-server
spec:
  selector:
    app: vllm
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
  type: ClusterIP
This manifest will download the model from Hugging Face and serve embeddings via the OpenAI-compatible API. For more, see Kubernetes deployment docs.
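Once the server returns embeddings in the OpenAI-compatible response shape, a common next step is scoring similarity between texts. The sketch below is pure Python (no cluster needed): it extracts vectors from a response-shaped dict and computes cosine similarity. The sample response values are made up for illustration; only the response shape (data[i]["embedding"]) reflects the actual API.

```python
import math

def extract_vectors(response: dict) -> list:
    """Pull embedding vectors out of an OpenAI-style /v1/embeddings response."""
    return [item["embedding"] for item in response["data"]]

def cosine_similarity(a: list, b: list) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up payload shaped like the server's OpenAI-compatible output.
sample = {
    "object": "list",
    "data": [
        {"object": "embedding", "index": 0, "embedding": [1.0, 0.0]},
        {"object": "embedding", "index": 1, "embedding": [0.6, 0.8]},
    ],
}
v0, v1 = extract_vectors(sample)
print(round(cosine_similarity(v0, v1), 3))  # 0.6
```

Real embedding vectors from the server will have hundreds or thousands of dimensions, but the extraction and scoring logic is identical.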
Would you like a step-by-step explanation of each section?