Embeddings with vLLM in Kubernetes

How can I prepare a Kubernetes manifest to serve embeddings with the vLLM image? We will use a Service and a persistent volume for the weights. What other arguments and environment variables will be needed? The weights will be downloaded from Hugging Face.

To prepare a Kubernetes manifest for vLLM with embeddings, you need a Deployment (using the vLLM image), a PersistentVolumeClaim (PVC) for the model weights, a Secret for your Hugging Face token (if required), and a Service to expose the API. For embeddings, pass the --task embed argument in your container's command or args. Mount the PVC at /root/.cache/huggingface, and set the HUGGING_FACE_HUB_TOKEN environment variable if you are accessing gated models. Example arguments: vllm serve <model_name> --task embed.

A minimal manifest includes:

  • PVC for model cache
  • Secret for Hugging Face token
  • Deployment with:
    • vLLM image
    • args: vllm serve <model_name> --task embed
    • env: HUGGING_FACE_HUB_TOKEN from Secret
    • volumeMounts for model cache
  • Service exposing port 8000

See the code example below and refer to Kubernetes deployment docs, embedding task argument, and model config.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-models
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 50Gi
---
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
type: Opaque
stringData:
  token: "<YOUR_HF_TOKEN>"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        command: ["/bin/sh", "-c"]
        args: ["vllm serve <model_name> --task embed"]
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token-secret
              key: token
        ports:
        - containerPort: 8000
        volumeMounts:
        - name: llama-storage
          mountPath: /root/.cache/huggingface
      volumes:
      - name: llama-storage
        persistentVolumeClaim:
          claimName: vllm-models
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-server
spec:
  selector:
    app: vllm
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
  type: ClusterIP
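
The manifest above can be applied and smoke-tested with kubectl. This is a hedged sketch: the filename vllm-embeddings.yaml is arbitrary, and it assumes you have substituted <model_name> and <YOUR_HF_TOKEN> first; the Deployment and Service names come from the manifest itself.

```shell
# Save the manifest above as vllm-embeddings.yaml (the filename is
# arbitrary), substitute <model_name> and <YOUR_HF_TOKEN>, then apply it:
kubectl apply -f vllm-embeddings.yaml

# Wait for the Deployment to become ready; the first start can take a
# while because the weights are downloaded into the PVC-backed cache:
kubectl rollout status deployment/vllm-server

# The Service is ClusterIP only, so port-forward for a quick local test:
kubectl port-forward svc/vllm-server 8000:8000
```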

Would you like a more detailed breakdown or explanation of each section?

Currently I use the manifest below. I want the vLLM pod to download the model from Hugging Face. Should I use --model in args?

spec:
  containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    imagePullPolicy: IfNotPresent
    command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
    args:
    - "--model"
Yes, you should use the --model argument in args to specify the Hugging Face model name (e.g., --model meta-llama/Meta-Llama-3-8B). This tells vLLM to download the model from Hugging Face if it is not already present in the cache. Make sure to use standard double quotes (") and double dashes (--model), not typographic or encoded characters, in your manifest. Example:

args: ["--model", "meta-llama/Meta-Llama-3-8B"]

For more, see Kubernetes deployment docs and Hugging Face integration.
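
Once the pod is running, you can confirm that the weights were actually fetched into the mounted cache. This is a hedged sketch: it assumes the app=vllm label and the /root/.cache/huggingface mount path from the manifests in this thread.

```shell
# Find the vLLM pod by its app=vllm label (from the Deployment):
POD=$(kubectl get pods -l app=vllm -o jsonpath='{.items[0].metadata.name}')

# Watch the download progress in the logs:
kubectl logs -f "$POD"

# Inspect the Hugging Face cache on the mounted PVC; downloaded models
# appear under models--<org>--<name> directories:
kubectl exec "$POD" -- ls /root/.cache/huggingface/hub
```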

Would you like a full manifest example or more details on other arguments?

provide a complete manifest for embeddings using args --model

Here is a complete Kubernetes manifest for running vLLM embeddings, using the --model argument to download from Hugging Face. This manifest includes a PersistentVolumeClaim, Secret for Hugging Face token, Deployment (with correct args and env), and Service. Replace <YOUR_HF_TOKEN> and <model_name> as needed.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-models
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
---
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
type: Opaque
stringData:
  token: "<YOUR_HF_TOKEN>"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args: ["--model", "<model_name>", "--task", "embed"]
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token-secret
              key: token
        ports:
        - containerPort: 8000
        volumeMounts:
        - name: llama-storage
          mountPath: /root/.cache/huggingface
      volumes:
      - name: llama-storage
        persistentVolumeClaim:
          claimName: vllm-models
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-server
spec:
  selector:
    app: vllm
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
  type: ClusterIP

This manifest will download the model from Hugging Face and serve embeddings via the OpenAI-compatible API. For more, see Kubernetes deployment docs.
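
Once the server is up, the endpoint can be exercised like this. A hedged sketch: it assumes a port-forward to the Service is in place and that <model_name> matches the value passed to --model; jq is an assumption, not a requirement.

```shell
# With a port-forward to the Service in place (kubectl port-forward
# svc/vllm-server 8000:8000), call the OpenAI-compatible embeddings
# endpoint. "input" accepts a single string or a list of strings.
curl -s http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "<model_name>", "input": ["first text", "second text"]}'

# Each vector is returned under data[].embedding; with jq installed
# (an assumption), you can count the returned vectors:
# curl ... | jq '.data | length'
```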

Would you like a step-by-step explanation of each section?