Embeddings with vLLM in Kubernetes

How can I prepare a Kubernetes manifest to serve embeddings with the vLLM image? We will use a Service and a persistent volume for the weights. What other arguments and environment variables will be needed? The weights will be downloaded from Hugging Face.

To prepare a Kubernetes manifest for vLLM with embeddings, you need a Deployment (using the vLLM image), a PersistentVolumeClaim (PVC) for the model weights, a Secret for your Hugging Face token (if required), and a Service to expose the API. For embeddings, pass the --task embed argument in your container's command or args. Mount the PVC at /root/.cache/huggingface, and set the HUGGING_FACE_HUB_TOKEN environment variable if you are accessing gated models. Example arguments: vllm serve <model_name> --task embed.

A minimal manifest includes:

  • PVC for model cache
  • Secret for Hugging Face token
  • Deployment with:
    • vLLM image
    • args: vllm serve <model_name> --task embed
    • env: HUGGING_FACE_HUB_TOKEN from Secret
    • volumeMounts for model cache
  • Service exposing port 8000

See the code example below and refer to Kubernetes deployment docs, embedding task argument, and model config.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-models
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 50Gi
---
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
type: Opaque
stringData:
  token: "<YOUR_HF_TOKEN>"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        command: ["/bin/sh", "-c"]
        args: ["vllm serve <model_name> --task embed"]
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token-secret
              key: token
        ports:
        - containerPort: 8000
        volumeMounts:
        - name: llama-storage
          mountPath: /root/.cache/huggingface
      volumes:
      - name: llama-storage
        persistentVolumeClaim:
          claimName: vllm-models
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-server
spec:
  selector:
    app: vllm
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
  type: ClusterIP
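
The manifest above can be applied and smoke-tested with kubectl. This is a hedged sketch: the filename vllm-embeddings.yaml is arbitrary, and it assumes you have substituted <model_name> and <YOUR_HF_TOKEN> first; the Deployment and Service names come from the manifest itself.

```shell
# Save the manifest above as vllm-embeddings.yaml (the filename is
# arbitrary), substitute <model_name> and <YOUR_HF_TOKEN>, then apply it:
kubectl apply -f vllm-embeddings.yaml

# Wait for the Deployment to become ready; the first start can take a
# while because the weights are downloaded into the PVC-backed cache:
kubectl rollout status deployment/vllm-server

# The Service is ClusterIP only, so port-forward for a quick local test:
kubectl port-forward svc/vllm-server 8000:8000
```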

Would you like a more detailed breakdown or explanation of each section?

Currently I use the manifest below. I want the vLLM pod to download the model from Hugging Face. Should I use --model in args?

spec:
  containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    imagePullPolicy: IfNotPresent
    command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
    args:
    - "--model"
Yes, you should use the --model argument in args to specify the Hugging Face model name (e.g., --model meta-llama/Meta-Llama-3-8B). This tells vLLM to download the model from Hugging Face if it is not already present in the cache. Make sure to use standard double quotes (") and double dashes (--model), not typographic or encoded characters, in your manifest. Example:

args: ["--model", "meta-llama/Meta-Llama-3-8B"]

For more, see Kubernetes deployment docs and Hugging Face integration.
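
Once the pod is running, you can confirm that the weights were actually fetched into the mounted cache. This is a hedged sketch: it assumes the app=vllm label and the /root/.cache/huggingface mount path from the manifests in this thread.

```shell
# Find the vLLM pod by its app=vllm label (from the Deployment):
POD=$(kubectl get pods -l app=vllm -o jsonpath='{.items[0].metadata.name}')

# Watch the download progress in the logs:
kubectl logs -f "$POD"

# Inspect the Hugging Face cache on the mounted PVC; downloaded models
# appear under models--<org>--<name> directories:
kubectl exec "$POD" -- ls /root/.cache/huggingface/hub
```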

Would you like a full manifest example or more details on other arguments?

provide a complete manifest for embeddings using args --model

Here is a complete Kubernetes manifest for running vLLM embeddings, using the --model argument to download from Hugging Face. This manifest includes a PersistentVolumeClaim, Secret for Hugging Face token, Deployment (with correct args and env), and Service. Replace <YOUR_HF_TOKEN> and <model_name> as needed.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-models
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
---
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
type: Opaque
stringData:
  token: "<YOUR_HF_TOKEN>"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args: ["--model", "<model_name>", "--task", "embed"]
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token-secret
              key: token
        ports:
        - containerPort: 8000
        volumeMounts:
        - name: llama-storage
          mountPath: /root/.cache/huggingface
      volumes:
      - name: llama-storage
        persistentVolumeClaim:
          claimName: vllm-models
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-server
spec:
  selector:
    app: vllm
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
  type: ClusterIP

This manifest will download the model from Hugging Face and serve embeddings via the OpenAI-compatible API. For more, see Kubernetes deployment docs.
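
Once the server is up, the endpoint can be exercised like this. A hedged sketch: it assumes a port-forward to the Service is in place and that <model_name> matches the value passed to --model; jq is an assumption, not a requirement.

```shell
# With a port-forward to the Service in place (kubectl port-forward
# svc/vllm-server 8000:8000), call the OpenAI-compatible embeddings
# endpoint. "input" accepts a single string or a list of strings.
curl -s http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "<model_name>", "input": ["first text", "second text"]}'

# Each vector is returned under data[].embedding; with jq installed
# (an assumption), you can count the returned vectors:
# curl ... | jq '.data | length'
```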

Would you like a step-by-step explanation of each section?