yes provide the complete manifest
It seems likely that a minimal Kubernetes manifest for deploying a Qwen model with vLLM, AWQ quantization, CPU offload, and tool calling enabled would look like this (replace placeholders as needed):
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-qwen
spec:
replicas: 1
selector:
matchLabels:
app: vllm-qwen
template:
metadata:
labels:
app: vllm-qwen
spec:
containers:
- name: vllm
image: vllm/vllm-openai:latest
args:
- "--model"
- "QuantTrio/Qwen3-Coder-480B-A35B-Instruct-AWQ"
- "--tensor-parallel-size"
- "4"
- "--max-model-len"
- "1024"
- "--quantization"
- "awq"
- "--cpu-offload-gb"
- "512"
- "--enable-auto-tool-choice"
- "--tool-call-parser"
- "hermes"
- "--trust-remote-code"
env:
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
name: hf-token-secret
key: token
- name: PYTORCH_CUDA_ALLOC_CONF
value: "expandable_segments:True"
resources:
limits:
nvidia.com/gpu: "4"
memory: 64Gi
requests:
nvidia.com/gpu: "4"
memory: 64Gi
volumeMounts:
- name: cache-volume
mountPath: /root/.cache/huggingface
- name: shm
mountPath: /dev/shm
volumes:
- name: cache-volume
persistentVolumeClaim:
claimName: qwen-model-cache
- name: shm
emptyDir:
medium: Memory
sizeLimit: "8Gi"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: qwen-model-cache
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 200Gi
---
apiVersion: v1
kind: Secret
metadata:
name: hf-token-secret
type: Opaque
stringData:
token: "<your-huggingface-token>"
---
apiVersion: v1
kind: Service
metadata:
name: vllm-qwen
spec:
selector:
app: vllm-qwen
ports:
- protocol: TCP
port: 8000
targetPort: 8000
type: ClusterIP
You may need to adjust resource requests/limits and storage as appropriate for your cluster.
Would you like more detail on any section?
Sources:
- https://github.com/vllm-project/vllm/blob/main/docs/deployment/k8s.md
- https://github.com/vllm-project/vllm/issues/13508
Note: I wasn’t able to find highly relevant data sources, but above are a few potentially relevant links.