Deploy a big LLM when GPU VRAM not enough

yes provide the complete manifest

It seems likely that a minimal Kubernetes manifest for deploying a Qwen model with vLLM, AWQ quantization, CPU offload, and tool calling enabled would look like this (replace placeholders as needed):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-qwen
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-qwen
  template:
    metadata:
      labels:
        app: vllm-qwen
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
          - "--model"
          - "QuantTrio/Qwen3-Coder-480B-A35B-Instruct-AWQ"
          - "--tensor-parallel-size"
          - "4"
          - "--max-model-len"
          - "1024"
          - "--quantization"
          - "awq"
          - "--cpu-offload-gb"
          - "512"
          - "--enable-auto-tool-choice"
          - "--tool-call-parser"
          - "hermes"
          - "--trust-remote-code"
        env:
          - name: HUGGING_FACE_HUB_TOKEN
            valueFrom:
              secretKeyRef:
                name: hf-token-secret
                key: token
          - name: PYTORCH_CUDA_ALLOC_CONF
            value: "expandable_segments:True"
        resources:
          limits:
            nvidia.com/gpu: "4"
            memory: 64Gi
          requests:
            nvidia.com/gpu: "4"
            memory: 64Gi
        volumeMounts:
          - name: cache-volume
            mountPath: /root/.cache/huggingface
          - name: shm
            mountPath: /dev/shm
      volumes:
        - name: cache-volume
          persistentVolumeClaim:
            claimName: qwen-model-cache
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: "8Gi"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: qwen-model-cache
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 200Gi
---
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
type: Opaque
stringData:
  token: "<your-huggingface-token>"
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-qwen
spec:
  selector:
    app: vllm-qwen
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
  type: ClusterIP

You may need to adjust resource requests/limits and storage as appropriate for your cluster.

Would you like more detail on any section?

Sources: