Deploy a big LLM when GPU VRAM not enough

anticool · August 13, 2025, 10:31pm

yes provide the complete manifest

RunLLM · August 13, 2025, 10:32pm

It seems likely that a minimal Kubernetes manifest for deploying a Qwen model with vLLM, AWQ quantization, CPU offload, and tool calling enabled would look like this (replace placeholders as needed):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-qwen
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-qwen
  template:
    metadata:
      labels:
        app: vllm-qwen
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
          - "--model"
          - "QuantTrio/Qwen3-Coder-480B-A35B-Instruct-AWQ"
          - "--tensor-parallel-size"
          - "4"
          - "--max-model-len"
          - "1024"
          - "--quantization"
          - "awq"
          - "--cpu-offload-gb"
          - "512"
          - "--enable-auto-tool-choice"
          - "--tool-call-parser"
          - "hermes"
          - "--trust-remote-code"
        env:
          - name: HUGGING_FACE_HUB_TOKEN
            valueFrom:
              secretKeyRef:
                name: hf-token-secret
                key: token
          - name: PYTORCH_CUDA_ALLOC_CONF
            value: "expandable_segments:True"
        resources:
          limits:
            nvidia.com/gpu: "4"
            memory: 64Gi
          requests:
            nvidia.com/gpu: "4"
            memory: 64Gi
        volumeMounts:
          - name: cache-volume
            mountPath: /root/.cache/huggingface
          - name: shm
            mountPath: /dev/shm
      volumes:
        - name: cache-volume
          persistentVolumeClaim:
            claimName: qwen-model-cache
        - name: shm
          emptyDir:
            medium: Memory
            sizeLimit: "8Gi"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: qwen-model-cache
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 200Gi
---
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret
type: Opaque
stringData:
  token: "<your-huggingface-token>"
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-qwen
spec:
  selector:
    app: vllm-qwen
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
  type: ClusterIP

You may need to adjust resource requests/limits and storage as appropriate for your cluster.

Would you like more detail on any section?

Sources:

https://github.com/vllm-project/vllm/blob/main/docs/deployment/k8s.md
https://github.com/vllm-project/vllm/issues/13508
Note: I wasn’t able to find highly relevant data sources, but above are a few potentially relevant links.

Topic		Replies	Views
Need to serve a Qwen3 LLM with 235B params General	9	805	March 7, 2026
vLLM v1 forces me to pre-allocate a huge, non-reclaimable GPU KV cache for long contexts, and none of the current offload or quantization options solve the resulting VRAM bloat without crippling speed. V1 Feedback	5	1007	September 8, 2025
Multi-node K8s GPU pooling Features	3	710	July 17, 2025
torch.OutOfMemoryError: CUDA out of memory General	15	2503	March 3, 2026
Error deploy model with ray General	5	767	August 17, 2025

Deploy a big LLM when GPU VRAM not enough

Related topics