Deploying a large LLM when GPU VRAM is insufficient

To reduce memory usage, lower --max-model-len (e.g., 1024 or 2048), set --max-num-seqs 1 so only one sequence is processed at a time, and reduce --gpu-memory-utilization (e.g., 0.7). The first two settings shrink the KV cache, which usually dominates VRAM usage beyond the model weights; the third caps how much VRAM vLLM reserves up front. Additionally, setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True can reduce allocator fragmentation and may prevent out-of-memory (OOM) errors, as reported in several vLLM issues.
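To see why these flags matter, here is a back-of-envelope estimate of KV cache size. This is an illustrative sketch, not vLLM's internal accounting; the model shapes (32 layers, 32 KV heads, head dimension 128, i.e., roughly Llama-2-7B-like) and the fp16 assumption are my own example values:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Rough KV cache size: 2x for key and value tensors, fp16 = 2 bytes/element.
    Illustrative arithmetic only; real engines add paging overhead and padding."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Hypothetical 7B-class shapes: 32 layers, 32 KV heads, head_dim 128.
full = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=8)   # long context, 8 sequences
small = kv_cache_bytes(32, 32, 128, seq_len=1024, batch=1)  # --max-model-len 1024, --max-num-seqs 1
print(f"{full / 2**30:.1f} GiB vs {small / 2**30:.2f} GiB")  # 16.0 GiB vs 0.50 GiB
```

Cutting the context length by 4x and the batch to 1 shrinks this example's KV cache by 32x, which is often the difference between OOM and a working deployment.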

Example args:

```yaml
args:
  - "--max-model-len"
  - "1024"
  - "--max-num-seqs"
  - "1"
  - "--gpu-memory-utilization"
  - "0.7"
```

And set the environment variable in your pod:

```yaml
env:
  - name: PYTORCH_CUDA_ALLOC_CONF
    value: "expandable_segments:True"
```
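Putting the pieces together, a minimal Deployment sketch might look like the following. The deployment name, the vllm/vllm-openai image tag, the single-GPU resource limit, and the <your-model> placeholder are all assumptions to adapt to your cluster:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server        # assumed name; rename as needed
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest   # pin a specific tag in production
          args:
            - "--model"
            - "<your-model>"               # placeholder; set your model ID
            - "--max-model-len"
            - "1024"
            - "--max-num-seqs"
            - "1"
            - "--gpu-memory-utilization"
            - "0.7"
          env:
            - name: PYTORCH_CUDA_ALLOC_CONF
              value: "expandable_segments:True"
          resources:
            limits:
              nvidia.com/gpu: "1"          # assumes one GPU per pod
```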

