To reduce memory usage for a single-user, single-sequence workload, lower `--max-model-len` (e.g., 1024 or 2048), set `--max-num-seqs 1`, and reduce `--gpu-memory-utilization` (e.g., 0.7). A shorter context length and a batch size of one shrink the preallocated KV cache, and a lower utilization target caps how much VRAM vLLM claims overall. Setting `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` can also reduce allocator fragmentation and may prevent OOM errors, as suggested in multiple vLLM issues.
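To see why `--max-model-len` has such a large effect, here is a rough back-of-the-envelope sketch of KV cache size. The model dimensions below (layers, KV heads, head size) are assumptions for a Llama-2-7B-like model, not values taken from your deployment, and vLLM's actual preallocation also depends on `--gpu-memory-utilization`:

```python
# Assumed model dimensions (Llama-2-7B-like); real values vary per model.
NUM_LAYERS = 32
NUM_KV_HEADS = 32
HEAD_DIM = 128
DTYPE_BYTES = 2  # fp16/bf16

def kv_cache_bytes(seq_len: int, num_seqs: int = 1) -> int:
    """Worst-case KV cache: 2 tensors (K and V) per layer, per token, per sequence."""
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * DTYPE_BYTES
    return per_token * seq_len * num_seqs

# Lowering --max-model-len from 4096 to 1024 cuts the worst case 4x:
print(kv_cache_bytes(4096) / 2**30, "GiB at 4096 tokens")
print(kv_cache_bytes(1024) / 2**30, "GiB at 1024 tokens")
```

With these assumed dimensions, each cached token costs about 0.5 MiB, so the 4096-to-1024 drop saves roughly 1.5 GiB of worst-case KV cache per sequence.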
Example args:

```yaml
args:
- "--max-model-len"
- "1024"
- "--max-num-seqs"
- "1"
- "--gpu-memory-utilization"
- "0.7"
```
And set the environment variable in your pod:

```yaml
env:
- name: PYTORCH_CUDA_ALLOC_CONF
  value: "expandable_segments:True"
```
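For context, both fragments belong under the same container in the pod spec. A minimal sketch follows; the container name, image tag, and model name are placeholders, not values from your cluster:

```yaml
# Hypothetical container fragment; image and model are placeholders.
containers:
- name: vllm
  image: vllm/vllm-openai:latest
  args:
  - "--model"
  - "meta-llama/Llama-2-7b-hf"
  - "--max-model-len"
  - "1024"
  - "--max-num-seqs"
  - "1"
  - "--gpu-memory-utilization"
  - "0.7"
  env:
  - name: PYTORCH_CUDA_ALLOC_CONF
    value: "expandable_segments:True"
```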
Would you like more detail or a full manifest example?