Deploying a large LLM when GPU VRAM is insufficient

To reduce memory usage, lower --max-model-len (e.g., 1024 or 2048), set --max-num-seqs 1 so only one sequence is processed at a time, and reduce --gpu-memory-utilization (e.g., 0.7). The first two settings shrink the KV cache, which usually dominates VRAM usage beyond the model weights; the third caps how much VRAM vLLM reserves up front. Additionally, setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True can reduce allocator fragmentation and may prevent out-of-memory (OOM) errors, as reported in several vLLM issues.
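To see why these flags matter, here is a back-of-envelope estimate of KV cache size. This is an illustrative sketch, not vLLM's internal accounting; the model shapes (32 layers, 32 KV heads, head dimension 128, i.e., roughly Llama-2-7B-like) and the fp16 assumption are my own example values:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """Rough KV cache size: 2x for key and value tensors, fp16 = 2 bytes/element.
    Illustrative arithmetic only; real engines add paging overhead and padding."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Hypothetical 7B-class shapes: 32 layers, 32 KV heads, head_dim 128.
full = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=8)   # long context, 8 sequences
small = kv_cache_bytes(32, 32, 128, seq_len=1024, batch=1)  # --max-model-len 1024, --max-num-seqs 1
print(f"{full / 2**30:.1f} GiB vs {small / 2**30:.2f} GiB")  # 16.0 GiB vs 0.50 GiB
```

Cutting the context length by 4x and the batch to 1 shrinks this example's KV cache by 32x, which is often the difference between OOM and a working deployment.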

Example args:

```yaml
args:
  - "--max-model-len"
  - "1024"
  - "--max-num-seqs"
  - "1"
  - "--gpu-memory-utilization"
  - "0.7"
```

And set the environment variable in your pod:

```yaml
env:
  - name: PYTORCH_CUDA_ALLOC_CONF
    value: "expandable_segments:True"
```
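Putting the pieces together, a minimal Deployment sketch might look like the following. The deployment name, the vllm/vllm-openai image tag, the single-GPU resource limit, and the <your-model> placeholder are all assumptions to adapt to your cluster:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server        # assumed name; rename as needed
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest   # pin a specific tag in production
          args:
            - "--model"
            - "<your-model>"               # placeholder; set your model ID
            - "--max-model-len"
            - "1024"
            - "--max-num-seqs"
            - "1"
            - "--gpu-memory-utilization"
            - "0.7"
          env:
            - name: PYTORCH_CUDA_ALLOC_CONF
              value: "expandable_segments:True"
          resources:
            limits:
              nvidia.com/gpu: "1"          # assumes one GPU per pod
```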

