Even with AWQ quantization and `--cpu-offload-gb`, vLLM still needs enough VRAM to hold part of the model weights plus the KV cache; if the two together exceed available VRAM, you will hit CUDA out-of-memory (OOM) errors. Note that `--cpu-offload-gb` offloads only model weights, not the KV cache, which must still fit in VRAM. For extremely large models such as Qwen3-Coder-480B, 44 GB of VRAM per GPU may not be enough even with quantization and offload, especially at a maximum sequence length of 4096 with 4-way tensor parallelism. See the AutoAWQ documentation and vLLM's CPU offload usage notes.
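To see why the KV cache matters here, you can estimate its size with the standard formula: 2 tensors (K and V) per layer, each of shape [batch, seq_len, num_kv_heads, head_dim]. A minimal sketch, using illustrative dimensions (62 layers, 8 KV heads, head_dim 128) that are assumptions for the example, not Qwen3-Coder-480B's actual config; check the model's `config.json` for real values:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, dtype_bytes: int = 2) -> int:
    """Rough KV cache size: 2 tensors (K and V) per layer,
    each [batch, seq_len, num_kv_heads, head_dim]."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * dtype_bytes

# Illustrative config (NOT the real Qwen3-Coder-480B dimensions):
# 62 layers, 8 KV heads (GQA), head_dim 128, seq_len 4096, batch 1, fp16
gb = kv_cache_bytes(62, 8, 128, 4096, 1, 2) / 1024**3
print(f"{gb:.2f} GiB per sequence")  # → 0.97 GiB per sequence
```

With tensor parallelism the KV cache is sharded across GPUs, but vLLM also pre-allocates cache blocks for many concurrent sequences up to `--gpu-memory-utilization`, so the total cache reservation can be far larger than a single sequence's footprint.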
Would you like more detail on how to further reduce memory usage or debug this?