Even with AWQ quantization and `--cpu-offload-gb`, vLLM still needs enough VRAM to hold part of the model weights plus the KV cache; if the two together exceed available VRAM, you will hit CUDA out-of-memory (OOM) errors. Note that `--cpu-offload-gb` offloads only model weights, not the KV cache, which must still fit in VRAM. For extremely large models such as Qwen3-Coder-480B, 44 GB of VRAM per GPU may not be enough even with quantization and offload, especially at a maximum sequence length of 4096 with 4-way tensor parallelism. See the AutoAWQ documentation and vLLM's CPU offload usage notes.
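To see why the KV cache matters here, you can estimate its size with the standard formula: 2 tensors (K and V) per layer, each of shape [batch, seq_len, num_kv_heads, head_dim]. A minimal sketch, using illustrative dimensions (62 layers, 8 KV heads, head_dim 128) that are assumptions for the example, not Qwen3-Coder-480B's actual config; check the model's `config.json` for real values:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, dtype_bytes: int = 2) -> int:
    """Rough KV cache size: 2 tensors (K and V) per layer,
    each [batch, seq_len, num_kv_heads, head_dim]."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * dtype_bytes

# Illustrative config (NOT the real Qwen3-Coder-480B dimensions):
# 62 layers, 8 KV heads (GQA), head_dim 128, seq_len 4096, batch 1, fp16
gb = kv_cache_bytes(62, 8, 128, 4096, 1, 2) / 1024**3
print(f"{gb:.2f} GiB per sequence")  # → 0.97 GiB per sequence
```

With tensor parallelism the KV cache is sharded across GPUs, but vLLM also pre-allocates cache blocks for many concurrent sequences up to `--gpu-memory-utilization`, so the total cache reservation can be far larger than a single sequence's footprint.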
Would you like more detail on how to further reduce memory usage or debug this?