Proper settings for running Qwen2.5-72B on 48 GB using AWQ

The out-of-memory (OOM) error occurs because, even with AWQ quantization, Qwen2.5-72B-Instruct-AWQ needs more VRAM than two 24 GB RTX 3090s can provide for model weights plus KV cache at your current settings. vLLM's memory footprint is driven by `--max-model-len`, `--max-num-batched-tokens`, and CUDA graph capture, which consumes extra memory in the V1 engine. Lowering `--max-model-len`, `--max-num-batched-tokens`, or `--gpu-memory-utilization`, and passing `--enforce-eager` to disable CUDA graphs, all reduce memory usage. Also make sure you pass `--quantization awq` and `--dtype float16` if not already set, since vLLM's AWQ kernels do not support bfloat16.
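Putting those flags together, a reduced-footprint launch might look like the sketch below. The model path, context length, and utilization fraction are assumptions to adapt to your setup, not known-good values:

```bash
# Example launch for 2x RTX 3090 (24 GB each), tensor-parallel across both.
# A sketch — tune --max-model-len and --gpu-memory-utilization for your workload.
vllm serve Qwen/Qwen2.5-72B-Instruct-AWQ \
  --quantization awq \
  --dtype float16 \
  --tensor-parallel-size 2 \
  --max-model-len 4096 \
  --max-num-batched-tokens 4096 \
  --gpu-memory-utilization 0.92 \
  --enforce-eager
```

If this still OOMs, drop `--max-model-len` further first; it bounds the KV cache vLLM pre-allocates.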

Even with quantization, 72B models are extremely memory-intensive; users have reported OOM on 4x A100 40 GB with similar settings. You may need to further reduce batch size or context length, or increase tensor parallelism if more GPUs are available. For more details and example configs, see the related GitHub issue and vLLM's conserving-memory guide.
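To see why context length matters so much, here is a back-of-the-envelope KV-cache sizing sketch. AWQ quantizes only the weights; the KV cache stays fp16. The layer and head counts below are taken from Qwen2.5-72B's published config (80 layers, 8 KV heads via GQA, head dim 128) — treat them as assumptions to verify against the actual `config.json`:

```python
# Rough KV-cache sizing for Qwen2.5-72B with an fp16 KV cache.
NUM_LAYERS = 80      # hidden layers (assumed from the model config)
NUM_KV_HEADS = 8     # grouped-query attention KV heads
HEAD_DIM = 128       # per-head dimension
BYTES_FP16 = 2       # bytes per fp16 element

# Each token stores one key and one value vector per layer.
bytes_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_FP16

def kv_cache_gib(max_model_len: int) -> float:
    """GiB of KV cache for a single sequence of max_model_len tokens."""
    return max_model_len * bytes_per_token / 2**30

print(f"per token:   {bytes_per_token / 1024:.0f} KiB")   # 320 KiB
print(f"32k context: {kv_cache_gib(32768):.1f} GiB")      # 10.0 GiB
print(f"4k context:  {kv_cache_gib(4096):.2f} GiB")       # 1.25 GiB
```

Under these assumptions, the default 32k context reserves about 10 GiB of KV cache per sequence on top of the quantized weights, while capping at 4k cuts that to about 1.25 GiB — which is why shrinking `--max-model-len` is usually the first fix.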

Would you like a step-by-step breakdown or example command for your setup?
