Proper settings for running Qwen2.5-72B on 48 GB using AWQ

The out-of-memory (OOM) error occurs because, even with AWQ quantization, Qwen2.5-72B-Instruct-AWQ needs more VRAM than two 24 GB RTX 3090s can provide for model weights plus KV cache at your current settings. vLLM's memory footprint is driven by `--max-model-len`, `--max-num-batched-tokens`, and CUDA graph capture, which consumes extra memory in the V1 engine. Lowering `--max-model-len`, `--max-num-batched-tokens`, or `--gpu-memory-utilization`, and passing `--enforce-eager` to disable CUDA graphs, all reduce memory usage. Also make sure you pass `--quantization awq` and `--dtype float16` if not already set, since vLLM's AWQ kernels do not support bfloat16.
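Putting those flags together, a reduced-footprint launch might look like the sketch below. The model path, context length, and utilization fraction are assumptions to adapt to your setup, not known-good values:

```bash
# Example launch for 2x RTX 3090 (24 GB each), tensor-parallel across both.
# A sketch — tune --max-model-len and --gpu-memory-utilization for your workload.
vllm serve Qwen/Qwen2.5-72B-Instruct-AWQ \
  --quantization awq \
  --dtype float16 \
  --tensor-parallel-size 2 \
  --max-model-len 4096 \
  --max-num-batched-tokens 4096 \
  --gpu-memory-utilization 0.92 \
  --enforce-eager
```

If this still OOMs, drop `--max-model-len` further first; it bounds the KV cache vLLM pre-allocates.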

Even with quantization, 72B models are extremely memory-intensive; users have reported OOM on 4x A100 40 GB with similar settings. You may need to further reduce batch size or context length, or increase tensor parallelism if more GPUs are available. For more details and example configs, see the related GitHub issue and vLLM's conserving-memory guide.
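To see why context length matters so much, here is a back-of-the-envelope KV-cache sizing sketch. AWQ quantizes only the weights; the KV cache stays fp16. The layer and head counts below are taken from Qwen2.5-72B's published config (80 layers, 8 KV heads via GQA, head dim 128) — treat them as assumptions to verify against the actual `config.json`:

```python
# Rough KV-cache sizing for Qwen2.5-72B with an fp16 KV cache.
NUM_LAYERS = 80      # hidden layers (assumed from the model config)
NUM_KV_HEADS = 8     # grouped-query attention KV heads
HEAD_DIM = 128       # per-head dimension
BYTES_FP16 = 2       # bytes per fp16 element

# Each token stores one key and one value vector per layer.
bytes_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_FP16

def kv_cache_gib(max_model_len: int) -> float:
    """GiB of KV cache for a single sequence of max_model_len tokens."""
    return max_model_len * bytes_per_token / 2**30

print(f"per token:   {bytes_per_token / 1024:.0f} KiB")   # 320 KiB
print(f"32k context: {kv_cache_gib(32768):.1f} GiB")      # 10.0 GiB
print(f"4k context:  {kv_cache_gib(4096):.2f} GiB")       # 1.25 GiB
```

Under these assumptions, the default 32k context reserves about 10 GiB of KV cache per sequence on top of the quantized weights, while capping at 4k cuts that to about 1.25 GiB — which is why shrinking `--max-model-len` is usually the first fix.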

Would you like a step-by-step breakdown or example command for your setup?
