Proper settings for running Qwen2.5 72B on 48 GB using AWQ

@runvllm

I'm experiencing an out-of-memory error when running Qwen2.5 72B Instruct AWQ. I am able to run this model with TabbyAPI on EXL2 and with other inference engines, but I'm looking to migrate to vLLM.

I am running on 2x 24 GB 3090s.

My vLLM config:

                vllm serve Qwen/Qwen2.5-72B-Instruct-AWQ \
                  --trust-remote-code \
                  --enable-chunked-prefill \
                  --max_num_batched_tokens 1024 \
                  --served-model-name default qwen2.5-72b \
                  --tensor-parallel-size 2 \
                  --tool-call-parser hermes \
                  --enable-auto-tool-choice \
                  --max-model-len 8192 \
                  --gpu_memory_utilization 0.95

vLLM log:

(VllmWorker rank=1 pid=104) ERROR 06-21 11:39:13 [multiproc_executor.py:380] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 30.00 MiB. GPU 1 has a total capacity of 23.48 GiB of which 30.25 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 21.51 GiB is allocated by PyTorch, with 146.00 MiB allocated in private pools (e.g., CUDA Graphs), and 138.11 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
(VllmWorker rank=1 pid=104) ERROR 06-21 11:39:13 [multiproc_executor.py:380]
ERROR 06-21 11:39:13 [core.py:387] EngineCore hit an exception: Traceback (most recent call last):
ERROR 06-21 11:39:13 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 378, in run_engine_core
ERROR 06-21 11:39:13 [core.py:387]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 06-21 11:39:13 [core.py:387]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-21 11:39:13 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 320, in __init__
ERROR 06-21 11:39:13 [core.py:387]     super().__init__(vllm_config, executor_class, log_stats)
ERROR 06-21 11:39:13 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 71, in __init__
ERROR 06-21 11:39:13 [core.py:387]     self._initialize_kv_caches(vllm_config)
ERROR 06-21 11:39:13 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 160, in _initialize_kv_caches
ERROR 06-21 11:39:13 [core.py:387]     self.model_executor.initialize_from_config(kv_cache_configs)
ERROR 06-21 11:39:13 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 63, in initialize_from_config
ERROR 06-21 11:39:13 [core.py:387]     self.collective_rpc("compile_or_warm_up_model")
ERROR 06-21 11:39:13 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 133, in collective_rpc
ERROR 06-21 11:39:13 [core.py:387]     raise e
ERROR 06-21 11:39:13 [core.py:387]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 122, in collective_rpc
ERROR 06-21 11:39:13 [core.py:387]     raise RuntimeError(
ERROR 06-21 11:39:13 [core.py:387] RuntimeError: ('Worker failed with error %s, please check the stack trace above for the root cause', 'CUDA out of memory. Tried to allocate 30.00 MiB. GPU 0 has a total capacity of 23.48 GiB of which 30.25 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 21.51 GiB is allocated by PyTorch, with 146.00 MiB allocated in private pools (e.g., CUDA Graphs), and 138.11 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)')
ERROR 06-21 11:39:13 [core.py:387]
CRITICAL 06-21 11:39:13 [core_client.py:359] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
Killed

This should fit without issue, as the AWQ model is 21 GB in total, so I'm looking for help.

The out-of-memory (OOM) error occurs because, even with AWQ quantization, your current settings leave almost no headroom on 2x 24 GB 3090s once model weights, KV cache, and CUDA graph buffers are accounted for. With --gpu_memory_utilization 0.95, vLLM tries to claim roughly 22.3 GiB of each card's 23.48 GiB, and the traceback shows the failure inside compile_or_warm_up_model, i.e. during CUDA graph capture after the KV cache has already been allocated. vLLM's memory footprint is driven by --max-model-len, --max_num_batched_tokens, and CUDA graph compilation, which consumes extra memory in the V1 engine. Lowering --max-model-len, --max_num_batched_tokens, or --gpu_memory_utilization, or passing --enforce-eager to disable CUDA graphs, reduces that pressure. Also make sure you are passing --quantization awq and --dtype float16 if they are not already set, as AWQ does not support bfloat16 in vLLM.
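
As a rough starting point, a serve command along these lines tends to fit on 2x 24 GB cards; the specific values below (--max-model-len 4096, --max-num-batched-tokens 512, --gpu-memory-utilization 0.90) are assumptions to tune for your workload, not verified settings:

                vllm serve Qwen/Qwen2.5-72B-Instruct-AWQ \
                  --quantization awq \
                  --dtype float16 \
                  --tensor-parallel-size 2 \
                  --max-model-len 4096 \
                  --max-num-batched-tokens 512 \
                  --gpu-memory-utilization 0.90 \
                  --enforce-eager \
                  --enable-chunked-prefill \
                  --served-model-name default qwen2.5-72b \
                  --tool-call-parser hermes \
                  --enable-auto-tool-choice \
                  --trust-remote-code

With --enforce-eager you trade some throughput for the memory that CUDA graph capture would otherwise hold; if this starts cleanly, you can try re-enabling CUDA graphs or raising --gpu-memory-utilization back up in small steps.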

Even with quantization, 72B models are extremely memory-intensive; users have reported OOM on 4x A100 40 GB with similar settings. You may need to further reduce batch size or context length, or increase tensor parallelism if more GPUs are available. For more details and example configs, see this issue and the conserving-memory guide.
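
The traceback itself also suggests the PyTorch allocator's expandable-segments mode to reduce fragmentation. It will not rescue a genuinely over-committed memory budget, but it is cheap to try alongside the reduced settings above:

                # prepend the allocator hint to the same serve command as above
                PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True vllm serve Qwen/Qwen2.5-72B-Instruct-AWQ \
                  --tensor-parallel-size 2 --max-model-len 4096 --gpu-memory-utilization 0.90 --enforce-eager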

Would you like a step-by-step breakdown or example command for your setup?
