I've run into a strange phenomenon. All 100 prompts finish inference successfully, but ranks 1, 2, and 3 report an OOM error when inference ends. I don't quite understand why the OOM error appears after inference has completed rather than during inference.
An out-of-memory (OOM) error that shows up after inference rather than during it usually comes down to how memory is managed and released at the end of the run. According to vLLM's troubleshooting guide, vLLM pre-allocates most of the available GPU memory for KV cache blocks (the fraction is controlled by gpu_memory_utilization), which can lead to OOM errors if that memory is not released properly or if there is a memory leak.
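For illustration only, here is a minimal cleanup sketch that explicitly drops the engine and returns its pre-allocated GPU memory once generation has finished. The model name, the prompt, and the torch-based teardown calls are my assumptions about a typical setup, not something taken from the troubleshooting guide, and the exact teardown you need will depend on how your distributed run is launched:

```python
import gc

import torch
import torch.distributed as dist
from vllm import LLM

llm = LLM(model="facebook/opt-125m")           # placeholder model
outputs = llm.generate(["Hello, my name is"])  # inference itself succeeds

# Explicitly release the engine and the GPU memory it pre-allocated,
# so nothing lingers after the last prompt has been generated.
del llm
gc.collect()                  # force Python to free the engine objects now
torch.cuda.empty_cache()      # return cached allocator blocks to the driver
if dist.is_initialized():     # tear down the process group used by the ranks
    dist.destroy_process_group()
```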
Additionally, the vLLM GitHub issues suggest that lowering gpu_memory_utilization helps keep memory usage in check; try reducing it and see whether the OOM error goes away (a sketch follows below). Also make sure that any other processes using the GPU are kept to a minimum so that more memory stays free.
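As a rough sketch of that adjustment: the model name, the 0.7 value, and tensor_parallel_size=4 below are placeholders (the last one is only a guess based on the ranks appearing in your error), so substitute your own settings:

```python
from vllm import LLM, SamplingParams

# Reserve a smaller fraction of GPU memory for the KV cache (default is 0.9),
# leaving headroom for everything else the process allocates.
llm = LLM(
    model="facebook/opt-125m",   # placeholder model
    gpu_memory_utilization=0.7,  # lowered from the 0.9 default
    tensor_parallel_size=4,      # assumed from the ranks in the error message
)

sampling_params = SamplingParams(temperature=0.8, max_tokens=128)
outputs = llm.generate(["Hello, my name is"], sampling_params)
print(outputs[0].outputs[0].text)
```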
Would you like more detailed guidance on adjusting these parameters or further troubleshooting steps?