Why does each GPU use more than 15 GB of memory when running the DeepSeek 14B INT4 model on 4 x 16 GB Intel ARC 770 GPUs with tensor parallelism (TP) set to 4?
The DeepSeek 14B INT4 model's weights are only around 7-8 GB in total.
Can we tell which arguments affect GPU memory usage?
The memory usage on each GPU is influenced by several factors beyond the model weights. According to the vLLM documentation, the gpu_memory_utilization parameter controls the fraction of each GPU's memory that vLLM pre-allocates for the model executor, which covers model weights, activations, and the KV cache. Because vLLM reserves this fraction up front (the default is 0.9, i.e. about 14.4 GB of a 16 GB card) and fills the remainder with KV cache blocks, seeing ~15 GB in use per GPU is expected behavior, not a leak. CUDA graph capture (enabled by default) and certain quantization methods add further overhead on top of the weights.
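As a sketch, the relevant flags can be set at launch time. The flags below are real vLLM server options; the model name is an assumption for illustration (check the exact checkpoint you are serving):

```shell
# Illustrative launch (model ID is a placeholder, not confirmed by the thread).
# --gpu-memory-utilization caps the fraction of each GPU vLLM pre-allocates;
# --enforce-eager skips CUDA graph capture, trading some speed for memory.
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-14B \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.85 \
    --enforce-eager
```

Lowering --gpu-memory-utilization directly shrinks the reported per-GPU usage, since less memory is reserved for KV cache blocks.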
The tensor_parallel_size argument controls how the weights and KV cache are sharded across GPUs, while max_model_len and max_num_seqs determine how much KV cache must be reserved: a longer maximum context or more concurrent sequences means a larger cache. If gpu_memory_utilization is set too high, the engine may hit out-of-memory errors because too little headroom remains for activations and other runtime allocations. Tuning these parameters together is the main lever for managing memory.
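To see why max_model_len matters so much, here is a rough sizing sketch. The layer/head counts are illustrative values for a Qwen2-style 14B architecture and the KV cache is assumed to be FP16; check the config.json of the actual checkpoint before relying on the numbers:

```python
def kv_cache_bytes_per_gpu(num_layers, num_kv_heads, head_dim,
                           max_model_len, tp_size, dtype_bytes=2):
    """Bytes of KV cache needed per GPU for ONE full-length sequence.

    Each token stores a key and a value vector (factor of 2) in every
    layer; tensor parallelism splits the heads across tp_size GPUs.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return per_token * max_model_len // tp_size

# Assumed, illustrative model shape: 48 layers, 8 KV heads (GQA), head_dim 128.
gib = kv_cache_bytes_per_gpu(num_layers=48, num_kv_heads=8, head_dim=128,
                             max_model_len=32768, tp_size=4) / 2**30
print(f"{gib:.2f} GiB per GPU for one 32k-token sequence")  # → 1.50 GiB
```

Multiply by the number of concurrent sequences (bounded by max_num_seqs) and it becomes clear why vLLM pre-allocates most of the card: the cache, not the INT4 weights, dominates the budget.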
Would you like more detail on how to adjust these parameters for optimal memory usage?