Why does each GPU use more than 15 GB of memory when running the DeepSeek 14B INT4 model on 4 x 16 GB Intel ARC 770 GPUs with tensor parallelism (TP) set to 4?
The DeepSeek 14B INT4 model's weights are only around 7-8 GB in total.
Can we tell which arguments affect GPU memory usage?
The memory usage on each GPU is influenced by several factors beyond the model weights. According to the vLLM documentation, the gpu_memory_utilization parameter controls the fraction of each GPU's memory that vLLM pre-allocates for the model executor, which covers model weights, activations, and the KV cache. Because vLLM reserves this fraction up front (the default is 0.9, i.e. about 14.4 GB of a 16 GB card) and fills the remainder with KV cache blocks, seeing ~15 GB in use per GPU is expected behavior, not a leak. CUDA graph capture (enabled by default) and certain quantization methods add further overhead on top of the weights.
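As a sketch, the relevant flags can be set at launch time. The flags below are real vLLM server options; the model name is an assumption for illustration (check the exact checkpoint you are serving):

```shell
# Illustrative launch (model ID is a placeholder, not confirmed by the thread).
# --gpu-memory-utilization caps the fraction of each GPU vLLM pre-allocates;
# --enforce-eager skips CUDA graph capture, trading some speed for memory.
vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-14B \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.85 \
    --enforce-eager
```

Lowering --gpu-memory-utilization directly shrinks the reported per-GPU usage, since less memory is reserved for KV cache blocks.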
The tensor_parallel_size argument controls how the weights and KV cache are sharded across GPUs, while max_model_len and max_num_seqs determine how much KV cache must be reserved: a longer maximum context or more concurrent sequences means a larger cache. If gpu_memory_utilization is set too high, the engine may hit out-of-memory errors because too little headroom remains for activations and other runtime allocations. Tuning these parameters together is the main lever for managing memory.
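To see why max_model_len matters so much, here is a rough sizing sketch. The layer/head counts are illustrative values for a Qwen2-style 14B architecture and the KV cache is assumed to be FP16; check the config.json of the actual checkpoint before relying on the numbers:

```python
def kv_cache_bytes_per_gpu(num_layers, num_kv_heads, head_dim,
                           max_model_len, tp_size, dtype_bytes=2):
    """Bytes of KV cache needed per GPU for ONE full-length sequence.

    Each token stores a key and a value vector (factor of 2) in every
    layer; tensor parallelism splits the heads across tp_size GPUs.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return per_token * max_model_len // tp_size

# Assumed, illustrative model shape: 48 layers, 8 KV heads (GQA), head_dim 128.
gib = kv_cache_bytes_per_gpu(num_layers=48, num_kv_heads=8, head_dim=128,
                             max_model_len=32768, tp_size=4) / 2**30
print(f"{gib:.2f} GiB per GPU for one 32k-token sequence")  # → 1.50 GiB
```

Multiply by the number of concurrent sequences (bounded by max_num_seqs) and it becomes clear why vLLM pre-allocates most of the card: the cache, not the INT4 weights, dominates the budget.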
Would you like more detail on how to adjust these parameters for optimal memory usage?