Why vLLM uses a lot of CPU memory

I am running Llama 70B FP8; the entire model and the inference run fit on the GPU, but I still see around 100 GB of CPU RAM usage. Why does vLLM use CPU memory even when inference is only running on the GPU?

vLLM uses CPU memory for several reasons, even when inference itself runs entirely on the GPU. One primary reason is the CPU swap space reserved for KV-cache management. vLLM sets aside a chunk of host RAM, controlled by the swap_space setting (4 GiB per GPU by default), so that the KV-cache blocks of preempted requests can be swapped out of GPU memory and brought back in later. This lets the scheduler relieve GPU memory pressure without discarding in-flight requests, at the cost of a fixed up-front CPU allocation.
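
As a rough illustration, this swap space is configurable when the engine is constructed. The sketch below assumes a recent vLLM release where the LLM constructor accepts swap_space (in GiB per GPU); the model name and the tensor_parallel_size value are placeholders, not taken from the question above.

```python
from vllm import LLM, SamplingParams

# Reserve 2 GiB of CPU swap space per GPU instead of the default 4 GiB.
# This is host RAM vLLM sets aside up front for KV-cache blocks of
# preempted requests, so shrinking it shrinks CPU memory usage.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder model name
    tensor_parallel_size=4,                     # assumption: 4 GPUs for a 70B model
    swap_space=2,                               # GiB of CPU swap space per GPU
)

outputs = llm.generate(
    ["Why does vLLM reserve CPU memory?"],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```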

Additionally, vLLM uses CPU memory for model loading (weights are typically read into host buffers before being copied to the GPU), input preprocessing, and KV-cache bookkeeping. CPU usage is also influenced by the cpu_offload_gb parameter, which, when set above its default of 0, deliberately moves some model weights to CPU RAM to reduce GPU memory usage, as described in the vLLM documentation.
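
Conversely, if the goal is to trade CPU RAM for GPU memory on purpose, cpu_offload_gb can be raised above zero. A minimal sketch, assuming the parameter is available in your vLLM version and using a placeholder model name:

```python
from vllm import LLM

# Offload up to 20 GiB of model weights to CPU RAM. This lowers GPU
# memory pressure at the cost of extra host memory and weight transfers
# during each forward pass; leave it at 0 to keep all weights on the GPU.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder model name
    cpu_offload_gb=20,
)
```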

Would you like more detail on how to configure these settings to optimize memory usage?
