Why vLLM uses a lot of CPU memory

I am running Llama 70B FP8; the entire model and the inference run fit on the GPU, but I still see around 100 GB of CPU RAM usage. Why does vLLM use CPU memory even when inference is only running on the GPU?

vLLM uses CPU memory for several reasons, even when inference itself runs entirely on the GPU. One primary reason is the CPU swap space reserved for KV-cache management. vLLM sets aside a chunk of host RAM, controlled by the swap_space setting (4 GiB per GPU by default), so that the KV-cache blocks of preempted requests can be swapped out of GPU memory and brought back in later. This lets the scheduler relieve GPU memory pressure without discarding in-flight requests, at the cost of a fixed up-front CPU allocation.
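
As a rough illustration, this swap space is configurable when the engine is constructed. The sketch below assumes a recent vLLM release where the LLM constructor accepts swap_space (in GiB per GPU); the model name and the tensor_parallel_size value are placeholders, not taken from the question above.

```python
from vllm import LLM, SamplingParams

# Reserve 2 GiB of CPU swap space per GPU instead of the default 4 GiB.
# This is host RAM vLLM sets aside up front for KV-cache blocks of
# preempted requests, so shrinking it shrinks CPU memory usage.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder model name
    tensor_parallel_size=4,                     # assumption: 4 GPUs for a 70B model
    swap_space=2,                               # GiB of CPU swap space per GPU
)

outputs = llm.generate(
    ["Why does vLLM reserve CPU memory?"],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```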

Additionally, vLLM uses CPU memory for model loading (weights are typically read into host buffers before being copied to the GPU), input preprocessing, and KV-cache bookkeeping. CPU usage is also influenced by the cpu_offload_gb parameter, which, when set above its default of 0, deliberately moves some model weights to CPU RAM to reduce GPU memory usage, as described in the vLLM documentation.
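
Conversely, if the goal is to trade CPU RAM for GPU memory on purpose, cpu_offload_gb can be raised above zero. A minimal sketch, assuming the parameter is available in your vLLM version and using a placeholder model name:

```python
from vllm import LLM

# Offload up to 20 GiB of model weights to CPU RAM. This lowers GPU
# memory pressure at the cost of extra host memory and weight transfers
# during each forward pass; leave it at 0 to keep all weights on the GPU.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder model name
    cpu_offload_gb=20,
)
```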

Would you like more detail on how to configure these settings to optimize memory usage?
