Is it possible to offload the KV cache to DRAM or NVMe, and if yes, how?
Yes, it is possible to offload KV cache to DRAM (CPU memory) in vLLM, either with built-in options or with external tools like LMCache. Two built-in CLI arguments are relevant, and they do different things: --swap-space sets the CPU swap space (in GiB, per GPU) that vLLM uses for KV blocks swapped out during preemption, while --cpu-offload-gb offloads part of the model weights to CPU RAM, which indirectly frees VRAM for KV cache. For example, --cpu-offload-gb 10 moves up to 10 GiB per GPU to CPU, effectively treating a 24 GiB GPU as a 34 GiB one. Disk/NVMe offloading is not natively supported in vLLM core as of now, but can be achieved using LMCache or by implementing a custom KV connector (e.g., SharedStorageConnector) that saves KV cache to disk or NVMe. See the CLI documentation and the LMCache example for details.
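To make the flag usage concrete, here is a minimal sketch that assembles a `vllm serve` command line with both options. The model name and GiB values are illustrative; the flag names follow the vLLM CLI documentation.

```python
# Sketch: build an argv list for launching a vLLM server with CPU
# offloading options. Values (10 GiB offload, 4 GiB swap) are examples.
import shlex

def build_serve_cmd(model: str, cpu_offload_gb: int = 0,
                    swap_space_gb: int = 4) -> list[str]:
    """Assemble a `vllm serve` command with CPU-offload flags."""
    cmd = ["vllm", "serve", model]
    if cpu_offload_gb > 0:
        # Offloads part of the model weights to CPU RAM per GPU,
        # virtually enlarging GPU memory.
        cmd += ["--cpu-offload-gb", str(cpu_offload_gb)]
    # CPU swap space per GPU, used for KV blocks swapped out on preemption.
    cmd += ["--swap-space", str(swap_space_gb)]
    return cmd

print(shlex.join(build_serve_cmd("meta-llama/Llama-3.1-8B-Instruct",
                                 cpu_offload_gb=10)))
```

Running the printed command requires a vLLM install and a GPU; the builder itself is just plain Python.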
For disk/NVMe offloading, LMCache provides a solution by acting as a KV cache backend (local disk or a remote cache server), allowing vLLM to store and retrieve KV cache from disk or NVMe. You need to install LMCache, configure its storage backend, and set the appropriate environment variables and kv_transfer_config in vLLM so that it uses the LMCacheConnectorV1. See the LMCache integration guide for step-by-step instructions.
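As a sketch of that wiring, the snippet below builds the environment variables and the kv_transfer_config JSON. The variable names follow the LMCache integration guide, but treat the exact keys, paths, and size budgets as assumptions to verify against your LMCache version.

```python
# Sketch: configuration for disk-backed KV cache via LMCache.
# Paths and sizes are illustrative, not recommendations.
import json
import os

def lmcache_env(chunk_size: int = 256,
                disk_path: str = "/mnt/nvme/lmcache/") -> dict[str, str]:
    """Environment variables LMCache reads for local-disk storage."""
    return {
        "LMCACHE_CHUNK_SIZE": str(chunk_size),        # tokens per KV chunk
        "LMCACHE_LOCAL_DISK": f"file://{disk_path}",  # NVMe-backed directory
        "LMCACHE_MAX_LOCAL_DISK_SIZE": "50",          # GiB budget on disk
    }

def kv_transfer_config() -> str:
    """JSON for vLLM's --kv-transfer-config selecting the LMCache connector."""
    return json.dumps({"kv_connector": "LMCacheConnectorV1",
                       "kv_role": "kv_both"})

os.environ.update(lmcache_env())
print(kv_transfer_config())
```

The printed JSON would be passed as `--kv-transfer-config '<json>'` when launching vLLM, with the environment variables exported in the same shell.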
Would you like a detailed step-by-step example for either DRAM or NVMe offloading?
Hint: Mention @RunLLM in the post for followups.
Will offloading the KV cache to DRAM free more VRAM and allow more simultaneous requests?
That depends on your model and hardware characteristics. Offloading will actually slow things down quite a bit because memory oversubscription is not free even on tightly coupled devices like the GH200.
@RunLLM
Hi, I would like to analyse the difference between CPU offloading without LMCache and with LMCache. Are there any examples of these cases in vLLM?
Yes, there are official examples in vLLM that compare CPU offloading with and without LMCache. The script cpu_offload_lmcache.py demonstrates both cases: running with only vLLM’s built-in CPU offloading and running with LMCache-enabled offloading. You can run the script without LMCache to see baseline CPU offloading, and then with LMCache enabled (using the --enable-lmcache flag) to observe the performance difference. See the LMCache example documentation and the example script for details.
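For the comparison itself, a simple wall-clock harness like the following can time the two configurations. This is a generic sketch, not part of the vLLM example script: you would pass in callables that send identical request batches to each server.

```python
# Generic timing harness for A/B-comparing the baseline (built-in CPU
# offloading) run against the LMCache-enabled run.
import statistics
import time

def bench(run, repeats: int = 3) -> float:
    """Median wall-clock seconds over `repeats` invocations of run()."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

def compare(baseline, with_lmcache, repeats: int = 3) -> float:
    """Speedup of the LMCache run relative to baseline (>1 means faster)."""
    return bench(baseline, repeats) / bench(with_lmcache, repeats)
```

Using the median rather than the mean keeps a single slow warm-up run from skewing the result; for a fair comparison, send the same prompts in the same order to both servers.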
Would you like a step-by-step breakdown or code snippet for running these examples?