Is it possible to offload the KV cache to DRAM or NVMe, and if yes, how?
Yes, it is possible to offload KV cache to DRAM (CPU memory) in vLLM, either with built-in options or with external tools like LMCache. Two built-in CLI arguments are relevant, and they do different things: --swap-space sets the CPU swap space (in GiB, per GPU) that vLLM uses for KV blocks swapped out during preemption, while --cpu-offload-gb offloads part of the model weights to CPU RAM, which indirectly frees VRAM for KV cache. For example, --cpu-offload-gb 10 moves up to 10 GiB per GPU to CPU, effectively treating a 24 GiB GPU as a 34 GiB one. Disk/NVMe offloading is not natively supported in vLLM core as of now, but can be achieved using LMCache or by implementing a custom KV connector (e.g., SharedStorageConnector) that saves KV cache to disk or NVMe. See the CLI documentation and the LMCache example for details.
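To make the flag usage concrete, here is a minimal sketch that assembles a `vllm serve` command line with both options. The model name and GiB values are illustrative; the flag names follow the vLLM CLI documentation.

```python
# Sketch: build an argv list for launching a vLLM server with CPU
# offloading options. Values (10 GiB offload, 4 GiB swap) are examples.
import shlex

def build_serve_cmd(model: str, cpu_offload_gb: int = 0,
                    swap_space_gb: int = 4) -> list[str]:
    """Assemble a `vllm serve` command with CPU-offload flags."""
    cmd = ["vllm", "serve", model]
    if cpu_offload_gb > 0:
        # Offloads part of the model weights to CPU RAM per GPU,
        # virtually enlarging GPU memory.
        cmd += ["--cpu-offload-gb", str(cpu_offload_gb)]
    # CPU swap space per GPU, used for KV blocks swapped out on preemption.
    cmd += ["--swap-space", str(swap_space_gb)]
    return cmd

print(shlex.join(build_serve_cmd("meta-llama/Llama-3.1-8B-Instruct",
                                 cpu_offload_gb=10)))
```

Running the printed command requires a vLLM install and a GPU; the builder itself is just plain Python.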
For disk/NVMe offloading, LMCache provides a solution by acting as a KV cache backend (local disk or a remote cache server), allowing vLLM to store and retrieve KV cache from disk or NVMe. You need to install LMCache, configure its storage backend, and set the appropriate environment variables and kv_transfer_config in vLLM so that it uses the LMCacheConnectorV1. See the LMCache integration guide for step-by-step instructions.
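As a sketch of that wiring, the snippet below builds the environment variables and the kv_transfer_config JSON. The variable names follow the LMCache integration guide, but treat the exact keys, paths, and size budgets as assumptions to verify against your LMCache version.

```python
# Sketch: configuration for disk-backed KV cache via LMCache.
# Paths and sizes are illustrative, not recommendations.
import json
import os

def lmcache_env(chunk_size: int = 256,
                disk_path: str = "/mnt/nvme/lmcache/") -> dict[str, str]:
    """Environment variables LMCache reads for local-disk storage."""
    return {
        "LMCACHE_CHUNK_SIZE": str(chunk_size),        # tokens per KV chunk
        "LMCACHE_LOCAL_DISK": f"file://{disk_path}",  # NVMe-backed directory
        "LMCACHE_MAX_LOCAL_DISK_SIZE": "50",          # GiB budget on disk
    }

def kv_transfer_config() -> str:
    """JSON for vLLM's --kv-transfer-config selecting the LMCache connector."""
    return json.dumps({"kv_connector": "LMCacheConnectorV1",
                       "kv_role": "kv_both"})

os.environ.update(lmcache_env())
print(kv_transfer_config())
```

The printed JSON would be passed as `--kv-transfer-config '<json>'` when launching vLLM, with the environment variables exported in the same shell.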
Would you like a detailed step-by-step example for either DRAM or NVMe offloading?
Hint: Mention @RunLLM in the post for followups.
Will offloading the KV cache to DRAM free more VRAM and allow more simultaneous requests?
That depends on your model and hardware characteristics. Offloading will actually slow things down quite a bit because memory oversubscription is not free even on tightly coupled devices like the GH200.
@RunLLM
Hi, I would like to analyse the difference between CPU offloading without LMCache and with LMCache. Are there any examples of these cases in vLLM?
Yes, there are official examples in vLLM that compare CPU offloading with and without LMCache. The script cpu_offload_lmcache.py demonstrates both cases: running with only vLLM’s built-in CPU offloading and running with LMCache-enabled offloading. You can run the script without LMCache to see baseline CPU offloading, and then with LMCache enabled (using the --enable-lmcache flag) to observe the performance difference. See the LMCache example documentation and the example script for details.
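For the comparison itself, a simple wall-clock harness like the following can time the two configurations. This is a generic sketch, not part of the vLLM example script: you would pass in callables that send identical request batches to each server.

```python
# Generic timing harness for A/B-comparing the baseline (built-in CPU
# offloading) run against the LMCache-enabled run.
import statistics
import time

def bench(run, repeats: int = 3) -> float:
    """Median wall-clock seconds over `repeats` invocations of run()."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

def compare(baseline, with_lmcache, repeats: int = 3) -> float:
    """Speedup of the LMCache run relative to baseline (>1 means faster)."""
    return bench(baseline, repeats) / bench(with_lmcache, repeats)
```

Using the median rather than the mean keeps a single slow warm-up run from skewing the result; for a fair comparison, send the same prompts in the same order to both servers.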
Would you like a step-by-step breakdown or code snippet for running these examples?