I want to implement heterogeneous inference across the CPU and GPU. This may require the CPU and GPU to collaborate on completing the inference of a single request, which involves transferring the KV cache between a CPU instance and a GPU instance. How can I transfer the KV cache between a CPU instance and a GPU instance? Is there an API in vLLM that can do this, or do I need to implement it myself?
There is currently no built-in vLLM API that directly transfers the KV cache between a CPU instance and a GPU instance for heterogeneous inference. vLLM's existing KV cache transfer mechanisms (such as disaggregated prefilling and connectors like PyNcclConnector or LMCacheConnector) are designed for GPU-to-GPU or distributed GPU scenarios, not for CPU-GPU collaboration or KV cache migration between CPU and GPU instances. According to the official docs, the KV transfer design document, and related issues, CPU-GPU KV cache transfer is not explicitly supported, so you would need to implement it yourself.
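If you do implement it yourself, the core host-to-device movement can be done with plain PyTorch tensor copies. Below is a minimal sketch, not a vLLM API: the function name `transfer_kv_cache` is hypothetical, and it assumes the cache is held as one `(key, value)` tensor pair per attention layer (the tensor shapes are illustrative, not vLLM's actual layout).

```python
import torch

def transfer_kv_cache(kv_cache_cpu, device):
    """Hypothetical helper: copy per-layer (key, value) tensors to `device`.

    Assumes kv_cache_cpu is a list of (K, V) CPU tensor pairs, one per layer.
    """
    moved = []
    for k, v in kv_cache_cpu:
        if device.type == "cuda":
            # Pinned (page-locked) host memory allows asynchronous
            # host-to-device copies that can overlap with GPU compute.
            k = k.pin_memory().to(device, non_blocking=True)
            v = v.pin_memory().to(device, non_blocking=True)
        else:
            k, v = k.to(device), v.to(device)
        moved.append((k, v))
    return moved

# Example: 2 layers, 4 tokens, 8 heads, head_dim 64 (shapes are illustrative).
cpu_cache = [(torch.randn(4, 8, 64), torch.randn(4, 8, 64)) for _ in range(2)]
target = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dev_cache = transfer_kv_cache(cpu_cache, target)
```

For a real heterogeneous setup you would also need to serialize these tensors across process boundaries (e.g. over NCCL, shared memory, or a network transport), since the CPU and GPU instances typically run as separate processes.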
Would you like more detail on how the current KV cache transfer works or guidance on implementing your own solution?
Sources: