I want to implement heterogeneous inference across the CPU and GPU. This may require the CPU and GPU to collaborate on completing the inference of a single request, which involves transferring the KV cache between a CPU instance and a GPU instance. How can I transfer the KV cache between a CPU instance and a GPU instance? Is there an API in vLLM that can do this, or do I need to implement it myself?
There is currently no built-in vLLM API that directly transfers the KV cache between a CPU instance and a GPU instance for heterogeneous inference. vLLM's existing KV cache transfer mechanisms (such as disaggregated prefilling and connectors like PyNcclConnector or LMCacheConnector) are designed for GPU-to-GPU or distributed GPU scenarios, not for CPU-GPU collaboration or KV cache migration between CPU and GPU instances. According to the official docs, the KV transfer design document, and related issues, CPU-GPU KV cache transfer is not explicitly supported, so you would need to implement it yourself.
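If you do implement it yourself, the core host-to-device movement can be done with plain PyTorch tensor copies. Below is a minimal sketch, not a vLLM API: the function name `transfer_kv_cache` is hypothetical, and it assumes the cache is held as one `(key, value)` tensor pair per attention layer (the tensor shapes are illustrative, not vLLM's actual layout).

```python
import torch

def transfer_kv_cache(kv_cache_cpu, device):
    """Hypothetical helper: copy per-layer (key, value) tensors to `device`.

    Assumes kv_cache_cpu is a list of (K, V) CPU tensor pairs, one per layer.
    """
    moved = []
    for k, v in kv_cache_cpu:
        if device.type == "cuda":
            # Pinned (page-locked) host memory allows asynchronous
            # host-to-device copies that can overlap with GPU compute.
            k = k.pin_memory().to(device, non_blocking=True)
            v = v.pin_memory().to(device, non_blocking=True)
        else:
            k, v = k.to(device), v.to(device)
        moved.append((k, v))
    return moved

# Example: 2 layers, 4 tokens, 8 heads, head_dim 64 (shapes are illustrative).
cpu_cache = [(torch.randn(4, 8, 64), torch.randn(4, 8, 64)) for _ in range(2)]
target = torch.device("cuda" if torch.cuda.is_available() else "cpu")
dev_cache = transfer_kv_cache(cpu_cache, target)
```

For a real heterogeneous setup you would also need to serialize these tensors across process boundaries (e.g. over NCCL, shared memory, or a network transport), since the CPU and GPU instances typically run as separate processes.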
Would you like more detail on how the current KV cache transfer works or guidance on implementing your own solution?
Sources: