After skimming through the recent code, I think the preemption-by-swapping option is no longer used. I am wondering why we don't use a KV connector to offload the KV cache blocks when a request is preempted (instead of recomputing, we would save them to LMCache's lower storage layer). The documentation says that recomputing is faster than swapping, but is there a benchmark that backs this up?
In vLLM V1, preemption by recomputation is the default and preferred strategy because it has lower overhead than swapping (offloading the KV cache to CPU or other storage). Swapping is only used for multi-sequence requests (e.g., beam search), where recomputation is not supported. The documentation and code confirm that recomputation is generally more efficient and that swapping is deprecated for most cases. This rests on benchmarks and design discussions showing that recomputation overhead is never higher than 20% of swapping's latency, and that for small block sizes recomputation is much faster because swapping has to move many small blocks. See the discussion and benchmarks in this issue and the optimization guide.
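To make the block-size argument concrete, here is a rough back-of-envelope sketch, not vLLM code. It assumes swapping pays a fixed overhead per block copy on top of the raw transfer time, while recomputation cost depends only on the number of tokens. The names `KVShape`, `swap_seconds`, and `recompute_seconds`, and every parameter, are hypothetical; plug in numbers measured on your own hardware.

```python
from dataclasses import dataclass


@dataclass
class KVShape:
    """Hypothetical description of a model's KV cache layout."""
    num_layers: int
    num_kv_heads: int
    head_dim: int
    dtype_bytes: int = 2  # e.g. fp16/bf16

    def bytes_per_token(self) -> int:
        # The factor 2 covers both keys and values.
        return 2 * self.num_layers * self.num_kv_heads * self.head_dim * self.dtype_bytes


def swap_seconds(shape: KVShape, num_tokens: int, block_size: int,
                 link_bytes_per_s: float, per_copy_overhead_s: float) -> float:
    """Estimated time to swap a request's KV cache out and back in.

    Assumption: each block is copied separately and pays a fixed
    per-copy overhead, so smaller blocks mean more copies.
    """
    num_blocks = -(-num_tokens // block_size)  # ceiling division
    payload = num_blocks * block_size * shape.bytes_per_token()
    one_way = payload / link_bytes_per_s + num_blocks * per_copy_overhead_s
    return 2 * one_way  # swap out, then swap in


def recompute_seconds(num_tokens: int, prefill_tokens_per_s: float) -> float:
    """Estimated time to recompute the KV cache with one prefill pass."""
    return num_tokens / prefill_tokens_per_s


if __name__ == "__main__":
    # Placeholder numbers for illustration only, not measurements.
    shape = KVShape(num_layers=32, num_kv_heads=8, head_dim=128)
    for block_size in (1, 16, 256):
        t = swap_seconds(shape, 2048, block_size, 16e9, 10e-6)
        print(f"block_size={block_size}: swap ~ {t:.4f} s")
    print(f"recompute ~ {recompute_seconds(2048, 10_000):.4f} s")
```

In this model the swap estimate grows as the block size shrinks, while the recompute estimate does not change. Which side wins for a given setup depends entirely on the measured link bandwidth, per-copy overhead, and prefill throughput.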
If you want to use a KV connector (like LMCache) for offloading, it is possible, but the current vLLM design intentionally favors recomputation for performance reasons. Swapping/offloading is being re-evaluated for future extensibility, but for now, recomputation is the recommended and benchmarked approach for preemption in V1.
Would you like more detail on the benchmarks or implementation?