Support for (sparse) key value caching

RunLLM · May 3, 2025, 8:02am

You are correct that vLLM currently supports FP8 quantization for the KV cache (selected with `--kv-cache-dtype fp8`), which roughly halves KV cache memory relative to a 16-bit cache, but it does not yet provide a production-ready, general-purpose sparse KV caching feature like H2O or FastGen. There are ongoing discussions and some prototype work on partial KV cache eviction and more advanced cache management, but these features are not yet available in mainline vLLM (source, source).
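To put a number on what 8-bit KV cache quantization buys, here is a back-of-the-envelope sketch. The model shape below is a hypothetical example, not taken from any specific checkpoint, and the estimate ignores per-tensor scaling factors and allocator overhead.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_elem):
    """Estimate raw KV cache size in bytes for a single sequence."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem


# Hypothetical model shape, assumed purely for illustration.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128, num_tokens=32_768)

fp16 = kv_cache_bytes(**shape, bytes_per_elem=2)  # 16-bit cache
fp8 = kv_cache_bytes(**shape, bytes_per_elem=1)   # 8-bit cache

print(f"FP16: {fp16 / 2**30:.2f} GiB")  # FP16: 4.00 GiB
print(f"FP8:  {fp8 / 2**30:.2f} GiB")   # FP8:  2.00 GiB
print(f"ratio: {fp16 / fp8:.1f}x")      # ratio: 2.0x
```

Going from 16-bit to 8-bit elements gives about a 2x reduction under these assumptions; larger savings would need fewer bits per element or fewer cached tokens.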

If you need more aggressive memory savings than quantization provides, such as caching only the most important tokens, vLLM does not currently offer this out of the box. Microsoft's MInference library, as you noted, focuses on its own methods and does not provide a general sparse KV cache for vLLM. Would you like more detail on the current state of the quantized KV cache in vLLM or on the status of sparse KV cache development?
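For context on what "caching only the most important tokens" means, here is a toy sketch of a heavy-hitter style eviction policy, loosely in the spirit of methods like H2O. It is not a vLLM API and not the H2O authors' implementation; the function name, parameters, and scores are all hypothetical and chosen for illustration.

```python
def select_tokens_to_keep(accumulated_scores, budget, recent_window):
    """Return sorted indices of cached tokens to keep under a fixed budget.

    accumulated_scores[i] is the total attention mass token i has received
    so far (assumed to be tracked elsewhere during decoding).
    """
    n = len(accumulated_scores)
    if budget >= n:
        return list(range(n))  # everything fits, nothing to evict

    # Always keep the most recent tokens, capped by the budget.
    recent_window = min(recent_window, budget)
    recent = list(range(n - recent_window, n))

    # Spend the remaining budget on the highest-scoring older tokens.
    older = range(n - recent_window)
    num_heavy = budget - recent_window
    heavy = sorted(older, key=lambda i: accumulated_scores[i], reverse=True)[:num_heavy]

    return sorted(heavy + recent)


# Made-up accumulated attention scores for 8 cached tokens.
scores = [5.0, 0.1, 0.2, 3.0, 0.1, 0.4, 0.3, 0.2]
keep = select_tokens_to_keep(scores, budget=4, recent_window=2)
print(keep)  # [0, 3, 6, 7]
```

With a budget of 4, the policy keeps the two most recent tokens plus the two older tokens with the largest accumulated scores, and the KV entries for the rest would be evicted. Doing this inside a paged KV cache is the part that mainline vLLM does not provide.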
