You are correct that vLLM currently supports a quantized KV cache (e.g., FP8), which cuts KV cache memory roughly in half relative to a 16-bit (FP16/BF16) cache, but it does not yet provide a production-ready, general-purpose sparse KV caching feature like H2O or FastGen. There are ongoing discussions and some prototype work on partial KV cache eviction and more advanced cache management, but these features are not yet available in mainline vLLM (source, source).
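To make the memory saving concrete, here is a rough back-of-the-envelope sketch of KV cache size at 16-bit versus 8-bit precision. The model shape (32 layers, 8 KV heads, head dimension 128) and the 8192-token context are hypothetical values chosen for illustration, and the helper `kv_cache_bytes` is not a vLLM function. It also ignores the small overhead of quantization scaling factors and vLLM's block-based allocation.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, num_tokens, bytes_per_element):
    """Approximate KV cache size in bytes for one sequence.

    The leading factor of 2 accounts for storing both a key and a value
    tensor in every layer.
    """
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_element


# Hypothetical model shape and context length (assumptions, not from any real config).
NUM_LAYERS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128
NUM_TOKENS = 8192

fp16_bytes = kv_cache_bytes(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, NUM_TOKENS, 2)  # 2 bytes per element
fp8_bytes = kv_cache_bytes(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, NUM_TOKENS, 1)   # 1 byte per element

GIB = 1024 ** 3
print(f"FP16 KV cache: {fp16_bytes / GIB:.2f} GiB")  # FP16 KV cache: 1.00 GiB
print(f"FP8 KV cache:  {fp8_bytes / GIB:.2f} GiB")   # FP8 KV cache:  0.50 GiB
print(f"Reduction:     {fp16_bytes / fp8_bytes:.1f}x")  # Reduction:     2.0x
```

In vLLM itself, the FP8 KV cache is selected with the `kv_cache_dtype` option (for example `--kv-cache-dtype fp8` on the command line); the arithmetic above only estimates the effect and does not call vLLM.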
If you need more aggressive memory savings than quantization provides, such as caching only the most important tokens, vLLM does not currently offer this out of the box. Microsoft's MInference library, as you noted, focuses on its own methods and does not provide a general-purpose sparse KV cache for vLLM. Would you like more detail on the current state of the quantized KV cache in vLLM or on the status of sparse KV cache development?
Sources: