Disable KV cache storage in vLLM

I have a language model and I only need it to output 1 token, so the KV cache is not actually useful in my case. Is there any way to avoid caching KV during inference in vLLM?

No, vLLM always computes and stores the KV cache, even when `max_tokens=1`. There is currently no official configuration option or flag in vLLM to disable the KV cache for inference, and setting `use_cache=False` (as in Hugging Face) is not supported. Disabling or bypassing the KV cache would require custom code modifications or monkey-patching, which is not natively supported by vLLM's APIs or configuration options and is not recommended for production use. See the discussion and code references in the linked issues.
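
While the cache itself cannot be turned off, its memory footprint can be kept small for single-token generation by capping the context length and the memory pool. Below is a minimal sketch; the model name and the specific values for `max_model_len` and `gpu_memory_utilization` are placeholders you would tune for your setup.

```python
from vllm import LLM, SamplingParams

# The KV cache is still allocated, but a short max_model_len and a lower
# gpu_memory_utilization shrink the cache pool vLLM reserves at startup.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model, substitute your own
    max_model_len=2048,                        # cap sequence length -> smaller KV-cache allocation
    gpu_memory_utilization=0.5,                # reserve less GPU memory for the cache pool
)

# Only the first generated token is needed for this use case.
params = SamplingParams(max_tokens=1)
outputs = llm.generate(["Classify the sentiment of: great movie!"], params)
print(outputs[0].outputs[0].text)
```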

Would you like more detail or code-level guidance on possible workarounds?
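
As a workaround outside vLLM, the Hugging Face path mentioned above does let you skip the cache entirely: a single forward pass with `use_cache=False` returns the next-token logits without storing any `past_key_values`. This is a sketch, not vLLM functionality, and the model name is an example placeholder.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

inputs = tok("Classify the sentiment of: great movie!", return_tensors="pt")
with torch.no_grad():
    # One forward pass, no KV cache kept: use_cache=False disables past_key_values.
    logits = model(**inputs, use_cache=False).logits

# Greedily pick the single output token from the last position's logits.
next_token_id = logits[0, -1].argmax().item()
print(tok.decode(next_token_id))
```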
