vLLM v1 forces me to pre-allocate a huge, non-reclaimable GPU KV cache for long contexts, and none of the current offload or quantization options solve the resulting VRAM bloat without crippling speed.

I’m running into a vLLM v1 limitation: for very long contexts (e.g. 128k tokens), the KV cache alone is ~20 GB, plus ~14 GB model weights and 2–3 GB runtime overhead, which forces me to reserve ~37 GB of VRAM. vLLM v1 requires pre-allocating a GPU KV pool sized for max_model_len via gpu_memory_utilization, and once reserved, that memory is never released even if I enable LMCache offloading to CPU. Unlike Transformers’ OffloadedCache, there’s no way to bypass the “must fit one request entirely on GPU” capacity check. CPU offload (cpu_offload_gb) tanks performance badly because it randomly assigns layers, and KV quantization is not available in v1. I also need to run vLLM as a Python backend (not serve) for integration with guardrails. In short: how can I reduce GPU memory reservation for long contexts in vLLM v1 without losing throughput, given that gpu_memory_utilization creates a non-reclaimable pool, offload is too slow, and quantization isn’t supported?
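For reference, the rough KV-cache arithmetic behind the ~20 GB figure (a back-of-envelope sketch assuming a Mistral-7B / Llama-3-8B-class architecture with GQA; the exact size depends on the model's layer count, KV heads, and dtype):

```python
# Back-of-envelope KV cache size for a single 128k-token request.
# Assumed (illustrative) architecture: 32 layers, 8 KV heads, head_dim 128,
# fp16/bf16 KV cache (2 bytes per element). Adjust for the actual model.
num_layers = 32
num_kv_heads = 8
head_dim = 128
dtype_bytes = 2
seq_len = 128 * 1024

# K and V per layer: 2 * num_kv_heads * head_dim elements per token.
bytes_per_token = num_layers * 2 * num_kv_heads * head_dim * dtype_bytes
kv_gib = bytes_per_token * seq_len / 1024**3
print(f"{bytes_per_token // 1024} KiB per token -> ~{kv_gib:.0f} GiB for one 128k request")
# -> 128 KiB per token, ~16 GiB of raw KV data; block rounding and headroom
# push the reserved pool up to roughly the ~20 GB observed above.
```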

Currently, vLLM v1 always pre-allocates a GPU KV cache pool sized for at least one max_model_len request, and this memory is not released even if you enable LMCache or CPU offload; there is no supported way to bypass the “must fit one request entirely on GPU” check or dynamically reclaim KV cache memory in v1. CPU offload (--cpu-offload-gb) can reduce VRAM usage but significantly degrades throughput, and KV cache quantization is not available in v1 as of now. These limitations are documented and discussed in detail in the vLLM repo and issue tracker, with no official workaround for your scenario other than reducing max_model_len or using quantized models if/when support lands in v1; see the related GitHub issues and the conserving-memory page of the vLLM docs.

If you need to run vLLM as a Python backend and cannot use serve, the same memory constraints apply. The only supported ways to reduce VRAM reservation are to lower max_model_len, decrease gpu_memory_utilization, or use tensor parallelism to split the model across multiple GPUs, but all require trade-offs in context length or hardware. There is no v1 feature to dynamically shrink the KV pool or offload only the KV cache without performance loss. Would you like a step-by-step breakdown of these options or more detail on upcoming roadmap features that may address this?
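In case it helps, this is what those knobs look like through the Python LLM API (a minimal sketch; the model name and values are placeholders, and cpu_offload_gb is included only to show where the slow weight-offload option lives):

```python
from vllm import LLM

# Minimal sketch of the supported memory knobs (values are illustrative).
llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.3",  # placeholder model
    max_model_len=32_768,          # shorter context -> smaller KV pool
    gpu_memory_utilization=0.85,   # fraction of VRAM vLLM may reserve
    tensor_parallel_size=2,        # split weights and KV across 2 GPUs
    # cpu_offload_gb=8,            # offloads part of the weights to CPU; slow
)
outputs = llm.generate(["Hello, world"])
print(outputs[0].outputs[0].text)
```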

Currently, vLLM allocates GPU memory at startup based on the gpu_memory_utilization parameter and does not adjust this value after KV cache offloading. Once KV blocks are moved to CPU/local RAM, the GPU-side buffers remain reserved in the memory pool, so the freed capacity cannot be reused.
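A quick way to observe the static reservation from outside vLLM (plain CUDA memory queries; the model name is a placeholder):

```python
import torch
from vllm import LLM

def free_vram_gib() -> float:
    """Free device memory as reported by the CUDA driver."""
    free_bytes, _total = torch.cuda.mem_get_info()
    return free_bytes / 1024**3

print(f"free before engine start: {free_vram_gib():.1f} GiB")
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.3",  # placeholder
          gpu_memory_utilization=0.9)
print(f"free after engine start:  {free_vram_gib():.1f} GiB")
# The gap stays constant for the lifetime of the process; offloading KV
# blocks (e.g. via LMCache) does not bring the free number back up.
```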

It should be technically possible to modify vLLM’s backend so that:

  • GPU KV buffers are explicitly freed after offload (calling del on the GPU tensors, followed by torch.cuda.empty_cache() or equivalent CUDA free APIs); or, alternatively,
  • memory metrics and gpu_memory_utilization are recomputed dynamically by querying actual GPU usage (torch.cuda.memory_allocated() vs. total memory), so accurate utilization is reported at runtime instead of keeping the initial static value; and
  • the executor/allocator pool is shrunk in response, making the newly available GPU memory usable for subsequent allocations.

This approach would let us reclaim significant GPU memory deterministically during long-running sessions with heavy KV offload. The only caveat is that fragmentation in CUDA’s allocator might prevent 100% reclamation; in the worst case, a process restart remains the only way to fully reset GPU usage.
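To make the mechanism concrete, this is the generic PyTorch pattern I have in mind; it is a sketch of the mechanism only, not of vLLM’s actual internals, and gpu_kv_tensors is a stand-in for whatever structure the worker really uses to hold its KV buffers:

```python
import torch

def reclaim_kv_buffers(gpu_kv_tensors: list[torch.Tensor]) -> None:
    """Hypothetical reclamation step, run after KV blocks were offloaded."""
    # Drop all Python references so the caching allocator can mark the
    # underlying blocks as free (any other live reference would pin them)...
    gpu_kv_tensors.clear()
    # ...then hand the cached-but-unused blocks back to the CUDA driver.
    torch.cuda.empty_cache()

    # Fragmentation check: memory_reserved() can stay well above
    # memory_allocated() if the allocator cannot coalesce the freed blocks.
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    print(f"allocated: {allocated:.1f} GiB, still reserved: {reserved:.1f} GiB")
```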

Your understanding is correct: vLLM v1 does not currently reclaim or shrink the pre-allocated GPU KV cache pool after offloading KV blocks to CPU; the reserved GPU memory remains allocated and unavailable for other uses, even if the actual KV data is offloaded. There is no built-in mechanism to explicitly free or shrink the GPU-side KV buffers or to dynamically recompute gpu_memory_utilization at runtime—these features are not supported in vLLM v1 as of now. See the related issue and feature request in the vLLM GitHub tracker.

While your proposed approach (explicitly deleting GPU tensors, calling torch.cuda.empty_cache(), and shrinking the allocator pool) is technically feasible, it is not implemented in vLLM v1. The only way to fully reclaim GPU memory is to restart the process. Fragmentation in CUDA’s allocator may also prevent full reclamation, as you noted. Would you like a step-by-step explanation of how such a feature could be implemented, or more detail on current workarounds?

Could you point me to the files that would need to change, so that I can change them myself on a vLLM fork?

To implement dynamic GPU KV cache reclamation or shrinking in vLLM v1, you would need to modify the core memory allocation and KV cache management logic. The most relevant files are:

  • vllm/config/cache.py: Defines the cache configuration, including gpu_memory_utilization and related parameters.
  • The memory allocation and KV cache pool logic is handled in the engine and worker code, especially:
    • vllm/v1/engine/core.py (see _initialize_kv_caches)
    • vllm/v1/core/kv_cache_utils.py (see get_kv_cache_configs, check_enough_kv_cache_memory)
    • vllm/v1/worker/gpu_worker.py and vllm/v1/worker/gpu_model_runner.py (for the actual KV tensor allocation and management in v1)
  • You may also need to adjust the allocator logic and any code that tracks or reports memory usage, such as the executor or scheduler.

There is currently no built-in API to shrink or free the KV pool at runtime, so you would need to add logic to explicitly delete GPU KV tensors and call torch.cuda.empty_cache() after offloading, and then update the allocator’s bookkeeping. Would you like a more detailed step-by-step on which functions to patch?
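As a rough starting point, the change could take the shape below. Everything here is a hypothetical sketch: the class, the kv_caches attribute, and the method name are illustrative stand-ins and will not match the real v1 worker code line-for-line.

```python
import gc
import torch

class GPUWorkerPatch:
    """Hypothetical mixin/patch for the worker that owns the KV tensors."""

    def __init__(self) -> None:
        # Stand-in for the per-layer KV tensors allocated during
        # KV cache initialization (_initialize_kv_caches and friends).
        self.kv_caches: list[torch.Tensor] = []

    def shrink_kv_cache(self) -> int:
        """Free the GPU-side KV buffers after their blocks were offloaded.

        Returns the number of bytes handed back to the CUDA driver.
        """
        before = torch.cuda.memory_reserved()
        self.kv_caches = []        # drop references to the GPU tensors
        gc.collect()               # make sure nothing else still pins them
        torch.cuda.empty_cache()   # return cached blocks to the driver
        after = torch.cuda.memory_reserved()
        # The block manager / scheduler bookkeeping (num_gpu_blocks etc.)
        # must be shrunk as well, or the engine will keep scheduling onto
        # blocks that no longer have backing memory.
        return before - after
```

The harder part is the last comment: the allocator bookkeeping in the scheduler and KV cache manager has to agree with the new pool size, which is where most of the work in kv_cache_utils.py and the engine core would go.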
