I’m running into a vLLM v1 limitation: for very long contexts (e.g. 128k tokens), the KV cache alone is ~20 GB, plus ~14 GB model weights and 2–3 GB runtime overhead, which forces me to reserve ~37 GB of VRAM. vLLM v1 requires pre-allocating a GPU KV pool sized for max_model_len via gpu_memory_utilization, and once reserved, that memory is never released even if I enable LMCache offloading to CPU. Unlike Transformers’ OffloadedCache, there’s no way to bypass the “must fit one request entirely on GPU” capacity check. CPU offload (cpu_offload_gb) tanks performance badly because it randomly assigns layers, and KV quantization is not available in v1. I also need to run vLLM as a Python backend (not serve) for integration with guardrails. In short: how can I reduce GPU memory reservation for long contexts in vLLM v1 without losing throughput, given that gpu_memory_utilization creates a non-reclaimable pool, offload is too slow, and quantization isn’t supported?
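For context, the ~20 GB figure comes from back-of-envelope arithmetic like the sketch below; the model dimensions are assumptions (a Llama-style ~7–8B model with GQA), not exact values from my setup:

```python
# Back-of-envelope KV-cache sizing (dimensions are assumptions for a
# Llama-style ~7-8B model with GQA; substitute your model's config values).
num_layers   = 32        # transformer blocks (assumed)
num_kv_heads = 8         # grouped-query KV heads (assumed)
head_dim     = 128       # per-head dimension (assumed)
dtype_bytes  = 2         # fp16/bf16
context_len  = 128_000   # tokens

# 2x for keys and values, accumulated across all layers and KV heads
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
kv_cache_gib = bytes_per_token * context_len / 1024**3
print(f"KV cache for one {context_len:,}-token request: ~{kv_cache_gib:.1f} GiB")
# ~15.6 GiB with these numbers; a model with more layers or KV heads
# lands near the ~20 GB I'm seeing in practice.
```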
Currently, vLLM v1 always pre-allocates a GPU KV cache pool sized for at least one max_model_len request, and this memory is not released even if you enable LMCache or CPU offload; there is no supported way to bypass the “must fit one request entirely on GPU” check or to dynamically reclaim KV cache memory in v1. CPU offload (--cpu-offload-gb) can reduce VRAM usage but significantly degrades throughput, and KV cache quantization is not yet available in v1. These limitations are documented and discussed in the vLLM repository and its issue tracker, and there is no official workaround for your scenario other than reducing max_model_len or using quantized models if/when support lands in v1. See the related GitHub issue and the conserving-memory docs.
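To make the constraint concrete, here is a rough model of the startup capacity check. The VRAM figures are illustrative (a hypothetical 48 GB GPU), and the formula is a simplification of what the engine actually computes:

```python
# Simplified model of v1's startup capacity check (numbers are illustrative;
# the KV/weights/overhead figures are taken from the question above).
total_vram_gb          = 48.0   # hypothetical GPU
gpu_memory_utilization = 0.90
weights_gb             = 14.0
runtime_overhead_gb    = 3.0
kv_per_request_gb      = 20.0   # one full 128k-token request

# Everything vLLM may claim, minus weights and overhead, becomes the KV pool.
kv_pool_gb = total_vram_gb * gpu_memory_utilization - weights_gb - runtime_overhead_gb
print(f"KV pool: {kv_pool_gb:.1f} GB")

# v1 requires the pool to hold at least one max_model_len request.
if kv_pool_gb < kv_per_request_gb:
    print("v1 would refuse to start: one max_model_len request does not fit")
else:
    print("capacity check passes; the pool stays reserved for the engine's lifetime")
```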
If you need to run vLLM as a Python backend and cannot use serve, the same memory constraints apply. The only supported ways to reduce the VRAM reservation are to lower max_model_len, decrease gpu_memory_utilization, or use tensor parallelism to split the model (and its KV cache) across multiple GPUs, but each of these trades off context length, KV-cache headroom, or extra hardware. There is no v1 feature to dynamically shrink the KV pool or to offload only the KV cache without a performance penalty. Would you like a step-by-step breakdown of these options or more detail on upcoming roadmap features that may address this?
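For completeness, a minimal sketch of those knobs through the Python API; the model name and all values are placeholders, not recommendations:

```python
from vllm import LLM, SamplingParams

# Minimal sketch of the supported knobs via the Python backend.
# Model name and values are placeholders.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    max_model_len=32_768,          # smaller window => smaller mandatory KV reservation
    gpu_memory_utilization=0.80,   # fraction of VRAM vLLM may claim for weights + KV pool
    tensor_parallel_size=2,        # split weights and KV cache across 2 GPUs
)

params = SamplingParams(max_tokens=64)
print(llm.generate(["ping"], params)[0].outputs[0].text)
```

Lowering max_model_len is the lever that most directly shrinks the mandatory reservation, since the pool must be able to hold at least one full-length request.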