vLLM v1 forces me to pre-allocate a huge, non-reclaimable GPU KV cache for long contexts, and none of the current offload or quantization options solve the resulting VRAM bloat without crippling speed.

I’m running into a vLLM v1 limitation: for very long contexts (e.g. 128k tokens), the KV cache alone is ~20 GB, plus ~14 GB model weights and 2–3 GB runtime overhead, which forces me to reserve ~37 GB of VRAM. vLLM v1 requires pre-allocating a GPU KV pool sized for max_model_len via gpu_memory_utilization, and once reserved, that memory is never released even if I enable LMCache offloading to CPU. Unlike Transformers’ OffloadedCache, there’s no way to bypass the “must fit one request entirely on GPU” capacity check. CPU offload (cpu_offload_gb) tanks performance badly because it randomly assigns layers, and KV quantization is not available in v1. I also need to run vLLM as a Python backend (not serve) for integration with guardrails. In short: how can I reduce GPU memory reservation for long contexts in vLLM v1 without losing throughput, given that gpu_memory_utilization creates a non-reclaimable pool, offload is too slow, and quantization isn’t supported?
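For reference, the rough KV-cache arithmetic behind the ~20 GB figure (a back-of-envelope sketch assuming a Mistral-7B / Llama-3-8B-class architecture with GQA; the exact size depends on the model's layer count, KV heads, and dtype):

```python
# Back-of-envelope KV cache size for a single 128k-token request.
# Assumed (illustrative) architecture: 32 layers, 8 KV heads, head_dim 128,
# fp16/bf16 KV cache (2 bytes per element). Adjust for the actual model.
num_layers = 32
num_kv_heads = 8
head_dim = 128
dtype_bytes = 2
seq_len = 128 * 1024

# K and V per layer: 2 * num_kv_heads * head_dim elements per token.
bytes_per_token = num_layers * 2 * num_kv_heads * head_dim * dtype_bytes
kv_gib = bytes_per_token * seq_len / 1024**3
print(f"{bytes_per_token // 1024} KiB per token -> ~{kv_gib:.0f} GiB for one 128k request")
# -> 128 KiB per token, ~16 GiB of raw KV data; block rounding and headroom
# push the reserved pool up to roughly the ~20 GB observed above.
```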

Currently, vLLM v1 always pre-allocates a GPU KV cache pool sized for at least one max_model_len request, and this memory is not released even if you enable LMCache or CPU offload; there is no supported way to bypass the “must fit one request entirely on GPU” check or dynamically reclaim KV cache memory in v1. CPU offload (--cpu-offload-gb) can reduce VRAM usage but significantly degrades throughput, and KV cache quantization is not available in v1 as of now. These limitations are documented and discussed in detail in the vLLM repo and issue tracker, with no official workaround for your scenario other than reducing max_model_len or using quantized models if/when support lands in v1; see the related GitHub issues and the conserving-memory page of the vLLM docs.

If you need to run vLLM as a Python backend and cannot use serve, the same memory constraints apply. The only supported ways to reduce VRAM reservation are to lower max_model_len, decrease gpu_memory_utilization, or use tensor parallelism to split the model across multiple GPUs, but all require trade-offs in context length or hardware. There is no v1 feature to dynamically shrink the KV pool or offload only the KV cache without performance loss. Would you like a step-by-step breakdown of these options or more detail on upcoming roadmap features that may address this?
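In case it helps, this is what those knobs look like through the Python LLM API (a minimal sketch; the model name and values are placeholders, and cpu_offload_gb is included only to show where the slow weight-offload option lives):

```python
from vllm import LLM

# Minimal sketch of the supported memory knobs (values are illustrative).
llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.3",  # placeholder model
    max_model_len=32_768,          # shorter context -> smaller KV pool
    gpu_memory_utilization=0.85,   # fraction of VRAM vLLM may reserve
    tensor_parallel_size=2,        # split weights and KV across 2 GPUs
    # cpu_offload_gb=8,            # offloads part of the weights to CPU; slow
)
outputs = llm.generate(["Hello, world"])
print(outputs[0].outputs[0].text)
```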

Currently, vLLM allocates GPU memory at startup based on the gpu_memory_utilization parameter and does not adjust this value after KV cache offloading. Once KV blocks are moved to CPU/local RAM, the GPU-side buffers remain reserved in the memory pool, so the freed capacity cannot be reused.
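A quick way to observe the static reservation from outside vLLM (plain CUDA memory queries; the model name is a placeholder):

```python
import torch
from vllm import LLM

def free_vram_gib() -> float:
    """Free device memory as reported by the CUDA driver."""
    free_bytes, _total = torch.cuda.mem_get_info()
    return free_bytes / 1024**3

print(f"free before engine start: {free_vram_gib():.1f} GiB")
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.3",  # placeholder
          gpu_memory_utilization=0.9)
print(f"free after engine start:  {free_vram_gib():.1f} GiB")
# The gap stays constant for the lifetime of the process; offloading KV
# blocks (e.g. via LMCache) does not bring the free number back up.
```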

It should be technically possible to modify vLLM’s backend so that:

  • GPU KV buffers are explicitly freed after offload (calling del on the GPU tensors, followed by torch.cuda.empty_cache() or equivalent CUDA free APIs); or, alternatively,
  • memory metrics and gpu_memory_utilization are recomputed dynamically by querying actual GPU usage (torch.cuda.memory_allocated() vs. total memory), so accurate utilization is reported at runtime instead of keeping the initial static value; and
  • the executor/allocator pool is shrunk in response, making the newly available GPU memory usable for subsequent allocations.

This approach would let us reclaim significant GPU memory deterministically during long-running sessions with heavy KV offload. The only caveat is that fragmentation in CUDA’s allocator might prevent 100% reclamation; in the worst case, a process restart remains the only way to fully reset GPU usage.
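To make the mechanism concrete, this is the generic PyTorch pattern I have in mind; it is a sketch of the mechanism only, not of vLLM’s actual internals, and gpu_kv_tensors is a stand-in for whatever structure the worker really uses to hold its KV buffers:

```python
import torch

def reclaim_kv_buffers(gpu_kv_tensors: list[torch.Tensor]) -> None:
    """Hypothetical reclamation step, run after KV blocks were offloaded."""
    # Drop all Python references so the caching allocator can mark the
    # underlying blocks as free (any other live reference would pin them)...
    gpu_kv_tensors.clear()
    # ...then hand the cached-but-unused blocks back to the CUDA driver.
    torch.cuda.empty_cache()

    # Fragmentation check: memory_reserved() can stay well above
    # memory_allocated() if the allocator cannot coalesce the freed blocks.
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    print(f"allocated: {allocated:.1f} GiB, still reserved: {reserved:.1f} GiB")
```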

Your understanding is correct: vLLM v1 does not currently reclaim or shrink the pre-allocated GPU KV cache pool after offloading KV blocks to CPU; the reserved GPU memory remains allocated and unavailable for other uses, even if the actual KV data is offloaded. There is no built-in mechanism to explicitly free or shrink the GPU-side KV buffers or to dynamically recompute gpu_memory_utilization at runtime—these features are not supported in vLLM v1 as of now. See the related issue and feature request in the vLLM GitHub tracker.

While your proposed approach (explicitly deleting GPU tensors, calling torch.cuda.empty_cache(), and shrinking the allocator pool) is technically feasible, it is not implemented in vLLM v1. The only way to fully reclaim GPU memory is to restart the process. Fragmentation in CUDA’s allocator may also prevent full reclamation, as you noted. Would you like a step-by-step explanation of how such a feature could be implemented, or more detail on current workarounds?

Could you point me to the files that would need to change, so that I can change them myself on a vLLM fork?

To implement dynamic GPU KV cache reclamation or shrinking in vLLM v1, you would need to modify the core memory allocation and KV cache management logic. The most relevant files are:

  • vllm/config/cache.py: Defines the cache configuration, including gpu_memory_utilization and related parameters.
  • The memory allocation and KV cache pool logic is handled in the engine and worker code, especially:
    • vllm/v1/engine/core.py (see _initialize_kv_caches)
    • vllm/v1/core/kv_cache_utils.py (see get_kv_cache_configs, check_enough_kv_cache_memory)
    • vllm/v1/worker/gpu_worker.py and vllm/v1/worker/gpu_model_runner.py (for the actual KV tensor allocation and management in v1)
  • You may also need to adjust the allocator logic and any code that tracks or reports memory usage, such as the executor or scheduler.

There is currently no built-in API to shrink or free the KV pool at runtime, so you would need to add logic to explicitly delete GPU KV tensors and call torch.cuda.empty_cache() after offloading, and then update the allocator’s bookkeeping. Would you like a more detailed step-by-step on which functions to patch?
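As a rough starting point, the change could take the shape below. Everything here is a hypothetical sketch: the class, the kv_caches attribute, and the method name are illustrative stand-ins and will not match the real v1 worker code line-for-line.

```python
import gc
import torch

class GPUWorkerPatch:
    """Hypothetical mixin/patch for the worker that owns the KV tensors."""

    def __init__(self) -> None:
        # Stand-in for the per-layer KV tensors allocated during
        # KV cache initialization (_initialize_kv_caches and friends).
        self.kv_caches: list[torch.Tensor] = []

    def shrink_kv_cache(self) -> int:
        """Free the GPU-side KV buffers after their blocks were offloaded.

        Returns the number of bytes handed back to the CUDA driver.
        """
        before = torch.cuda.memory_reserved()
        self.kv_caches = []        # drop references to the GPU tensors
        gc.collect()               # make sure nothing else still pins them
        torch.cuda.empty_cache()   # return cached blocks to the driver
        after = torch.cuda.memory_reserved()
        # The block manager / scheduler bookkeeping (num_gpu_blocks etc.)
        # must be shrunk as well, or the engine will keep scheduling onto
        # blocks that no longer have backing memory.
        return before - after
```

The harder part is the last comment: the allocator bookkeeping in the scheduler and KV cache manager has to agree with the new pool size, which is where most of the work in kv_cache_utils.py and the engine core would go.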
