When I run vLLM with `--enforce-eager`, I expect lower VRAM usage, and I indeed no longer see the log message about the memory CUDA graphs use (0.53 GiB in my case). But this doesn't lead to a larger KV cache: the cache size is exactly the same in both cases (241,920 tokens). Is this the expected behaviour? Why don't we get more space for the KV cache, given the GPU memory that's been freed by not using CUDA graphs?
Yes, this is expected behavior. Disabling CUDA graphs with `--enforce-eager` does reduce VRAM usage because no memory is allocated for graph capture, but vLLM sizes the KV cache from a profiling run that happens before CUDA graphs are captured. The memory freed by skipping CUDA graphs is therefore never handed back to the KV cache; the allocation is fixed at initialization time based on the profiling results, not on the free memory actually available afterwards. This is confirmed in the documentation and discussions, which note that CUDA graph memory is not included in the initial profiling and that the KV cache size is not recomputed after initialization (source, source).
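The sizing logic, reduced to its essentials, looks roughly like the sketch below. This is a simplified illustration under my own assumptions, not vLLM's actual implementation; every parameter name and default here is made up for the example:

```python
def kv_cache_tokens(
    total_gpu_bytes: float,
    gpu_memory_utilization: float,
    weights_bytes: float,
    peak_activation_bytes: float,   # measured during the profiling forward pass
    num_layers: int,
    num_kv_heads: int,
    head_size: int,
    dtype_bytes: int = 2,           # fp16 / bf16
    block_size: int = 16,           # tokens per KV block
) -> int:
    # Overall memory budget vLLM allows itself to use.
    budget = total_gpu_bytes * gpu_memory_utilization

    # Whatever remains after the weights and the profiled activation peak
    # is given to the KV cache. CUDA graph memory is captured *after* this
    # point, so it is not subtracted here -- which is why --enforce-eager
    # does not change the result.
    kv_budget = budget - weights_bytes - peak_activation_bytes

    # Bytes needed to store K and V for one token across all layers.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_size * dtype_bytes
    num_blocks = int(kv_budget // (bytes_per_token * block_size))
    return num_blocks * block_size
```

Because the CUDA graph pool is carved out after this calculation, skipping graph capture leaves the computed block count untouched.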
If you want to maximize the KV cache size after disabling CUDA graphs, you have to manually lower other memory-consuming parameters (such as `max_model_len` or `max_num_seqs`) and possibly increase `gpu_memory_utilization`; vLLM does not automatically redirect the memory saved by disabling CUDA graphs to the KV cache (source). Would you like a step-by-step breakdown of the memory allocation process?