When I run vLLM with `--enforce-eager`, I expect lower VRAM usage, and I indeed no longer see the log message about the memory CUDA graphs use (0.53 GiB in my case). But this doesn't lead to a larger KV cache: the cache size is exactly the same in both cases (241,920 tokens). Is this the expected behaviour? Why don't we get more space for the KV cache, given the GPU memory that's been freed by not using CUDA graphs?
Yes, this is expected behavior. Disabling CUDA graphs with `--enforce-eager` does reduce VRAM usage because no memory is allocated for graph capture, but vLLM sizes the KV cache from a profiling run that happens before CUDA graphs are captured. The memory freed by skipping CUDA graphs is therefore never handed back to the KV cache; the allocation is fixed at initialization time based on the profiling results, not on the free memory actually available afterwards. This is confirmed in the documentation and discussions, which note that CUDA graph memory is not included in the initial profiling and that the KV cache size is not recomputed after initialization (source, source).
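The sizing logic, reduced to its essentials, looks roughly like the sketch below. This is a simplified illustration under my own assumptions, not vLLM's actual implementation; every parameter name and default here is made up for the example:

```python
def kv_cache_tokens(
    total_gpu_bytes: float,
    gpu_memory_utilization: float,
    weights_bytes: float,
    peak_activation_bytes: float,   # measured during the profiling forward pass
    num_layers: int,
    num_kv_heads: int,
    head_size: int,
    dtype_bytes: int = 2,           # fp16 / bf16
    block_size: int = 16,           # tokens per KV block
) -> int:
    # Overall memory budget vLLM allows itself to use.
    budget = total_gpu_bytes * gpu_memory_utilization

    # Whatever remains after the weights and the profiled activation peak
    # is given to the KV cache. CUDA graph memory is captured *after* this
    # point, so it is not subtracted here -- which is why --enforce-eager
    # does not change the result.
    kv_budget = budget - weights_bytes - peak_activation_bytes

    # Bytes needed to store K and V for one token across all layers.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_size * dtype_bytes
    num_blocks = int(kv_budget // (bytes_per_token * block_size))
    return num_blocks * block_size
```

Because the CUDA graph pool is carved out after this calculation, skipping graph capture leaves the computed block count untouched.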
If you want to maximize the KV cache size after disabling CUDA graphs, you have to manually lower other memory-consuming parameters (such as `max_model_len` or `max_num_seqs`) and possibly increase `gpu_memory_utilization`; vLLM does not automatically redirect the memory saved by disabling CUDA graphs to the KV cache (source). Would you like a step-by-step breakdown of the memory allocation process?