How to understand OOM and foresee memory usage

RunLLM · April 24, 2026, 12:32pm

Based on your log, the OOM occurs during KV cache allocation, specifically in self._allocate_kv_cache_tensors(kv_cache_config), which calls torch.zeros and fails because there is not enough free VRAM. CUDA graph creation and profiling complete before this step, so the error is raised while allocating the KV cache itself, not during CUDA graph capture. See this analysis.
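To get a rough feel for how large that allocation can be, here is a minimal back-of-the-envelope sketch. It assumes a standard dense attention layout (one key and one value vector per layer per KV head) and ignores block rounding, sliding-window or hybrid layers, and any KV cache quantization. The ModelShape class, the helper names, and the example numbers are all hypothetical and are not part of vLLM; take the real values from your model's config.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class ModelShape:
    # Hypothetical example fields; read the real values from the model config.
    num_layers: int
    num_kv_heads: int
    head_dim: int
    dtype_bytes: int = 2  # 2 bytes per element for fp16/bf16


def kv_bytes_per_token(shape: ModelShape) -> int:
    # Factor 2: one key vector and one value vector per layer per KV head.
    return 2 * shape.num_layers * shape.num_kv_heads * shape.head_dim * shape.dtype_bytes


def kv_cache_bytes(shape: ModelShape, num_tokens: int) -> int:
    # Total bytes needed to hold num_tokens tokens in the KV cache.
    return kv_bytes_per_token(shape) * num_tokens


def fits(required_bytes: int, free_bytes: int) -> bool:
    # The allocation can only succeed if this much VRAM is actually free.
    return required_bytes <= free_bytes


# Assumed example shape: 32 layers, 8 KV heads, head dimension 128, bf16.
shape = ModelShape(num_layers=32, num_kv_heads=8, head_dim=128)
per_token = kv_bytes_per_token(shape)      # 131072 bytes (128 KiB) per token
needed = kv_cache_bytes(shape, 32768)      # exactly 4 GiB for 32768 tokens
print(f"{per_token} bytes/token, {needed / GIB:.1f} GiB for 32768 tokens")
```

Comparing an estimate like this against the free VRAM reported just before the failing allocation gives a first hint as to whether the KV cache alone is too large for the device.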

To isolate the issue, run with --enforce-eager to disable CUDA graphs. If the OOM still occurs, the problem is purely KV cache size; if the run succeeds, CUDA graph memory overhead is the main factor. This approach is described in the vLLM troubleshooting docs.
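The isolation test can be sketched as a small helper that builds the command to launch and encodes the decision rule from the paragraph above. The function names and the model name "my-org/my-model" are placeholders made up for illustration; only the vllm serve command and the --enforce-eager flag come from vLLM itself.

```python
import shlex


def build_serve_command(model: str, enforce_eager: bool) -> list[str]:
    # Build a `vllm serve` command line, optionally disabling CUDA graphs.
    cmd = ["vllm", "serve", model]
    if enforce_eager:
        cmd.append("--enforce-eager")
    return cmd


def interpret(oom_with_enforce_eager: bool) -> str:
    # Decision rule: if the OOM persists without CUDA graphs, the KV cache
    # size is the problem; otherwise CUDA graph overhead is the main factor.
    if oom_with_enforce_eager:
        return "KV cache size"
    return "CUDA graph memory overhead"


# "my-org/my-model" is a placeholder, not a real model name.
print(shlex.join(build_serve_command("my-org/my-model", enforce_eager=True)))
```

Run the printed command with the same remaining arguments as your failing run, then pass the outcome to interpret().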

Would you like a step-by-step guide on how to test this?
