Based on your log, the OOM occurs during KV cache allocation: self._allocate_kv_cache_tensors(kv_cache_config) calls torch.zeros, which fails because there is not enough free VRAM. Profiling and CUDA graph creation complete before this step, so the failure is in allocating the KV cache itself, not in CUDA graph capture. See this analysis.
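To get a feel for how much VRAM a KV cache can need, here is a rough back-of-the-envelope sketch. It uses the common estimate of one key and one value vector per layer, per KV head, per token. The function name and the model dimensions are hypothetical placeholders, not values from your log, and this is not how vLLM itself sizes its cache; substitute the numbers from your model's config.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Rough KV cache size estimate in bytes (hypothetical helper, not a vLLM API).

    The factor of 2 covers the separate key and value tensors.
    dtype_bytes=2 assumes a 16-bit cache dtype such as fp16 or bf16.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return num_tokens * per_token


# Assumed example dimensions: 32 layers, 8 KV heads, head_dim 128, 8192 tokens.
example = kv_cache_bytes(num_tokens=8192, num_layers=32, num_kv_heads=8, head_dim=128)
print(f"{example / 2**30:.2f} GiB")  # prints "1.00 GiB"
```

If the estimate for your target context length and batch size is close to or above the VRAM left after loading the weights, the torch.zeros call in the KV cache allocation has little room to succeed.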
To isolate the issue, run with --enforce-eager, which disables CUDA graphs. If the OOM still occurs, the problem is purely the KV cache size; if the run succeeds, CUDA graph memory overhead is the main factor. This approach is described in the vLLM troubleshooting docs.
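If you use the Python API rather than the CLI, the same comparison can be sketched as below. This assumes the enforce_eager keyword of vllm.LLM, which corresponds to the --enforce-eager flag. The helper names engine_kwargs and loads_ok are hypothetical, and the model name is something you supply.

```python
def engine_kwargs(model: str, eager: bool) -> dict:
    """Build keyword arguments for vllm.LLM; only enforce_eager differs between runs."""
    return {"model": model, "enforce_eager": eager}


def loads_ok(model: str, eager: bool) -> bool:
    """Return True if the engine initializes without raising (hypothetical helper)."""
    from vllm import LLM  # imported lazily so engine_kwargs works without vLLM installed

    try:
        LLM(**engine_kwargs(model, eager))
        return True
    except Exception as exc:  # an OOM may surface as different exception types
        print(f"enforce_eager={eager}: failed with {type(exc).__name__}: {exc}")
        return False


# Example (run each setting in a fresh process; replace the placeholder model name):
# loads_ok("your-model-name", eager=True)
```

Run each setting in its own process, since GPU memory from a failed or earlier engine may not be released within the same interpreter and would skew the second result.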
Would you like a step-by-step guide on how to test this?