How to understand OOM and foresee memory usage

I see this is a well-discussed topic, but I still need some clarification. Here’s the context:

  • Nvidia RTX 5090 (32 GB VRAM)
  • Model: cyankiwi/Qwen3.6-27B-AWQ-INT4

The problem is that even with a short context of 16k I get an OOM, which surprises me quite a lot. Up to this point the model occupied roughly 24 GB, so there was still plenty of VRAM:

    (EngineCore pid=258) INFO 04-24 10:21:03 [default_loader.py:384] Loading weights took 0.65 seconds
    (EngineCore pid=258) INFO 04-24 10:21:03 [eagle.py:1377] Detected MTP model. Sharing target model embedding weights with the draft model.
    (EngineCore pid=258) INFO 04-24 10:21:03 [eagle.py:1433] Detected MTP model. Sharing target model lm_head weights with the draft model.
    (EngineCore pid=258) INFO 04-24 10:21:03 [gpu_model_runner.py:4820] Model loading took 18.47 GiB memory and 10.990096 seconds
    (EngineCore pid=258) INFO 04-24 10:21:10 [backends.py:1051] Using cache directory: /root/.cache/vllm/torch_compile_cache/c7557d5774/rank_0_0/backbone for vLLM's torch.compile
    (EngineCore pid=258) INFO 04-24 10:21:10 [backends.py:1111] Dynamo bytecode transform time: 6.25 s
    (EngineCore pid=258) INFO 04-24 10:21:12 [backends.py:372] Cache the graph of compile range (1, 2048) for later use
    (EngineCore pid=258) INFO 04-24 10:21:30 [backends.py:390] Compiling a graph for compile range (1, 2048) takes 19.40 s
    (EngineCore pid=258) INFO 04-24 10:21:32 [decorators.py:655] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/55e8fc5c9dc7a6e3388bf81c9bad3c08a99a5b72ec4b8e7bd39253b3baf3d2a0/rank_0_0/model
    (EngineCore pid=258) INFO 04-24 10:21:32 [monitor.py:48] torch.compile took 28.20 s in total
    [...]
    (EngineCore pid=258) INFO 04-24 10:22:26 [monitor.py:76] Initial profiling/warmup run took 54.05 s
    (EngineCore pid=258) INFO 04-24 10:22:26 [backends.py:1051] Using cache directory: /root/.cache/vllm/torch_compile_cache/c7557d5774/rank_0_0/eagle_head for vLLM's torch.compile
    (EngineCore pid=258) INFO 04-24 10:22:26 [backends.py:1111] Dynamo bytecode transform time: 0.28 s
    (EngineCore pid=258) INFO 04-24 10:22:31 [backends.py:390] Compiling a graph for compile range (1, 2048) takes 5.17 s
    (EngineCore pid=258) INFO 04-24 10:22:31 [decorators.py:655] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/43a9ec07c0db6ac1a1547137937e53c148a5dbc5a085a257bbc2c9d53c5f84a1/rank_0_0/model
    (EngineCore pid=258) INFO 04-24 10:22:31 [monitor.py:48] torch.compile took 5.52 s in total
    (EngineCore pid=258) INFO 04-24 10:22:32 [monitor.py:76] Initial profiling/warmup run took 0.38 s
    (EngineCore pid=258) WARNING 04-24 10:22:36 [kv_cache_utils.py:1059] Add 3 padding layers, may waste at most 6.25% KV cache memory
    (EngineCore pid=258) INFO 04-24 10:22:36 [kv_cache_utils.py:829] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=512
    (EngineCore pid=258) WARNING 04-24 10:22:36 [gpu_model_runner.py:6363] CUDAGraphMode.FULL_AND_PIECEWISE is not supported with spec-decode for attention backend FlashInferBackend (support: AttentionCGSupport.UNIFORM_SINGLE_TOKEN_DECODE); setting cudagraph_mode=PIECEWISE

Then I get an OOM when allocating the KV cache:

    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108] EngineCore failed to start.
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108] Traceback (most recent call last):
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1082, in run_engine_core
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]     return func(*args, **kwargs)
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 848, in __init__
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]     super().__init__(
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 124, in __init__
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]     kv_cache_config = self._initialize_kv_caches(vllm_config)
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]     return func(*args, **kwargs)
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 247, in _initialize_kv_caches
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]     available_gpu_memory = self.model_executor.determine_available_memory()
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 136, in determine_available_memory
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]     return self.collective_rpc("determine_available_memory")
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 80, in collective_rpc
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]     result = run_method(self.driver_worker, method, args, kwargs)
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/serial_utils.py", line 510, in run_method
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]     return func(*args, **kwargs)
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]     return func(*args, **kwargs)
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 381, in determine_available_memory
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]     cudagraph_memory_estimate = self.model_runner.profile_cudagraph_memory()
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]     return func(*args, **kwargs)
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5864, in profile_cudagraph_memory
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]     self._init_minimal_kv_cache_for_profiling()
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5815, in _init_minimal_kv_cache_for_profiling
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]     self.initialize_kv_cache(minimal_config)
    (EngineCore pid=258) Process EngineCore:
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 6797, in initialize_kv_cache
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]     kv_caches = self.initialize_kv_cache_tensors(
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 6718, in initialize_kv_cache_tensors
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]     kv_cache_raw_tensors = self._allocate_kv_cache_tensors(kv_cache_config)
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 6529, in _allocate_kv_cache_tensors
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]     tensor = torch.zeros(
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]              ^^^^^^^^^^^^
    (EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.56 GiB. GPU 0 has a total capacity of 31.36 GiB of which 1.37 GiB is free. Including non-PyTorch memory, this process has 29.98 GiB memory in use. Of the allocated memory 24.95 GiB is allocated by PyTorch, and 758.67 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  ( https://pytorch.org/docs/stable/notes/cuda.html#environment-variables )

These are the parameters I have used to serve the model (docker image “openai:latest”):

        --gpu-memory-utilization 0.8
        --enable-prefix-caching
        --enable-chunked-prefill
        --language-model-only
        --max-model-len 16384
        --kv-cache-dtype fp8
        --enable-auto-tool-choice
        --tool-call-parser qwen3_coder
        --reasoning-parser qwen3
        --speculative-config '{"method": "mtp","num_speculative_tokens":2}'

Any hint?

Your OOM is likely due to vLLM’s memory profiling and CUDA graph capture, which reserve extra VRAM beyond the model weights, especially with speculative decoding and prefix caching enabled. Even with roughly 24 GB already in use, the remaining VRAM is consumed by activations, CUDA graphs, and the KV cache, which can be substantial for a 16k context and a 27B model. The error shows only 1.37 GiB free when allocating a 1.56 GiB KV cache block, so the allocation fails despite the apparent headroom. Lowering --gpu-memory-utilization or disabling speculative decoding may help; also, try reducing --max-model-len or disabling prefix caching to reclaim memory. See this discussion and the conserving memory docs for more details.
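For a rough sense of scale, a back-of-the-envelope KV cache estimate looks like the sketch below. The layer and head counts are placeholder assumptions, not the actual Qwen3.6-27B-AWQ-INT4 configuration; substitute the values from the model’s config.json before drawing conclusions.

    # Back-of-the-envelope KV cache sizing. The model dimensions below are
    # assumed placeholders, not the real model's config values.
    num_layers = 48        # assumed transformer layer count
    num_kv_heads = 8       # assumed KV heads (GQA)
    head_dim = 128         # assumed head dimension
    kv_dtype_bytes = 1     # fp8 KV cache (--kv-cache-dtype fp8)
    max_model_len = 16384  # --max-model-len

    # Per token: K and V (factor 2) for every layer and every KV head.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * kv_dtype_bytes
    per_seq_gib = bytes_per_token * max_model_len / 1024**3
    print(f"{bytes_per_token} B/token, ~{per_seq_gib:.2f} GiB for one full-length sequence")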

If you want to maximize available memory for KV cache, try --gpu-memory-utilization 0.95 and/or temporarily disable speculative decoding and prefix caching to see if the model starts. If not, reduce --max-model-len further until it fits. The strict upfront memory check in vLLM V1 means all these factors (including CUDA graphs) are reserved before serving begins, which can surprise users with large models and moderate context lengths.
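As a sketch of how that upfront budget plays out on your card: weights, activation workspace, CUDA graphs, the MTP draft model, and the KV cache all have to fit inside total VRAM × --gpu-memory-utilization. Only the weight figure below comes from your log; the profiling/graph overhead is an assumed ballpark, not a measured number.

    # Rough budget under --gpu-memory-utilization 0.8 on a 31.36 GiB GPU.
    total_vram_gib = 31.36        # capacity reported in the OOM message
    gpu_memory_utilization = 0.8  # --gpu-memory-utilization
    weights_gib = 18.47           # "Model loading took 18.47 GiB memory"
    overhead_gib = 5.0            # ASSUMED: activations + CUDA graphs + MTP draft model

    budget_gib = total_vram_gib * gpu_memory_utilization
    kv_budget_gib = budget_gib - weights_gib - overhead_gib
    print(f"budget {budget_gib:.2f} GiB, roughly {kv_budget_gib:.2f} GiB left for the KV cache")

If the overhead is anywhere near that guess, a 16k KV cache is already a tight fit, which is consistent with the failure you see.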

Would you like a step-by-step breakdown of how vLLM calculates available memory and why these features impact VRAM so much?

@RunLLM

It seems to me that torch.compile is completing fine (see the line “(EngineCore pid=258) INFO 04-24 10:22:31 [monitor.py:48] torch.compile took 5.52 s in total”; it probably had a cache to use).

It’s not clear whether I need to lower it even more (it was at 80%) or whether you suggest raising it to 95%. The model length is 16k, which is quite low already, and on top of that the model is quantized.

Even with quantization and 16k context, vLLM’s memory profiling reserves extra VRAM for CUDA graphs, activations, and the KV cache, which can cause OOM at 80% utilization. Sometimes, increasing --gpu-memory-utilization (e.g., to 0.95) allows more VRAM for the KV cache, but if fragmentation or other allocations exist, lowering it (e.g., to 0.7) may help. The optimal value depends on your system’s actual free VRAM and fragmentation at allocation time, not just the model size or quantization. See this issue and memory utilization docs for details.
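One way to take the guesswork out of that number is to check what is actually free on the idle GPU right before launching the server, for example:

    import torch

    # Report free vs. total VRAM on GPU 0 before starting vLLM, so
    # --gpu-memory-utilization can be chosen against the real headroom
    # (display buffers and other processes all eat into it).
    free_b, total_b = torch.cuda.mem_get_info(0)
    print(f"free: {free_b / 1024**3:.2f} GiB / total: {total_b / 1024**3:.2f} GiB")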

Would you like a step-by-step explanation of how to tune gpu_memory_utilization and interpret vLLM’s memory logs?

@RunLLM From the log I have posted can you state if the issue is in the KV cache or in the CUDA graph creation? Is there a way to exclude one or the other from the equation?

Based on your log, the OOM occurs during KV cache allocation: the failure is in self._allocate_kv_cache_tensors(kv_cache_config), which calls torch.zeros and fails because there is not enough free VRAM. Note that this call is reached from profile_cudagraph_memory, which sets up a minimal KV cache while estimating CUDA graph memory, so graph capture itself has not started yet; the torch.compile and warmup phases completed earlier. See this analysis.
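The numbers in the OOM message themselves tell most of the story; the breakdown below only re-uses values copied from the traceback above:

    # All figures copied from the torch.OutOfMemoryError message above.
    total_gib    = 31.36           # total capacity of GPU 0
    free_gib     = 1.37            # free at the moment of the allocation
    in_use_gib   = 29.98           # used by this process, incl. non-PyTorch memory
    pt_alloc_gib = 24.95           # allocated by PyTorch
    pt_reserved_unused_gib = 758.67 / 1024  # reserved by PyTorch but unallocated

    # Memory the process holds outside PyTorch's caching allocator
    # (CUDA context, attention-backend workspaces, and similar).
    non_pytorch_gib = in_use_gib - pt_alloc_gib - pt_reserved_unused_gib
    requested_gib = 1.56
    print(f"non-PyTorch overhead: ~{non_pytorch_gib:.2f} GiB")
    print(f"shortfall for the KV cache tensor: ~{requested_gib - free_gib:.2f} GiB")

So the allocation misses by only a couple hundred MiB, which is why small changes to --gpu-memory-utilization or to the enabled features can flip the result.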

To isolate the issue, you can run with --enforce-eager to disable CUDA graphs; if OOM still occurs, the problem is purely KV cache size. If it succeeds, CUDA graph memory overhead is the main factor. This approach is described in the vLLM troubleshooting docs.
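If it is easier, the same isolation test can be run from a Python shell with vLLM’s offline API, which makes it quick to toggle one knob at a time. This is only a sketch: it mirrors your flags but deliberately drops the speculative config and the tool/reasoning parsers.

    # Minimal offline-API sketch of the --enforce-eager test. Speculative
    # decoding and the tool/reasoning parsers are intentionally omitted here.
    from vllm import LLM

    llm = LLM(
        model="cyankiwi/Qwen3.6-27B-AWQ-INT4",
        max_model_len=16384,
        gpu_memory_utilization=0.8,
        kv_cache_dtype="fp8",
        enforce_eager=True,  # disables CUDA graphs; if startup now succeeds,
                             # graph memory was the deciding factor
    )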

Would you like a step-by-step guide on how to test this?
