I know this is a well-discussed topic, but I still need some clarification. Here's the context:
- Nvidia RTX 5090 (32 GB VRAM)
- Model: cyankiwi/Qwen3.6-27B-AWQ-INT4
The problem is that even with a short 16k context I get an OOM, which surprises me quite a lot: up to that point the model occupied roughly 24 GB, so there was still plenty of VRAM.
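For reference, here is my back-of-envelope estimate of what a full 16k sequence of fp8 KV cache should cost (I serve with --kv-cache-dtype fp8, see the flags below). The layer count, KV-head count, and head dimension are assumptions on my part, not values I pulled from the model's config.json, so treat this as a sketch:

# Rough KV-cache sizing. num_layers, num_kv_heads, and head_dim are
# ASSUMED values for a ~27B GQA model -- check config.json for the real ones.
num_layers = 48
num_kv_heads = 8
head_dim = 128
dtype_bytes = 1          # fp8 KV cache
context_len = 16384      # --max-model-len 16384

bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes  # K and V
print(bytes_per_token * context_len / 2**30)  # ~1.5 GiB for one full 16k sequence

Even if my assumed shapes are off by a factor of two, one full-length sequence of cache is small next to 32 GB. Here is the relevant part of the startup log: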
(EngineCore pid=258) INFO 04-24 10:21:03 [default_loader.py:384] Loading weights took 0.65 seconds
(EngineCore pid=258) INFO 04-24 10:21:03 [eagle.py:1377] Detected MTP model. Sharing target model embedding weights with the draft model.
(EngineCore pid=258) INFO 04-24 10:21:03 [eagle.py:1433] Detected MTP model. Sharing target model lm_head weights with the draft model.
(EngineCore pid=258) INFO 04-24 10:21:03 [gpu_model_runner.py:4820] Model loading took 18.47 GiB memory and 10.990096 seconds
(EngineCore pid=258) INFO 04-24 10:21:10 [backends.py:1051] Using cache directory: /root/.cache/vllm/torch_compile_cache/c7557d5774/rank_0_0/backbone for vLLM's torch.compile
(EngineCore pid=258) INFO 04-24 10:21:10 [backends.py:1111] Dynamo bytecode transform time: 6.25 s
(EngineCore pid=258) INFO 04-24 10:21:12 [backends.py:372] Cache the graph of compile range (1, 2048) for later use
(EngineCore pid=258) INFO 04-24 10:21:30 [backends.py:390] Compiling a graph for compile range (1, 2048) takes 19.40 s
(EngineCore pid=258) INFO 04-24 10:21:32 [decorators.py:655] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/55e8fc5c9dc7a6e3388bf81c9bad3c08a99a5b72ec4b8e7bd39253b3baf3d2a0/rank_0_0/model
(EngineCore pid=258) INFO 04-24 10:21:32 [monitor.py:48] torch.compile took 28.20 s in total
[...]
(EngineCore pid=258) INFO 04-24 10:22:26 [monitor.py:76] Initial profiling/warmup run took 54.05 s
(EngineCore pid=258) INFO 04-24 10:22:26 [backends.py:1051] Using cache directory: /root/.cache/vllm/torch_compile_cache/c7557d5774/rank_0_0/eagle_head for vLLM's torch.compile
(EngineCore pid=258) INFO 04-24 10:22:26 [backends.py:1111] Dynamo bytecode transform time: 0.28 s
(EngineCore pid=258) INFO 04-24 10:22:31 [backends.py:390] Compiling a graph for compile range (1, 2048) takes 5.17 s
(EngineCore pid=258) INFO 04-24 10:22:31 [decorators.py:655] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/43a9ec07c0db6ac1a1547137937e53c148a5dbc5a085a257bbc2c9d53c5f84a1/rank_0_0/model
(EngineCore pid=258) INFO 04-24 10:22:31 [monitor.py:48] torch.compile took 5.52 s in total
(EngineCore pid=258) INFO 04-24 10:22:32 [monitor.py:76] Initial profiling/warmup run took 0.38 s
(EngineCore pid=258) WARNING 04-24 10:22:36 [kv_cache_utils.py:1059] Add 3 padding layers, may waste at most 6.25% KV cache memory
(EngineCore pid=258) INFO 04-24 10:22:36 [kv_cache_utils.py:829] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=512
(EngineCore pid=258) WARNING 04-24 10:22:36 [gpu_model_runner.py:6363] CUDAGraphMode.FULL_AND_PIECEWISE is not supported with spec-decode for attention backend FlashInferBackend (support: AttentionCGSupport.UNIFORM_SINGLE_TOKEN_DECODE); setting cudagraph_mode=PIECEWISE
Then I get an OOM while allocating the KV cache:
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108] EngineCore failed to start.
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108] Traceback (most recent call last):
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1082, in run_engine_core
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 848, in __init__
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]     super().__init__(
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 124, in __init__
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]     kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 247, in _initialize_kv_caches
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]     available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 136, in determine_available_memory
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]     return self.collective_rpc("determine_available_memory")
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 80, in collective_rpc
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/serial_utils.py", line 510, in run_method
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 381, in determine_available_memory
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]     cudagraph_memory_estimate = self.model_runner.profile_cudagraph_memory()
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5864, in profile_cudagraph_memory
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]     self._init_minimal_kv_cache_for_profiling()
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5815, in _init_minimal_kv_cache_for_profiling
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]     self.initialize_kv_cache(minimal_config)
(EngineCore pid=258) Process EngineCore:
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 6797, in initialize_kv_cache
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]     kv_caches = self.initialize_kv_cache_tensors(
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 6718, in initialize_kv_cache_tensors
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]     kv_cache_raw_tensors = self._allocate_kv_cache_tensors(kv_cache_config)
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 6529, in _allocate_kv_cache_tensors
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]     tensor = torch.zeros(
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108]              ^^^^^^^^^^^^
(EngineCore pid=258) ERROR 04-24 10:22:36 [core.py:1108] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.56 GiB. GPU 0 has a total capacity of 31.36 GiB of which 1.37 GiB is free. Including non-PyTorch memory, this process has 29.98 GiB memory in use. Of the allocated memory 24.95 GiB is allocated by PyTorch, and 758.67 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
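Putting the numbers from that message side by side (every figure below is quoted from the error; only the arithmetic is mine):

# All figures come straight from the OOM message above.
total_capacity = 31.36                  # GiB, total capacity of GPU 0
process_in_use = 29.98                  # GiB, including non-PyTorch memory
torch_allocated = 24.95                 # GiB, allocated by PyTorch
torch_reserved_unalloc = 758.67 / 1024  # GiB, reserved but unallocated

print(0.8 * total_capacity)  # ~25.09 GiB, the budget from --gpu-memory-utilization 0.8
print(process_in_use - torch_allocated - torch_reserved_unalloc)  # ~4.29 GiB non-PyTorch

So PyTorch sits almost exactly at the 0.8 budget, but the ~4.3 GiB of non-PyTorch memory on top of it leaves only 1.37 GiB free, and the 1.56 GiB KV-cache allocation no longer fits.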
These are the parameters I used to serve the model with the Docker image “openai:latest” (a rough offline-API equivalent is sketched after the list):
--gpu-memory-utilization 0.8
--enable-prefix-caching
--enable-chunked-prefill
--language-model-only
--max-model-len 16384
--kv-cache-dtype fp8
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
--reasoning-parser qwen3
--speculative-config '{"method": "mtp","num_speculative_tokens":2}'
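In case it is easier to reproduce outside the container, this should be the rough offline-API equivalent of the command above. It is untested on my side (I only run the Docker image), kwarg names can differ between vLLM versions, and I left out --language-model-only plus the tool/reasoning parser flags since those are server-side options:

from vllm import LLM

# Sketch only: same settings as my Docker flags, assuming a recent vLLM.
llm = LLM(
    model="cyankiwi/Qwen3.6-27B-AWQ-INT4",
    gpu_memory_utilization=0.8,
    enable_prefix_caching=True,
    enable_chunked_prefill=True,
    max_model_len=16384,
    kv_cache_dtype="fp8",
    speculative_config={"method": "mtp", "num_speculative_tokens": 2},
)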
Any hints?