torch.OutOfMemoryError: CUDA out of memory

I am trying to run vLLM on an RTX 5060 Ti with 16 GB of VRAM. How can I overcome this memory issue?

vLLM was installed via:

uv pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly

Transformers was installed via:

uv pip install git+https://github.com/huggingface/transformers.git

To start the vllm server, I ran:

$ echo $PYTORCH_CUDA_ALLOC_CONF
expandable_segments:True

$ vllm serve Qwen/Qwen3.5-9B --port 8000 --tensor-parallel-size 1 --max-model-len 8192 --reasoning-parser qwen

Error message:

(EngineCore_DP0 pid=527157) ERROR 03-03 10:43:03 [gpu_model_runner.py:4331] Failed to load model - not enough GPU memory. Try lowering --gpu-memory-utilization to free memory for weights, increasing --tensor-parallel-size, or using --quantization. See https://docs.vllm.ai/en/latest/configuration/conserving_memory/ for more tips. (original error: CUDA out of memory. Tried to allocate 96.00 MiB. GPU 0 has a total capacity of 15.47 GiB of which 52.94 MiB is free. Including non-PyTorch memory, this process has 14.39 GiB memory in use. Of the allocated memory 14.16 GiB is allocated by PyTorch, and 11.12 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables))

According to nvidia-smi, the GPU memory is only 9% utilized, i.e. 91% of the 16311 MiB, about 14843 MiB, is free.

I don't understand this statement. Can you explain?

Including non-PyTorch memory, this process has 14.39 GiB memory in use. Of the allocated memory 14.16 GiB is allocated by PyTorch, and 11.12 MiB is reserved by PyTorch but unallocated.

$ nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.48.01              Driver Version: 590.48.01      CUDA Version: 13.1     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5060 Ti     Off |   00000000:01:00.0  On |                  N/A |
|  0%   47C    P5             14W /  180W |    1041MiB /  16311MiB |      9%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+


Although the model file is large, I am able to run it with Ollama, so I think I should be able to run the same model via vLLM.

This is my first time using vLLM.

The error occurs because vLLM pre-allocates most GPU memory for model weights and KV cache, and the Qwen3.5-9B model in BF16/FP16 precision typically requires more than 16GB VRAM (about 18GB+ for weights and cache). The nvidia-smi output shows only 1GB used because vLLM failed to load the model and exited before allocating the full memory block, so the memory usage you see is not representative of what vLLM actually needs to run the model. The error message means PyTorch tried to allocate more memory than was available, including both PyTorch-managed and other CUDA allocations, and failed due to insufficient VRAM. Ollama works because it uses a quantized (e.g., 4-bit) version of the model, which fits in less memory.

To overcome this, you can:

  • Use a quantized model (e.g., 4-bit or 8-bit) compatible with vLLM.
  • Lower the context length (--max-model-len), batch size (--max-num-seqs), or --gpu-memory-utilization.
  • Try CPU offloading or tensor parallelism if you have multiple GPUs, but with a single 16GB GPU, quantization is the most effective solution. See conserving memory guide for details.
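As a rough back-of-envelope check (my own estimate, not vLLM's exact accounting), you can see why the unquantized weights alone already overflow a 16 GB card:

```python
def weight_gib(params_billions: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GiB: parameter count times bytes per parameter."""
    return params_billions * 1e9 * bytes_per_param / (1024 ** 3)

# A 9B-parameter model in BF16/FP16 (2 bytes/param) vs. 4-bit (~0.5 bytes/param).
# vLLM additionally reserves room for the KV cache, activations, and CUDA graphs.
print(f"BF16 weights:  ~{weight_gib(9, 2):.1f} GiB")   # already more than 16 GB of VRAM
print(f"4-bit weights: ~{weight_gib(9, 0.5):.1f} GiB")
```

This ignores quantization metadata and runtime overhead, but it explains why the BF16 checkpoint cannot fit while a 4-bit one can.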

Would you like a step-by-step guide on running a quantized model with vLLM?

Sources:

Hint: Mention @RunLLM in the post for followups.

@RunLLM Yes, pls provide a step by step guide. Thanks.

Certainly! Here’s a step-by-step guide to running a quantized (e.g., 4-bit) model with vLLM on your 16GB GPU:

  1. Find a Quantized Model:
    Search for a 4-bit or 8-bit quantized version of your desired model on Hugging Face (look for models with “AWQ”, “GPTQ”, or “int4” in the name), e.g., Qwen/Qwen2.5-7B-Instruct-AWQ.

  2. Start vLLM with Quantization:
    Use the --quantization flag to specify the quantization type (e.g., awq for AWQ models, gptq for GPTQ models).
    Example command:

    vllm serve Qwen/Qwen2.5-7B-Instruct-AWQ --quantization awq --port 8000 --max-model-len 4096
    

    Adjust --max-model-len as needed to fit your GPU.

  3. (Optional) Reduce Memory Further:

    • Lower --max-model-len (context length) or --max-num-seqs (batch size).
    • Set --gpu-memory-utilization to a value like 0.8 if you still get OOM errors.
  4. Test Inference:
    Use the OpenAI-compatible API or vLLM’s Python API to send prompts and verify the model runs.
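To see why --max-model-len matters so much, here is a rough KV-cache estimate. It plugs in Qwen2.5-7B's published config values (28 layers, 4 KV heads via GQA, head_dim 128); treat the numbers as approximate, since vLLM's own accounting adds block-allocation overhead:

```python
def kv_cache_bytes(num_tokens: int, layers: int, kv_heads: int, head_dim: int,
                   dtype_bytes: int = 2) -> int:
    """Bytes of KV cache: a K and a V tensor per layer, per token, at the given precision."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes * num_tokens

# FP16 KV cache for a single sequence at different context lengths
for ctx in (32768, 8192, 4096):
    mib = kv_cache_bytes(ctx, layers=28, kv_heads=4, head_dim=128) / (1024 ** 2)
    print(f"max_model_len={ctx}: ~{mib:.0f} MiB per sequence")
```

At the default 32768 context this is roughly 1.8 GiB per concurrent sequence, which is why shrinking the context length frees so much room.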

Would you like a code example for Python inference or more details on quantization types?


@RunLLM Yes, provide both. A code example for Python inference and more details on quantization types.

Python Inference Example:
After starting the vLLM server with a quantized model, you can use the Python API as follows:

from vllm import LLM, SamplingParams

# Use the quantized model name and quantization type
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct-AWQ",
    quantization="awq"
)

prompts = ["What is the capital of France?"]
sampling_params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print("Prompt:", output.prompt)
    print("Generated:", output.outputs[0].text)

You can also use the OpenAI-compatible API by sending requests to the server’s /v1/completions endpoint.
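For the /v1/completions route, the request body looks like this (a sketch that assumes the server from step 2 is listening on localhost:8000; the sending code is commented out so you can adapt it):

```python
import json

# Request body for vLLM's OpenAI-compatible /v1/completions endpoint
payload = {
    "model": "Qwen/Qwen2.5-7B-Instruct-AWQ",
    "prompt": "What is the capital of France?",
    "max_tokens": 64,
    "temperature": 0.7,
}
body = json.dumps(payload).encode()
print(body.decode())

# To actually send it, with the server running on localhost:8000:
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8000/v1/completions",
#     data=body,
#     headers={"Content-Type": "application/json"},
# )
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```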

Quantization Types Supported by vLLM:

  • AWQ: 4-bit weight quantization, good for memory savings and speed (AWQ guide).
  • GPTQ: 4/8-bit quantization, supports dynamic per-module quantization (GPTQ guide).
  • INT4/INT8: General 4-bit or 8-bit quantization, often via llm-compressor.
  • FP8: 8-bit floating point, requires newer GPUs (FP8 guide).
  • bitsandbytes: 4-bit online quantization, but only q4 is supported (bitsandbytes guide).

For a full list and hardware compatibility, see the quantization overview.
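As a rule of thumb (my own arithmetic, ignoring the small overhead of quantization scales and zero points), the weight memory for a 7B model under each scheme is roughly:

```python
# Approximate bytes per parameter by scheme; real checkpoints carry a bit of extra metadata.
schemes = {"FP16/BF16": 2.0, "FP8/INT8": 1.0, "AWQ/GPTQ/INT4": 0.5}
for name, bytes_per_param in schemes.items():
    gib = 7e9 * bytes_per_param / (1024 ** 3)
    print(f"{name:14s} ~{gib:.1f} GiB")
```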


@RunLLM The Qwen/Qwen2.5-7B-Instruct-AWQ server started.

(APIServer pid=591196) INFO:     Started server process [591196]
(APIServer pid=591196) INFO:     Waiting for application startup.
(APIServer pid=591196) INFO:     Application startup complete.

Thereafter, no command prompt is returned.

I had to open a new terminal, go into the project directory, run source .venv/bin/activate, and run the Python sample script you provided. However, it failed. How do I resolve these errors?

$ python main.py
INFO 03-03 12:24:37 [utils.py:229] non-default args: {'disable_log_stats': True, 'quantization': 'awq', 'model': 'Qwen/Qwen2.5-7B-Instruct-AWQ'}
INFO 03-03 12:24:39 [model.py:530] Resolved architecture: Qwen2ForCausalLM
INFO 03-03 12:24:39 [model.py:1553] Using max model len 32768
INFO 03-03 12:24:40 [awq_marlin.py:166] Detected that the model can run with awq_marlin, however you specified quantization=awq explicitly, so forcing awq. Use quantization=awq_marlin for faster inference
INFO 03-03 12:24:40 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=8192.
Parse safetensors files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 6.56it/s]
INFO 03-03 12:24:41 [vllm.py:747] Asynchronous scheduling is enabled.
(EngineCore_DP0 pid=603265) INFO 03-03 12:24:43 [core.py:101] Initializing a V1 LLM engine (v0.16.1rc1.dev158+g168ee03e1) with config: model=‘Qwen/Qwen2.5-7B-Instruct-AWQ’, speculative_config=None, tokenizer=‘Qwen/Qwen2.5-7B-Instruct-AWQ’, skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=awq, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend=‘auto’, disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=‘’, reasoning_parser_plugin=‘’, enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen/Qwen2.5-7B-Instruct-AWQ, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={‘level’: None, ‘mode’: <CompilationMode.VLLM_COMPILE: 3>, ‘debug_dump_path’: None, ‘cache_dir’: ‘’, ‘compile_cache_save_format’: ‘binary’, ‘backend’: ‘inductor’, ‘custom_ops’: [‘none’], ‘splitting_ops’: [‘vllm::unified_attention’, ‘vllm::unified_attention_with_output’, ‘vllm::unified_mla_attention’, ‘vllm::unified_mla_attention_with_output’, ‘vllm::mamba_mixer2’, ‘vllm::mamba_mixer’, ‘vllm::short_conv’, ‘vllm::linear_attention’, ‘vllm::plamo2_mamba_mixer’, ‘vllm::gdn_attention_core’, ‘vllm::kda_attention’, ‘vllm::sparse_attn_indexer’, ‘vllm::rocm_aiter_sparse_attn_indexer’, 
‘vllm::unified_kv_cache_update’, ‘vllm::unified_mla_kv_cache_update’], ‘compile_mm_encoder’: False, ‘compile_sizes’: , ‘compile_ranges_split_points’: [8192], ‘inductor_compile_config’: {‘enable_auto_functionalized_v2’: False, ‘combo_kernels’: True, ‘benchmark_combo_kernel’: True}, ‘inductor_passes’: {}, ‘cudagraph_mode’: <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, ‘cudagraph_num_of_warmups’: 1, ‘cudagraph_capture_sizes’: [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], ‘cudagraph_copy_inputs’: False, ‘cudagraph_specialize_lora’: True, ‘use_inductor_graph_partition’: False, ‘pass_config’: {‘fuse_norm_quant’: False, ‘fuse_act_quant’: False, ‘fuse_attn_quant’: False, ‘enable_sp’: False, ‘fuse_gemm_comms’: False, ‘fuse_allreduce_rms’: False}, ‘max_cudagraph_capture_size’: 512, ‘dynamic_shapes_config’: {‘type’: <DynamicShapesType.BACKED: ‘backed’>, ‘evaluate_guards’: False, ‘assume_32_bit_indexing’: False}, ‘local_cache_dir’: None, ‘fast_moe_cold_start’: True, ‘static_all_moe_layers’: }
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] EngineCore failed to start.
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] Traceback (most recent call last):
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100]   File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/engine/core.py", line 1090, in run_engine_core
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100]   File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100]     return func(*args, **kwargs)
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100]   File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/engine/core.py", line 834, in __init__
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100]     super().__init__(
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100]     ~~~~~~~~~~~~~~~~^
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100]         vllm_config,
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100]         ^^^^^^^^^^^^
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100]     ...<3 lines>...
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100]         internal_dp_balancing,
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100]         ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100]     )
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100]     ^
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100]   File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/engine/core.py", line 110, in __init__
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100]     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100]                           ~~~~~~~~~~~~~~^^^^^^^^^^^^^
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100]   File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100]     return func(*args, **kwargs)
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100]   File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/executor/abstract.py", line 103, in __init__
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100]     self._init_executor()
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100]     ~~~~~~~~~~~~~~~~~~~^^
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100]   File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/executor/uniproc_executor.py", line 48, in _init_executor
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100]     self.driver_worker.init_device()
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100]     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100]   File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/worker/worker_base.py", line 326, in init_device
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100]     self.worker.init_device()  # type: ignore
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100]     ~~~~~~~~~~~~~~~~~~~~~~~^^
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100]   File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100]     return func(*args, **kwargs)
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100]   File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/worker/gpu_worker.py", line 256, in init_device
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100]     current_platform.set_device(self.device)
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100]     ~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100]   File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/platforms/cuda.py", line 145, in set_device
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100]     _ = torch.zeros(1, device=device)
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] torch.AcceleratorError: CUDA error: out of memory
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] Search for `cudaErrorMemoryAllocation` in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore_DP0 pid=603265) ERROR 03-03 12:24:43 [core.py:1100] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(EngineCore_DP0 pid=603265) Process EngineCore_DP0:
(EngineCore_DP0 pid=603265) Traceback (most recent call last):
(EngineCore_DP0 pid=603265)   File "/home/master/.local/share/uv/python/cpython-3.13.5-linux-x86_64-gnu/lib/python3.13/multiprocessing/process.py", line 313, in _bootstrap
(EngineCore_DP0 pid=603265)     self.run()
(EngineCore_DP0 pid=603265)   File "/home/master/.local/share/uv/python/cpython-3.13.5-linux-x86_64-gnu/lib/python3.13/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=603265)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=603265)   File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/engine/core.py", line 1104, in run_engine_core
(EngineCore_DP0 pid=603265)     raise e
(EngineCore_DP0 pid=603265)   File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/engine/core.py", line 1090, in run_engine_core
(EngineCore_DP0 pid=603265)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=603265)   File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=603265)     return func(*args, **kwargs)
(EngineCore_DP0 pid=603265)   File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/engine/core.py", line 834, in __init__
(EngineCore_DP0 pid=603265)     super().__init__(
(EngineCore_DP0 pid=603265)         vllm_config,
(EngineCore_DP0 pid=603265)     ...<3 lines>...
(EngineCore_DP0 pid=603265)         internal_dp_balancing,
(EngineCore_DP0 pid=603265)     )
(EngineCore_DP0 pid=603265)   File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/engine/core.py", line 110, in __init__
(EngineCore_DP0 pid=603265)     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=603265)   File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=603265)     return func(*args, **kwargs)
(EngineCore_DP0 pid=603265)   File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/executor/abstract.py", line 103, in __init__
(EngineCore_DP0 pid=603265)     self._init_executor()
(EngineCore_DP0 pid=603265)   File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/executor/uniproc_executor.py", line 48, in _init_executor
(EngineCore_DP0 pid=603265)     self.driver_worker.init_device()
(EngineCore_DP0 pid=603265)   File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/worker/worker_base.py", line 326, in init_device
(EngineCore_DP0 pid=603265)     self.worker.init_device()  # type: ignore
(EngineCore_DP0 pid=603265)   File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=603265)     return func(*args, **kwargs)
(EngineCore_DP0 pid=603265)   File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/worker/gpu_worker.py", line 256, in init_device
(EngineCore_DP0 pid=603265)     current_platform.set_device(self.device)
(EngineCore_DP0 pid=603265)   File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/platforms/cuda.py", line 145, in set_device
(EngineCore_DP0 pid=603265)     _ = torch.zeros(1, device=device)
(EngineCore_DP0 pid=603265) torch.AcceleratorError: CUDA error: out of memory
(EngineCore_DP0 pid=603265) Search for `cudaErrorMemoryAllocation` in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(EngineCore_DP0 pid=603265) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore_DP0 pid=603265) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore_DP0 pid=603265) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Traceback (most recent call last):
  File "/home/vllm/project1/main.py", line 4, in <module>
    llm = LLM(model="Qwen/Qwen2.5-7B-Instruct-AWQ", quantization="awq")
  File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/entrypoints/llm.py", line 373, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
        engine_args=engine_args, usage_context=UsageContext.LLM_CLASS
    )
  File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/engine/llm_engine.py", line 177, in from_engine_args
    return cls(
        vllm_config=vllm_config,
    ...<4 lines>...
        multiprocess_mode=enable_multiprocessing,
    )
  File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/engine/llm_engine.py", line 111, in __init__
    self.engine_core = EngineCoreClient.make_client(
        multiprocess_mode=multiprocess_mode,
    ...<3 lines>...
        log_stats=self.log_stats,
    )
  File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/engine/core_client.py", line 98, in make_client
    return SyncMPClient(vllm_config, executor_class, log_stats)
  File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
    return func(*args, **kwargs)
  File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/engine/core_client.py", line 729, in __init__
    super().__init__(
        asyncio_mode=False,
    ...<2 lines>...
        log_stats=log_stats,
    )
  File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/engine/core_client.py", line 569, in __init__
    with launch_core_engines(
        vllm_config,
    ...<2 lines>...
        addresses,
    ) as (engine_manager, coordinator, addresses):
  File "/home/master/.local/share/uv/python/cpython-3.13.5-linux-x86_64-gnu/lib/python3.13/contextlib.py", line 148, in __exit__
    next(self.gen)
  File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/engine/utils.py", line 951, in launch_core_engines
    wait_for_engine_startup(
        handshake_socket,
    ...<6 lines>...
        coordinator.proc if coordinator else None,
    )
  File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/engine/utils.py", line 1010, in wait_for_engine_startup
    raise RuntimeError(
    ...<3 lines>...
    )
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

Your error is a CUDA out-of-memory (OOM) issue: even with a quantized model, vLLM is trying to allocate more GPU memory than is available. This can happen if the default context length (max_model_len) is too high, or if other processes are using GPU memory. The server not returning a prompt is normal: it runs in the foreground and waits for API requests or Python calls.

How to resolve:

  1. Lower max_model_len:
    The default for Qwen2.5-7B-Instruct-AWQ is often 32768, which is too high for a 16GB GPU. Try reducing it to 4096 or 2048:

    llm = LLM(
        model="Qwen/Qwen2.5-7B-Instruct-AWQ",
        quantization="awq",
        max_model_len=4096
    )
    
  2. Reduce batch size:
    If you use the Python API, also set max_num_seqs=1 to minimize memory usage.

  3. Set gpu_memory_utilization:
    Explicitly set this to a lower value (e.g., 0.8) to avoid OOM:

    llm = LLM(
        model="Qwen/Qwen2.5-7B-Instruct-AWQ",
        quantization="awq",
        max_model_len=4096,
        gpu_memory_utilization=0.8
    )
    
  4. Check for other GPU processes:
    Run nvidia-smi and kill any other processes using the GPU.

  5. Restart your Python environment:
    Sometimes, zombie processes or memory leaks can cause OOM. Restart your terminal and Python environment.

If you still get OOM, try lowering max_model_len further (e.g., 2048). For more, see the conserving memory guide.
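For intuition on gpu_memory_utilization: vLLM budgets roughly that fraction of total VRAM for weights plus KV cache. On your card (my arithmetic, not vLLM's exact accounting):

```python
total_gib = 15.47  # RTX 5060 Ti capacity as reported in the earlier OOM message
for util in (0.9, 0.8, 0.5):
    budget = total_gib * util
    print(f"gpu_memory_utilization={util}: vLLM budget ~{budget:.1f} GiB")
# The AWQ weights (~5-6 GiB) plus the KV cache must fit inside that budget,
# so setting the value too low can itself cause a startup failure.
```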

Would you like a full sample script with these settings?


@RunLLM Nothing works.

$ python main.py
INFO 03-03 12:45:16 [utils.py:229] non-default args: {'max_model_len': 2048, 'gpu_memory_utilization': 0.5, 'disable_log_stats': True, 'quantization': 'awq', 'model': 'Qwen/Qwen2.5-7B-Instruct-AWQ'}
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
[2026-03-03 12:45:17] WARNING _http.py:857: Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
INFO 03-03 12:45:18 [model.py:530] Resolved architecture: Qwen2ForCausalLM
INFO 03-03 12:45:18 [model.py:1553] Using max model len 2048
INFO 03-03 12:45:18 [awq_marlin.py:166] Detected that the model can run with awq_marlin, however you specified quantization=awq explicitly, so forcing awq. Use quantization=awq_marlin for faster inference
INFO 03-03 12:45:18 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=8192.
Parse safetensors files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  5.60it/s]
INFO 03-03 12:45:20 [vllm.py:747] Asynchronous scheduling is enabled.
(EngineCore_DP0 pid=620583) INFO 03-03 12:45:22 [core.py:101] Initializing a V1 LLM engine (v0.16.1rc1.dev158+g168ee03e1) with config: model='Qwen/Qwen2.5-7B-Instruct-AWQ', speculative_config=None, tokenizer='Qwen/Qwen2.5-7B-Instruct-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=awq, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen/Qwen2.5-7B-Instruct-AWQ, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 
'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=620583) ERROR 03-03 12:45:22 [core.py:1100] EngineCore failed to start.
(EngineCore_DP0 pid=620583) ERROR 03-03 12:45:22 [core.py:1100] Traceback (most recent call last):
(EngineCore_DP0 pid=620583) ERROR 03-03 12:45:22 [core.py:1100]   File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/engine/core.py", line 1090, in run_engine_core
(EngineCore_DP0 pid=620583) ERROR 03-03 12:45:22 [core.py:1100]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=620583) ERROR 03-03 12:45:22 [core.py:1100]   File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=620583) ERROR 03-03 12:45:22 [core.py:1100]     return func(*args, **kwargs)
(EngineCore_DP0 pid=620583) ERROR 03-03 12:45:22 [core.py:1100]   File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/engine/core.py", line 834, in __init__
(EngineCore_DP0 pid=620583) ERROR 03-03 12:45:22 [core.py:1100]     super().__init__(
(EngineCore_DP0 pid=620583) ERROR 03-03 12:45:22 [core.py:1100]     ~~~~~~~~~~~~~~~~^
(EngineCore_DP0 pid=620583) ERROR 03-03 12:45:22 [core.py:1100]         vllm_config,
(EngineCore_DP0 pid=620583) ERROR 03-03 12:45:22 [core.py:1100]         ^^^^^^^^^^^^
(EngineCore_DP0 pid=620583) ERROR 03-03 12:45:22 [core.py:1100]     ...<3 lines>...
(EngineCore_DP0 pid=620583) ERROR 03-03 12:45:22 [core.py:1100]         internal_dp_balancing,
(EngineCore_DP0 pid=620583) ERROR 03-03 12:45:22 [core.py:1100]         ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=620583) ERROR 03-03 12:45:22 [core.py:1100]     )
(EngineCore_DP0 pid=620583) ERROR 03-03 12:45:22 [core.py:1100]     ^
(EngineCore_DP0 pid=620583) ERROR 03-03 12:45:22 [core.py:1100]   File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/engine/core.py", line 110, in __init__
(EngineCore_DP0 pid=620583) ERROR 03-03 12:45:22 [core.py:1100]     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=620583) ERROR 03-03 12:45:22 [core.py:1100]                           ~~~~~~~~~~~~~~^^^^^^^^^^^^^
(EngineCore_DP0 pid=620583) ERROR 03-03 12:45:22 [core.py:1100]   File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=620583) ERROR 03-03 12:45:22 [core.py:1100]     return func(*args, **kwargs)
(EngineCore_DP0 pid=620583) ERROR 03-03 12:45:22 [core.py:1100]   File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/executor/abstract.py", line 103, in __init__
(EngineCore_DP0 pid=620583) ERROR 03-03 12:45:22 [core.py:1100]     self._init_executor()
(EngineCore_DP0 pid=620583) ERROR 03-03 12:45:22 [core.py:1100]     ~~~~~~~~~~~~~~~~~~~^^
(EngineCore_DP0 pid=620583) ERROR 03-03 12:45:22 [core.py:1100]   File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/executor/uniproc_executor.py", line 48, in _init_executor
(EngineCore_DP0 pid=620583) ERROR 03-03 12:45:22 [core.py:1100]     self.driver_worker.init_device()
(EngineCore_DP0 pid=620583) ERROR 03-03 12:45:22 [core.py:1100]     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
(EngineCore_DP0 pid=620583) ERROR 03-03 12:45:22 [core.py:1100]   File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/worker/worker_base.py", line 326, in init_device
(EngineCore_DP0 pid=620583) ERROR 03-03 12:45:22 [core.py:1100]     self.worker.init_device()  # type: ignore
(EngineCore_DP0 pid=620583) ERROR 03-03 12:45:22 [core.py:1100]     ~~~~~~~~~~~~~~~~~~~~~~~^^
(EngineCore_DP0 pid=620583) ERROR 03-03 12:45:22 [core.py:1100]   File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=620583) ERROR 03-03 12:45:22 [core.py:1100]     return func(*args, **kwargs)
(EngineCore_DP0 pid=620583) ERROR 03-03 12:45:22 [core.py:1100]   File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/worker/gpu_worker.py", line 256, in init_device
(EngineCore_DP0 pid=620583) ERROR 03-03 12:45:22 [core.py:1100]     current_platform.set_device(self.device)
(EngineCore_DP0 pid=620583) ERROR 03-03 12:45:22 [core.py:1100]     ~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
(EngineCore_DP0 pid=620583) ERROR 03-03 12:45:22 [core.py:1100]   File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/platforms/cuda.py", line 141, in set_device
(EngineCore_DP0 pid=620583) ERROR 03-03 12:45:22 [core.py:1100]     torch.cuda.set_device(device)
(EngineCore_DP0 pid=620583) ERROR 03-03 12:45:22 [core.py:1100]     ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^
(EngineCore_DP0 pid=620583) ERROR 03-03 12:45:22 [core.py:1100]   File "/home/vllm/project1/.venv/lib/python3.13/site-packages/torch/cuda/__init__.py", line 584, in set_device
(EngineCore_DP0 pid=620583) ERROR 03-03 12:45:22 [core.py:1100]     torch._C._cuda_setDevice(device)
(EngineCore_DP0 pid=620583) ERROR 03-03 12:45:22 [core.py:1100]     ~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^
(EngineCore_DP0 pid=620583) ERROR 03-03 12:45:22 [core.py:1100] torch.AcceleratorError: CUDA error: out of memory
(EngineCore_DP0 pid=620583) ERROR 03-03 12:45:22 [core.py:1100] Search for `cudaErrorMemoryAllocation' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(EngineCore_DP0 pid=620583) ERROR 03-03 12:45:22 [core.py:1100] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore_DP0 pid=620583) ERROR 03-03 12:45:22 [core.py:1100] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore_DP0 pid=620583) ERROR 03-03 12:45:22 [core.py:1100] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(EngineCore_DP0 pid=620583) ERROR 03-03 12:45:22 [core.py:1100]
(EngineCore_DP0 pid=620583) Process EngineCore_DP0:
(EngineCore_DP0 pid=620583) Traceback (most recent call last):
(EngineCore_DP0 pid=620583)   File "/home/master/.local/share/uv/python/cpython-3.13.5-linux-x86_64-gnu/lib/python3.13/multiprocessing/process.py", line 313, in _bootstrap
(EngineCore_DP0 pid=620583)     self.run()
(EngineCore_DP0 pid=620583)     ~~~~~~~~^^
(EngineCore_DP0 pid=620583)   File "/home/master/.local/share/uv/python/cpython-3.13.5-linux-x86_64-gnu/lib/python3.13/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=620583)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=620583)     ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=620583)   File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/engine/core.py", line 1104, in run_engine_core
(EngineCore_DP0 pid=620583)     raise e
(EngineCore_DP0 pid=620583)   File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/engine/core.py", line 1090, in run_engine_core
(EngineCore_DP0 pid=620583)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=620583)   File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=620583)     return func(*args, **kwargs)
(EngineCore_DP0 pid=620583)   File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/engine/core.py", line 834, in __init__
(EngineCore_DP0 pid=620583)     super().__init__(
(EngineCore_DP0 pid=620583)     ~~~~~~~~~~~~~~~~^
(EngineCore_DP0 pid=620583)         vllm_config,
(EngineCore_DP0 pid=620583)         ^^^^^^^^^^^^
(EngineCore_DP0 pid=620583)     ...<3 lines>...
(EngineCore_DP0 pid=620583)         internal_dp_balancing,
(EngineCore_DP0 pid=620583)         ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=620583)     )
(EngineCore_DP0 pid=620583)     ^
(EngineCore_DP0 pid=620583)   File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/engine/core.py", line 110, in __init__
(EngineCore_DP0 pid=620583)     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=620583)                           ~~~~~~~~~~~~~~^^^^^^^^^^^^^
(EngineCore_DP0 pid=620583)   File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=620583)     return func(*args, **kwargs)
(EngineCore_DP0 pid=620583)   File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/executor/abstract.py", line 103, in __init__
(EngineCore_DP0 pid=620583)     self._init_executor()
(EngineCore_DP0 pid=620583)     ~~~~~~~~~~~~~~~~~~~^^
(EngineCore_DP0 pid=620583)   File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/executor/uniproc_executor.py", line 48, in _init_executor
(EngineCore_DP0 pid=620583)     self.driver_worker.init_device()
(EngineCore_DP0 pid=620583)     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
(EngineCore_DP0 pid=620583)   File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/worker/worker_base.py", line 326, in init_device
(EngineCore_DP0 pid=620583)     self.worker.init_device()  # type: ignore
(EngineCore_DP0 pid=620583)     ~~~~~~~~~~~~~~~~~~~~~~~^^
(EngineCore_DP0 pid=620583)   File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=620583)     return func(*args, **kwargs)
(EngineCore_DP0 pid=620583)   File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/worker/gpu_worker.py", line 256, in init_device
(EngineCore_DP0 pid=620583)     current_platform.set_device(self.device)
(EngineCore_DP0 pid=620583)     ~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
(EngineCore_DP0 pid=620583)   File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/platforms/cuda.py", line 141, in set_device
(EngineCore_DP0 pid=620583)     torch.cuda.set_device(device)
(EngineCore_DP0 pid=620583)     ~~~~~~~~~~~~~~~~~~~~~^^^^^^^^
(EngineCore_DP0 pid=620583)   File "/home/vllm/project1/.venv/lib/python3.13/site-packages/torch/cuda/__init__.py", line 584, in set_device
(EngineCore_DP0 pid=620583)     torch._C._cuda_setDevice(device)
(EngineCore_DP0 pid=620583)     ~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^
(EngineCore_DP0 pid=620583) torch.AcceleratorError: CUDA error: out of memory
(EngineCore_DP0 pid=620583) Search for `cudaErrorMemoryAllocation' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(EngineCore_DP0 pid=620583) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore_DP0 pid=620583) For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore_DP0 pid=620583) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(EngineCore_DP0 pid=620583)
Traceback (most recent call last):
  File "/home/vllm/project1/main.py", line 4, in <module>
    llm = LLM(
        model="Qwen/Qwen2.5-7B-Instruct-AWQ",
    ...<2 lines>...
        gpu_memory_utilization=0.5,
    )
  File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/entrypoints/llm.py", line 373, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
                      ~~~~~~~~~~~~~~~~~~~~~~~~~~^
        engine_args=engine_args, usage_context=UsageContext.LLM_CLASS
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/engine/llm_engine.py", line 177, in from_engine_args
    return cls(
        vllm_config=vllm_config,
    ...<4 lines>...
        multiprocess_mode=enable_multiprocessing,
    )
  File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/engine/llm_engine.py", line 111, in __init__
    self.engine_core = EngineCoreClient.make_client(
                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        multiprocess_mode=multiprocess_mode,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<3 lines>...
        log_stats=self.log_stats,
        ^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/engine/core_client.py", line 98, in make_client
    return SyncMPClient(vllm_config, executor_class, log_stats)
  File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
    return func(*args, **kwargs)
  File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/engine/core_client.py", line 729, in __init__
    super().__init__(
    ~~~~~~~~~~~~~~~~^
        asyncio_mode=False,
        ^^^^^^^^^^^^^^^^^^^
    ...<2 lines>...
        log_stats=log_stats,
        ^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/engine/core_client.py", line 569, in __init__
    with launch_core_engines(
         ~~~~~~~~~~~~~~~~~~~^
        vllm_config,
        ^^^^^^^^^^^^
    ...<2 lines>...
        addresses,
        ^^^^^^^^^^
    ) as (engine_manager, coordinator, addresses):
    ^
  File "/home/master/.local/share/uv/python/cpython-3.13.5-linux-x86_64-gnu/lib/python3.13/contextlib.py", line 148, in __exit__
    next(self.gen)
    ~~~~^^^^^^^^^^
  File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/engine/utils.py", line 951, in launch_core_engines
    wait_for_engine_startup(
    ~~~~~~~~~~~~~~~~~~~~~~~^
        handshake_socket,
        ^^^^^^^^^^^^^^^^^
    ...<6 lines>...
        coordinator.proc if coordinator else None,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/vllm/project1/.venv/lib/python3.13/site-packages/vllm/v1/engine/utils.py", line 1010, in wait_for_engine_startup
    raise RuntimeError(
    ...<3 lines>...
    )
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

Your GPU is still running out of memory, even with max_model_len=2048 and gpu_memory_utilization=0.5. Notably, this failure occurs in torch.cuda.set_device(), before vLLM allocates anything itself, which suggests that either (1) another process is already holding most of the GPU memory, (2) the model is still too large for your GPU, or (3) vLLM's memory allocation is hitting a lower bound due to CUDA graphs or other overhead.

Key troubleshooting steps:

  1. Check for other GPU processes:
    Run nvidia-smi and make sure no other processes are using GPU memory. If you see any, terminate them (try kill <pid> first; use kill -9 <pid> only as a last resort).

  2. Try enforce_eager mode:
    CUDA graphs can consume extra memory. Add enforce_eager=True to your LLM constructor:

    llm = LLM(
        model="Qwen/Qwen2.5-7B-Instruct-AWQ",
        quantization="awq",
        max_model_len=2048,
        gpu_memory_utilization=0.5,
        enforce_eager=True
    )
    

    This disables CUDA graphs and can reduce memory usage (see docs).

  3. Lower max_model_len further:
    Try max_model_len=1024 or even 512. Some users report that only very small context lengths work on 16GB cards with 7B models (see issue).

  4. Try a smaller model:
    If the above fails, try a 3B or 1.5B model. Even quantized, some 7B models may not fit on consumer 16GB GPUs with vLLM (see analysis).

  5. Restart your environment:
    Fully restart your Python environment and shell to clear any zombie processes or memory leaks.

If none of these work, your GPU may simply not have enough VRAM for this model with vLLM’s memory requirements, even in quantized form.
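
As a rough sanity check, here is a back-of-envelope estimate of where the memory goes. The model geometry below (layer count, KV heads, head size) and the weight-size figure are illustrative assumptions for a Qwen2.5-7B-class AWQ model, not values read from the actual config or reported by vLLM:

```python
# Back-of-envelope VRAM estimate for a ~7B AWQ model on a 16 GiB card.
# All numbers are illustrative assumptions, not values reported by vLLM.

GIB = 1024**3

def kv_cache_bytes(layers, kv_heads, head_dim, tokens, dtype_bytes=2):
    # K and V tensors per layer, fp16 entries:
    # 2 (K and V) * layers * kv_heads * head_dim * tokens * dtype_bytes
    return 2 * layers * kv_heads * head_dim * tokens * dtype_bytes

# ~4-bit quantized weights (~0.5 byte/param) plus loader/scale overhead.
weights = 7e9 * 0.5 + 0.5 * GIB

# KV cache for a single 2048-token sequence (vLLM pre-reserves room for many).
kv_one_seq = kv_cache_bytes(layers=28, kv_heads=4, head_dim=128, tokens=2048)

budget = 0.5 * 16 * GIB  # gpu_memory_utilization=0.5 on a 16 GiB card

print(f"weights     ~ {weights / GIB:.2f} GiB")
print(f"KV, 1 x 2k  ~ {kv_one_seq / GIB:.2f} GiB")
print(f"vLLM budget ~ {budget / GIB:.2f} GiB")
```

The point of the sketch: with gpu_memory_utilization=0.5 the budget is only about 8 GiB, so after roughly 4 GiB of weights little is left for the KV cache, activations, CUDA graphs, and the CUDA context. Counterintuitively, raising the utilization (e.g. to 0.85) can help here, provided nothing else is using the GPU.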

Would you like a minimal working example for a smaller model?

Hint: Mention @RunLLM in the post for followups.

@RunLLM How do I terminate the vllm server?

To terminate the vLLM server, press Ctrl+C in the terminal where it is running. If the process does not stop, find its process ID with ps (or from nvidia-smi) and, as a last resort, kill it with kill -9 <pid>. There is currently no official shutdown API or command in vLLM.
See: How to shutdown vllm server, How to terminate vllm completely?
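
To expand on this, a typical sequence looks like the sketch below (assuming the server was started as a foreground `vllm serve` process; process names on your machine may differ):

```shell
# First choice: Ctrl+C in the terminal running `vllm serve`.

# Otherwise, list matching processes (the engine spawns worker
# processes such as EngineCore_DP0):
pgrep -af vllm

# Ask them to exit cleanly:
pkill -f "vllm serve"

# Verify the GPU memory was actually released:
nvidia-smi

# Last resort only, if memory is still held:
pkill -9 -f "vllm serve"
```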

Would you like more detail or example commands?

@RunLLM Thanks. Done. It is ironic: although vLLM is known for optimizing GPU usage when running AI models, it can't run the same model that Ollama runs on the same GPU. Note, the Ollama server was already stopped and the GPU had 98% of its memory available. This suggests to me that one of the weaknesses of vLLM is that it is resource-heavy up front.

You are correct: vLLM aggressively pre-allocates GPU memory for model weights and KV cache to maximize throughput and batching, which often results in higher memory requirements than Ollama, especially for consumer GPUs. Ollama uses more aggressive quantization and streaming, allowing it to run larger models in less memory, but at the cost of lower throughput and sometimes slower response (see discussion, conserving memory docs).

This is a known tradeoff: vLLM is optimized for high concurrency and server workloads, not minimal memory usage. For edge or low-memory scenarios, Ollama or similar tools may be more suitable.

@RunLLM What kind of GPUs is suited to run vllm? Would having 2 RTX 5060-Ti 16GB GPUs resolve the current limitation?

vLLM is best suited for high-memory, data center GPUs such as NVIDIA A100, H100, H200, B200, or enterprise RTX cards with 24GB+ VRAM. Consumer GPUs with 16GB VRAM (like RTX 5060-Ti) are often insufficient for 7B+ models, even with quantization, due to vLLM’s memory allocation strategy (conserving memory guide, hardware recommendations).

Adding a second RTX 5060-Ti (16GB) and using --tensor-parallel-size 2 can help by splitting the model across both GPUs, effectively doubling available memory for model weights and cache. However, this only works if the model supports tensor parallelism and both GPUs are available and compatible (parallelism docs). Some users have reported issues with tensor parallelism on consumer cards, so success may vary.
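
If you do add a second card, the launch would look something like this sketch (model name and flags are illustrative; actual headroom and NCCL/P2P support on consumer cards may vary):

```shell
# Hypothetical two-GPU launch: splits weights and KV cache across both cards.
vllm serve Qwen/Qwen2.5-7B-Instruct-AWQ \
    --tensor-parallel-size 2 \
    --max-model-len 8192 \
    --port 8000
```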

Would you like a step-by-step guide for multi-GPU setup?
