Error deploying model with Ray

I have two H100s and four A100s. I am trying to deploy Qwen 2.5-7B, but I got this error that makes no sense to me:

(RayWorkerWrapper pid=950) ERROR 08-17 02:52:40 [worker_base.py:619] ValueError: Free memory on device (73.89/93.1 GiB) on startup is less than desired GPU memory utilization (0.9, 83.79 GiB). Decrease GPU memory utilization or reduce GPU memory used by other processes.
(pid=1282, ip=HEAD_NODE_IP) INFO 08-17 02:58:51 [__init__.py:235] Automatically detected platform cuda.
Traceback (most recent call last):
  File "/usr/local/bin/vllm", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 54, in main
    args.dispatch_function(args)
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 52, in cmd
    uvloop.run(run_server(args))
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1791, in run_server
    await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1811, in run_server_worker
    async with build_async_engine_client(args, client_config) as engine_client:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 158, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args
    async_llm = AsyncLLM.from_vllm_config(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 163, in from_vllm_config
    return cls(
           ^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 117, in __init__
    self.engine_core = EngineCoreClient.make_async_mp_client(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 98, in make_async_mp_client
    return AsyncMPClient(*client_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 677, in __init__
    super().__init__(
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 408, in __init__
    with launch_core_engines(vllm_config, executor_class,
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 697, in launch_core_engines
    wait_for_engine_startup(
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 750, in wait_for_engine_startup
    raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

My Ray status:


Active:
1 node_942272b0ead70054e8f1a2489cc736b6202c050f00ce4a6094da5b11
1 node_4e310ef633848ce78fa4008a0b8b4100d93d60355f6b1b099498018f
Pending:
(no pending nodes)
Recent failures:
(no failures)

Resources

Total Usage:
0.0/256.0 CPU
0.0/5.0 GPU
0B/1.58TiB memory
0B/19.46GiB object_store_memory

Total Constraints:
(no request_resources() constraints)
Total Demands:
(no resource demands)

The error "Free memory on device … is less than desired GPU memory utilization … Decrease GPU memory utilization or reduce GPU memory used by other processes" means that at startup vLLM checks whether the free memory on each GPU is at least the amount you requested via the --gpu-memory-utilization flag. In your case, only 73.89 GiB of the 93.1 GiB card is free, but --gpu-memory-utilization=0.9 requests 0.9 × 93.1 ≈ 83.79 GiB, so vLLM refuses to start. This is expected behavior: you must either lower --gpu-memory-utilization or make sure no other processes are holding GPU memory at startup. See this issue and the docs for details.
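To make that concrete, here is a minimal Python sketch of roughly what the startup check does. It is an illustration built on torch.cuda.mem_get_info, not vLLM's actual implementation, and the function name is made up for this example:

```python
import torch


def check_gpu_budget(gpu_memory_utilization: float = 0.9, device: int = 0) -> None:
    """Illustrative approximation of vLLM's startup free-memory check."""
    free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    requested_bytes = gpu_memory_utilization * total_bytes
    gib = 1024 ** 3
    if free_bytes < requested_bytes:
        raise ValueError(
            f"Free memory on device ({free_bytes / gib:.2f}/{total_bytes / gib:.1f} GiB) "
            f"is less than desired GPU memory utilization "
            f"({gpu_memory_utilization}, {requested_bytes / gib:.2f} GiB)."
        )
    print(f"OK: {free_bytes / gib:.2f} GiB free covers the requested {requested_bytes / gib:.2f} GiB.")


if __name__ == "__main__":
    check_gpu_budget(0.9)
```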

To resolve, set --gpu-memory-utilization to a value that fits within your available free memory (e.g., 0.75 if you have 73.89 GiB free on a 93.1 GiB card, since 73.89 / 93.1 ≈ 0.79), or stop the other processes that are using GPU memory. The budget is computed per vLLM instance from total GPU memory and does not subtract what other vLLM or non-vLLM processes already occupy, so you must coordinate memory usage across all jobs sharing a GPU. Would you like more detail on how to calculate the correct value or further troubleshooting steps?
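For a quick look at what is actually holding memory on each card, a sketch like the following can help. It assumes the nvidia-ml-py package (imported as pynvml) is installed, and the 0.02 safety margin is an arbitrary choice for illustration, not a vLLM requirement:

```python
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
gib = 1024 ** 3
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    # Largest --gpu-memory-utilization that still fits, minus a small safety margin.
    suggested = max(mem.free / mem.total - 0.02, 0.0)
    print(f"GPU {i}: {mem.free / gib:.2f} GiB free of {mem.total / gib:.2f} GiB "
          f"-> --gpu-memory-utilization <= {suggested:.2f}")
    # Processes currently holding memory on this GPU.
    for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
        used = (proc.usedGpuMemory or 0) / gib
        print(f"  pid {proc.pid} is using {used:.2f} GiB")
pynvml.nvmlShutdown()
```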


Hi, thanks for replying, but the thing is I have two H100s with 96 GB each and four A100s with 80 GB each,

and I am only deploying a 7B model.

Even for a 7B model, vLLM pre-allocates GPU memory based on the --gpu-memory-utilization parameter, which defaults to 0.9 (90% of total GPU memory). If other processes are using GPU memory and the free memory falls below that threshold, vLLM will refuse to start, even though the model itself is small. This is by design and not related to model size; see here and here.

To fix this, either reduce --gpu-memory-utilization to match your available free memory, or ensure no other processes are using GPU memory at startup. The check is strict and applies regardless of model size. Would you like more detail on how to set this parameter or check for background GPU usage?
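As an illustration, the same setting is exposed as gpu_memory_utilization in vLLM's Python API; with vllm serve, the equivalent is passing --gpu-memory-utilization 0.75 on the command line. In the sketch below, the model name and tensor_parallel_size are placeholders to adjust to your actual deployment:

```python
from vllm import LLM, SamplingParams

# Ask vLLM for 75% of each GPU instead of the default 90%,
# so the startup check passes with ~74 GiB free on a 93 GiB card.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",    # placeholder: use the checkpoint you deploy
    tensor_parallel_size=2,              # placeholder: match your parallelism setup
    distributed_executor_backend="ray",  # run workers on the Ray cluster
    gpu_memory_utilization=0.75,
)

print(llm.generate(["Hello"], SamplingParams(max_tokens=8))[0].outputs[0].text)
```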


So, for deploying vLLM in a Ray cluster, each GPU in the cluster must have enough free memory for the configured budget (i.e., gpu_memory_utilization * total GPU memory). Am I correct?

Yes, you are correct. For vLLM in a Ray cluster, each GPU must have at least gpu_memory_utilization * total GPU memory available at startup, regardless of the model size. If the available memory is less than this, vLLM will not start, even for small models. This is a strict per-instance check and does not account for the actual model size or other processes on the GPU. See docs and issue discussion.

Would you like more detail on how to calculate or monitor this in a Ray cluster?
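As a starting point, here is a minimal sketch (assuming Ray and PyTorch are installed on every node) that schedules one short-lived task per GPU in the cluster and reports whether each GPU currently has enough free memory for a given gpu_memory_utilization:

```python
import socket

import ray
import torch

GPU_MEMORY_UTILIZATION = 0.9  # the value you plan to pass to vLLM


@ray.remote(num_gpus=1)
def probe() -> str:
    # Ray pins this task to one GPU via CUDA_VISIBLE_DEVICES, so device 0 is "its" GPU.
    free, total = torch.cuda.mem_get_info(0)
    gib = 1024 ** 3
    required = GPU_MEMORY_UTILIZATION * total
    status = "OK" if free >= required else "NOT ENOUGH FREE MEMORY"
    return (f"{socket.gethostname()} / {torch.cuda.get_device_name(0)}: "
            f"{free / gib:.1f} GiB free, {required / gib:.1f} GiB required -> {status}")


ray.init(address="auto")  # connect to the running Ray cluster
num_gpus = int(ray.cluster_resources().get("GPU", 0))
for line in ray.get([probe.remote() for _ in range(num_gpus)]):
    print(line)
```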
