vLLM cannot connect to existing Ray cluster

I’ve been attempting to connect a vLLM engine (as part of KubeAI) to a Ray cluster (deployed by KubeRay) and have not had much success. For some reason the node_ip_address.json file it expects is never generated. I can confirm that if I run ray status in the vLLM engine pod I see exactly the same output as in the Ray cluster head pod, so vLLM is able to communicate with Ray. These are the logs from vLLM.

2025-04-30 17:31:15,749	INFO worker.py:1514 -- Using address ray-cluster-kuberay-head-svc.kuberay.svc.cluster.local:6379 set in the environment variable RAY_ADDRESS
2025-04-30 17:31:15,749	INFO worker.py:1654 -- Connecting to existing Ray cluster at address: ray-cluster-kuberay-head-svc.kuberay.svc.cluster.local:6379...
2025-04-30 17:31:16,766	INFO node.py:1084 -- Can't find a `node_ip_address.json` file from /tmp/ray/session_2025-04-29_22-14-32_731655_1. Have you started Ray instance using `ray start` or `ray.init`?
2025-04-30 17:31:26,771	INFO node.py:1084 -- Can't find a `node_ip_address.json` file from /tmp/ray/session_2025-04-29_22-14-32_731655_1. Have you started Ray instance using `ray start` or `ray.init`?

Executing a health check from the vLLM engine pod returns an exit code of 0, so the Ray cluster health is allegedly OK:

ray health-check --address ray-cluster-kuberay-head-svc.kuberay.svc.cluster.local:6379

Has anyone seen the same behaviour before and still managed to connect vLLM to an external Ray cluster?

Engine Config:

  args:
  - --dtype=bfloat16
  - --tensor-parallel-size=2
  - --pipeline-parallel-size=2
  - --no-enable-prefix-caching
  - --gpu-memory-utilization=0.95
  - --distributed-executor-backend=ray
  - --max-model-len=65536
  engine: VLLM
  env:
    RAY_ADDRESS: ray-cluster-kuberay-head-svc.kuberay.svc.cluster.local:6379

Versions:

  • vLLM - 0.8.5, 0.8.2
  • Ray - 2.43.0-py312

Platform:

  • AKS (v1.30.9)

Stack Trace:

Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 400, in run_engine_core
    raise e
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 387, in run_engine_core
    engine_core = EngineCoreProc(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 329, in __init__
    super().__init__(vllm_config, executor_class, log_stats,
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 64, in __init__
    self.model_executor = executor_class(vllm_config)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 286, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/executor_base.py", line 52, in __init__
    self._init_executor()
  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/ray_distributed_executor.py", line 105, in _init_executor
    initialize_ray_cluster(self.parallel_config)
  File "/usr/local/lib/python3.12/dist-packages/vllm/executor/ray_utils.py", line 299, in initialize_ray_cluster
    ray.init(address=ray_address)
  File "/usr/local/lib/python3.12/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 1797, in init
    _global_node = ray._private.node.Node(
                   ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/ray/_private/node.py", line 204, in __init__
    node_ip_address = self._wait_and_get_for_node_address()
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/ray/_private/node.py", line 1091, in _wait_and_get_for_node_address
    raise ValueError(
INFO 04-30 18:19:21 [ray_distributed_executor.py:127] Shutting down Ray distributed executor. If you see error log from logging.cc regarding SIGTERM received, please ignore because this is the expected termination process in Ray.
ValueError: Can't find a `node_ip_address.json` file from /tmp/ray/session_2025-04-29_22-14-32_731655_1. for 60 seconds. A ray instance hasn't started. Did you do `ray start` or `ray.init` on this host?
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1130, in <module>
    uvloop.run(run_server(args))
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 1078, in run_server
    async with build_async_engine_client(args) as engine_client:
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 178, in build_async_engine_client_from_engine_args
    async_llm = AsyncLLM.from_vllm_config(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 150, in from_vllm_config
    return cls(
           ^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 118, in __init__
    self.engine_core = core_client_class(
                       ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 642, in __init__
    super().__init__(
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 398, in __init__
    self._wait_for_engine_startup()
  File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 430, in _wait_for_engine_startup
    raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above.

This error occurs because vLLM expects to find a node_ip_address.json file, which is created when a Ray node is started with ray start or ray.init in “local” mode, not when connecting as a Ray client to an external cluster. In Kubernetes (KubeRay) setups, vLLM must run inside a pod that is itself a Ray node (i.e., started with ray start), not just as a client. If vLLM is running in a pod that is not a Ray node, it cannot find the required file and fails to initialize the engine, even though ray status works because it only queries the cluster, not the local node state. This is a known limitation and is discussed in several issues and the official docs (distributed serving docs, troubleshooting, issue #14349).
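
A quick way to see that distinction from inside the vLLM engine pod (a hedged sketch; it assumes Ray's default /tmp/ray session directory and the RAY_ADDRESS from your config):

  # `ray status` only talks to the remote GCS, so it can succeed even when no
  # Ray node has been started locally in this pod.
  ray status --address "$RAY_ADDRESS"

  # vLLM's ray.init(address=...) additionally needs a local Ray node, whose
  # session directory contains node_ip_address.json. If nothing matches here,
  # this pod is only a client, and the error above is expected.
  ls /tmp/ray/session_*/node_ip_address.json 2>/dev/null || echo "no local Ray node"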

Solution: You must ensure that the vLLM engine pod is also a Ray node (i.e., it runs ray start before launching vLLM). This is typically done by using the same container image and entrypoint as the Ray worker pods, or by using orchestration frameworks like LWS or the provided run_cluster.sh script, which start Ray and vLLM together in the same pod/container. See the official distributed serving guide for the recommended setup. Would you like a step-by-step example for Kubernetes?
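
In the meantime, a minimal sketch of that pattern (assuming RAY_ADDRESS is set as in the engine config above and the image's Ray and Python versions match the cluster; the exact wiring into KubeAI will differ):

  #!/bin/bash
  # Illustrative entrypoint: join the existing Ray cluster first, then launch
  # the vLLM OpenAI-compatible server. `ray start` writes node_ip_address.json
  # under /tmp/ray/session_*/, which is exactly what the error above was missing.
  set -euo pipefail

  ray start --address="${RAY_ADDRESS}"

  # Hand off to vLLM; the engine args shown earlier would normally be supplied
  # by KubeAI.
  exec python3 -m vllm.entrypoints.openai.api_server "$@"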

Hey @grumpydude (and everybody else!), if @RunLLM bot’s answer on your issue/question post was good (and/or bad), could you do us humans a favor and write a quick follow-up?

Just something short, anywhere in the range from “RunLLM bot’s info was spot-on; I am good after doing such & such” to “RunLLM had some good ideas that helped, but also some bad ideas, e.g. it suggested idea C, but C is deprecated.”

Once you’ve (or we’ve) chased down each of RunLLM’s suggestions, you/we could write something here to save everybody else the trouble, given how some of RunLLM bot’s ideas seem, at least in some cases, to be outdated.

Or maybe RunLLM bot could have a response-writing iteration where it writes the response it’s going to post, but then consumes its own response in a mode of “is the information here correct, accurate, current (not deprecated), version-specific, etc.?”, annotating the information accordingly and, in the case of deprecation, perhaps noting “this is the old way but it’s deprecated; here is the new way…”

Just a thought – I have no clue how complicated that might get, or whether the added complications would even be worth it. (And perhaps RunBot already does something like this, meaning it’s extremely complicated to do, given how deprecated info can appear in RunBot’s posts – not this post’s response, I think, but sometimes RunBot does provide deprecated info, IIUC.)

It seems likely that your suggestion is about improving bot responses by having the bot self-review for accuracy, deprecation, and version-specific details, and by encouraging users to provide feedback on the bot’s answers. My understanding is that vLLM’s documentation and issue tracker do not currently describe such a self-reflective or feedback-driven iteration for bot responses, nor do they mention a built-in mechanism for bots to flag deprecated info or prompt users for follow-up (contributing guidelines, deprecation policy).

Would you like more detail on how vLLM currently handles documentation updates, deprecation notices, or user feedback in its community and codebase?

Hi RunBot, if you could eventually do a self-review while writing responses, that would be very cool, but I was more asking the other humans here to let us know the outcomes if & when they review your responses.

You know, maybe there could be a DeprecationBot that runs on each post/comment (from RunBot and humans), notating any deprecated features it notices and providing the current version, or, when there is no current version, stating how/what to do instead, and/or perhaps why some deprecated feature(s) just don’t work and why the non-workable deprecated feature(s) got dropped.

@RunLLM, please generate a “DeprecationBot” that reads each post/comment here on discuss.vllm.ai looking for suggestions of features/approaches that are deprecated. Upon finding any such deprecated feature/approach, it would add a comment (or an “Edit” to the existing comment, notated as its own edit, not the author’s) that discloses the deprecation and suggests alternatives, such as new features or entirely different approaches. DeprecationBot would run here on discuss.vllm.ai the same way you do: after you write a comment, but also after any human writes a comment. In the event that no deprecated features or approaches are detected, maybe DeprecationBot could add a note stating that “everything here looks kosher as of [today-date].”

Also, upon deprecation of features and/or evolution from old approaches to new and improved ones, DeprecationBot could revisit old posts and comments that touch on the (later-deprecated) feature or (later-improved) approach, thus perpetually updating information here on discuss.vllm.ai and preventing humans from finding “today’s” information via some search engine 6 months, a year, 5 years, 10+ years later and then needlessly spending time trying old ideas and/or deprecated features.

OTOH, that sort of (technically speaking) fruitless endeavor can be a good learning experience for the human, even if it’s also frustrating. I guess we would need to weigh the educational benefit of leaving deprecated/outdated info as-is against notating deprecated features/outdated ideas, since it can be smart to let folks know about the old way so they don’t ONLY know the new way and then start thinking “hey, maybe this [the old way] would work better” and go down a rabbit hole that has only a bear trap for their foot at the bottom.

There is currently no DeprecationBot for vLLM that automatically scans posts or comments for deprecated features and annotates them with up-to-date alternatives or warnings. The vLLM project does have a formal deprecation policy and strives to document deprecated features in release notes and documentation, but automated, real-time annotation of community discussions is not implemented or described in the official docs.

Your idea would require significant infrastructure and ongoing maintenance to track deprecations, scan discussions, and update old posts. For now, vLLM relies on documentation, release notes, and user/community feedback to communicate deprecations (deprecation policy, contributing guidelines). Would you like more detail on how deprecations are currently communicated or how you might propose such a bot to the vLLM team?

Thanks @RunBot! I will read those

You’re welcome! If you have any more questions after reading the docs or need clarification on deprecation or contribution processes, feel free to ask.

Thanks @chris and @RunLLM. One of the things that I deliberately withheld was that we are using KubeAI; however, the models use vllm-openai as the engine. I was aware that the problems we were facing weren’t caused by KubeAI, but it has definitely affected the way we are able to address them.

What I’ve discovered during this process is that KubeAI is highly opinionated, and as such, getting things to the point where the model (vLLM) pods join the Ray cluster has required a bit of lateral thinking.

Alongside KubeAI we also use KubeRay to stand up the Ray cluster. The problem here is that the Ray installation in the stock vllm/vllm-openai:v0.8.5 image differs from the one in the stock rayproject/ray:2.43.0 image, and that leads to a version-mismatch error. It’s the same version of Ray, but the vLLM image ships Python 3.12.9 whereas the Ray image ships 3.12.10, and that difference alone is enough to break things as far as Ray is concerned.
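
If you want to verify the mismatch yourself, something like this works against the stock images (a sketch; the tags are the ones mentioned above, and the entrypoint override is only needed for the vLLM image):

  # Print the Python and Ray versions baked into each image. In our experience
  # Ray treats any difference, even at the patch level, as a mismatch.
  docker run --rm --entrypoint "" vllm/vllm-openai:v0.8.5 \
    python3 -c 'import sys, ray; print(sys.version.split()[0], ray.__version__)'
  docker run --rm rayproject/ray:2.43.0-py312 \
    python -c 'import sys, ray; print(sys.version.split()[0], ray.__version__)'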

So that led me down the path of building a custom vllm-openai image (from the upstream v0.8.5) that includes a custom entrypoint doing a number of things (sketched below):

  • is copied into /usr/local/bin so that it masquerades as python3
  • discovers and sets VLLM_HOST_IP
  • starts ray with the address set to the value of RAY_ADDRESS.
  • starts python3 with the entrypoint parameters

We’ve also seen that pyarrow is missing when using Ray’s compiled DAG, so I’ve added it here. On the vLLM engine side, ray[client,data] had to be installed; on the Ray cluster side, ray[default,data] had to be installed. So yeah: one client-side image that KubeAI uses for its model engine, and one server-side image that KubeRay uses to deploy its clusters.
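
For anyone who wants the gist before I write it up properly, here is a rough sketch of that wrapper. The real script has more error handling, and the interpreter path, IP discovery, and flags are illustrative of my setup rather than a drop-in solution; the pyarrow and ray[...] extras mentioned above are baked into the respective images at build time, not installed here.

  #!/bin/bash
  # Copied to /usr/local/bin/python3 so it masquerades as the interpreter that
  # KubeAI invokes for the engine.
  set -euo pipefail

  # Discover this pod's IP and expose it so Ray and vLLM agree on the interface.
  VLLM_HOST_IP="$(hostname -i | awk '{print $1}')"
  export VLLM_HOST_IP

  # Join the existing KubeRay cluster before vLLM's Ray executor initializes.
  ray start --address="${RAY_ADDRESS}" --node-ip-address="${VLLM_HOST_IP}"

  # Hand off to the real interpreter (assumed at /usr/bin/python3) with whatever
  # arguments KubeAI passed (e.g. -m vllm.entrypoints.openai.api_server ...).
  exec /usr/bin/python3 "$@"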

With these in place, I’ve managed to get the “models” joining the Ray cluster and distributed inference working.

Given this really isn’t the place for a big, long-winded KubeAI solution, I’ve not documented everything. If you also use KubeAI, KubeRay and vLLM, feel free to ping me. I’ll provide further updates tomorrow.

You are correct that Ray requires exact version matches—including the Python patch version—across all nodes, and mismatches (even 3.12.9 vs 3.12.10) will cause failures (Ray docs). Building a custom vllm-openai image with the same Ray and Python versions as your Ray cluster is the recommended approach (vLLM Docker docs). Including missing dependencies like pyarrow is also necessary for Ray’s compiled DAG features (vLLM source).

Would you like a step-by-step example Dockerfile or more detail on best practices for aligning Ray and Python versions in vLLM+KubeAI+KubeRay deployments?

So I read some of that doc. Yeah, I won’t be the person adding support for that, lol. I’m currently way too dumb for that depth. If I tried, the result would be like that Xbox-controller-driven mini-sub that went to stay with the Titanic.

Thanks for the follow-up! It gave me even more to learn. Or at least more to look at and then try to learn later.

Glad to hear it was helpful! If you have more questions as you dig deeper into vLLM or run into specific issues, just ask—I’m here to help you learn at your own pace.

I wasn’t talking to you, RunBot. See how my thanks was a reply to grumpydude? That shows at the top right of the comment block. But I do also thank you, RunBot, for all your great information!

Thank you for clarifying, and I appreciate your kind words! If you ever need more information or help with vLLM, just let me know.
