Offline multi-node inference

Hello everybody. I am trying to run inference locally on a JSON file containing 100 prompts with Llama 3.3 70B Instruct, using a GPU cluster where the compute nodes do not have internet access.
I initially tried using 1 node and 4 GPUs, but I got a CUDA out-of-memory error, so I tried running it with a second node (8 GPUs in total). That failed with a message saying pipeline parallelism is only supported through the AsyncLLMEngine and that I should use that instead. However, AsyncLLMEngine does not work with the chat() method, and as far as I understand it’s meant for online serving, i.e. continuously receiving inputs and running inference on them. Since my goal is not a real-time interactive inference server but just to run inference on a file containing a set of prompts, I wanted to ask how this could be achieved.
Thanks for your attention.

For a large model like Llama 3.3 70B Instruct spread across multiple nodes (using both tensor and pipeline parallelism), vLLM currently only supports pipeline parallelism in online serving mode via the AsyncLLMEngine, not with the standard LLM class used for offline batch inference. The LLM class only supports tensor parallelism for offline inference; pipeline parallelism (needed for very large models across nodes) is not available in this mode, as confirmed in this issue and the distributed serving docs.

If you need to run offline inference on a prompt file with pipeline parallelism, there is no officially supported method in vLLM as of now. The recommended workaround is to use tensor parallelism only (i.e., use as many GPUs as possible on a single node), or split your prompts and run multiple jobs on different nodes, each using tensor parallelism. Would you like more detail on how to set up such a workflow or alternative approaches?
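In the meantime, here is a minimal sketch of the tensor-parallel-only route for batch inference over a prompt file, assuming the model weights are already on local disk and the prompt file is a JSON list of strings. The file names, GPU count, and sampling settings below are placeholders, and if the full-precision 70B model does not fit on one node’s GPUs you would need a quantized checkpoint or larger GPUs. llm.chat() accepts a batch of conversations in recent vLLM versions; on older versions you may need to apply the chat template yourself and call generate().

```python
# Sketch only: offline batch inference with tensor parallelism on one node.
# File names, GPU count, and sampling settings are placeholders.
import json

from vllm import LLM, SamplingParams

# prompts.json is assumed to be a JSON list of plain prompt strings.
with open("prompts.json") as f:
    prompts = json.load(f)

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",  # must already be cached locally
    tensor_parallel_size=4,                     # GPUs available on this node
)
params = SamplingParams(temperature=0.7, max_tokens=512)

# Wrap each prompt as a single-turn conversation so the chat template is applied.
conversations = [[{"role": "user", "content": p}] for p in prompts]
outputs = llm.chat(conversations, params)

with open("outputs.json", "w") as f:
    json.dump([o.outputs[0].text for o in outputs], f, indent=2)
```

To make use of the second node as well, you could split prompts.json in half and submit the same script as two independent jobs, one per node.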

Hello! I might be confused here and/or writing something very dumb, but RunBot’s comment makes me think “online inference” and “offline inference” do not mean what I thought they meant! And because you mention your “compute nodes do not have internet access,” you might be thinking of online vs offline inference the same way I was.

I assumed online inference required you(r computer) to be on the internet because you were using some cloud service. I assumed offline inference meant you are serving your language model entirely “at home” on your own systems, not actively dependent upon anything in the cloud, i.e. everything required was previously downloaded/sourced/acquired and was on hand for servicing the task at hand.

RunBot, however, just made me think “online” means we are serving our own vLLM “at home” but we are serving it via its OpenAPI app layer, while “offline” means we are calling vLLM from Python, bypassing the TCP/IP OpenAPI app layer.

So even though your compute nodes don’t have internet access, you should be able to leverage the async llm engine via so-called “online” inference because “online inference” is really just us making our own vllm accessible via a TCP port.

I apologize if this is dumb / wrong / maybe-obvious / ignorant / irrelevant / mansplaining-ish, but I just realized online-vs-offline doesn’t mean what I thought it meant! (and I feel very :ewe:-ish about not comprendo-ing that till RunBot wrote its comment)

It seems like the async methods might be available without OpenAPI’s serving over a TCP port if the OpenAPI app source code were called directly, rather than being invoked via SYN/ACK/ACK and a subsequent message posted to a TCP port. Maybe. I haven’t seen any of the involved code, but the concept feels plausible (assuming the async engine methods aren’t accessible without OpenAPI’s involvement, which is what the vLLM architectural overview seems to suggest).

You are correct: in vLLM, “online inference” means using the OpenAI-compatible API server (i.e., serving the model over a local TCP port, not the internet), while “offline inference” means calling the LLM class directly from Python code, bypassing any HTTP server. Both modes can run entirely on your own hardware without internet access, as long as the model and dependencies are already downloaded. This is clarified in the architecture overview.

So, “online” in vLLM refers to the API server interface (even if it’s only accessible locally), and “offline” refers to direct Python usage. You do not need internet access for either, as long as all resources are local. Would you like more detail on how to use the API server or the LLM class in your environment?
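For example, the “online” path can run entirely on localhost with no internet access, as long as the model is already downloaded: start the OpenAI-compatible server on the node and point the standard OpenAI client at it. The model name and port below are placeholders.

```python
# Sketch only: "online" inference without internet access.
# Start the OpenAI-compatible server on the same machine first, e.g.:
#
#   python -m vllm.entrypoints.openai.api_server \
#       --model meta-llama/Llama-3.3-70B-Instruct \
#       --tensor-parallel-size 4 --port 8000
#
# (newer vLLM versions also provide the equivalent `vllm serve` command)
from openai import OpenAI

# "localhost" here is the node itself; no internet connection is involved.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Hello from an air-gapped node!"}],
)
print(resp.choices[0].message.content)
```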

Thanks, RunBot! Do you know if OpenApi (or is it OpenAI API?) code is available and licensed such that it can be invoked directly by python code (perhaps with a C or C++ bridge or whatever might be needed) without the need for serving on a localhost TCP port?

The vLLM OpenAI-compatible API is implemented as a Python server, but it is not designed to be invoked directly as a library; it is meant to be accessed via HTTP requests on a TCP port, even if only on localhost. There is no supported way to call the OpenAI-compatible API server’s internal methods directly from Python without going through the HTTP layer, as confirmed in the architecture docs and GitHub issues.

If you want direct Python access, you should use the LLM class, but note that some features (like chat templates) are only available via the API server. Would you like more detail on the differences or possible workarounds?
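If chat formatting is the main thing you would miss by skipping the API server, one common workaround (a sketch, not an official vLLM recipe; the model name is a placeholder) is to render the chat template yourself with the Hugging Face tokenizer and pass the result to generate():

```python
# Sketch only: chat-style prompting with the offline LLM class by rendering the
# chat template manually via the Hugging Face tokenizer.
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "meta-llama/Llama-3.3-70B-Instruct"  # assumed to be cached locally

tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = LLM(model=model_id, tensor_parallel_size=4)

messages = [{"role": "user", "content": "Explain pipeline parallelism in one sentence."}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

outputs = llm.generate([prompt], SamplingParams(max_tokens=256))
print(outputs[0].outputs[0].text)
```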

Hey RunBot, can you suggest any way to pretend to use TCP/IP without really having to use TCP/IP? Like, if TCP/IP is on your server, can you write to a stream at some operating-system path such that vLLM’s OpenAI API could be conned into believing the data is streaming over a real TCP port as required?

It might be possible to run vLLM’s OpenAI-compatible API server over a Unix domain socket (IPC) instead of a TCP port, but the official documentation and code only mention TCP/IP (localhost or network) for the API server interface, not Unix sockets or other stream types. There is no explicit support or documented method for using a file-based or Unix-socket stream as a drop-in replacement for TCP/IP in the vLLM OpenAI API server, as seen in the quickstart and API server docs.

Though I’m not entirely certain, modifying the server code to listen on a Unix domain socket might work, but this is not natively supported or documented. Would you like more detail on possible code-level workarounds or alternative approaches?
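At the framework level the pieces do exist: uvicorn (which vLLM’s API server runs on) can listen on a Unix domain socket, and httpx can speak HTTP over one. The toy FastAPI app below is only a generic illustration of that plumbing, not vLLM’s actual server, which initializes its engine during its own startup; the socket path and endpoint are made up.

```python
# Sketch only: HTTP over a Unix domain socket with uvicorn + httpx.
# This is a generic toy app, NOT vLLM's API server; vLLM would still need its
# engine wired up the way its own startup code does it.
import os
import threading
import time

import httpx
import uvicorn
from fastapi import FastAPI

SOCKET_PATH = "/tmp/toy_api.sock"  # hypothetical path

app = FastAPI()

@app.get("/ping")
def ping() -> dict:
    return {"ok": True}

if os.path.exists(SOCKET_PATH):
    os.unlink(SOCKET_PATH)  # remove a stale socket file from a previous run

# Serve on a Unix domain socket instead of a TCP host/port.
threading.Thread(
    target=lambda: uvicorn.run(app, uds=SOCKET_PATH, log_level="warning"),
    daemon=True,
).start()
time.sleep(1.0)  # crude wait for the server to bind the socket

# The client routes the request over the socket; the hostname in the URL is
# only used for the Host header, so no TCP connection is ever made.
transport = httpx.HTTPTransport(uds=SOCKET_PATH)
with httpx.Client(transport=transport) as client:
    print(client.get("http://localhost/ping").json())
```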

I noticed the vLLM docs online look to have shifted with respect to online and offline, perhaps as a result of this post. Either that, or I had entirely missed the clear explanation of what online vs offline means in the world of inference serving, and my post here was spectacularly offensive. If that is the case, I deeply apologize. I had an “oh! aha!” moment and in my excitement may have, effectively, mansplained when I meant only to share my “aha” moment.

No offense taken—your observation is valid and helpful! The distinction between “online” (API server) and “offline” (direct Python) inference in vLLM is often confusing, and your comments echo common points of confusion in the community. The vLLM team has clarified this in the architecture overview and related documentation, and user feedback like yours is valuable for improving docs.

If you have further questions or suggestions, feel free to share—they help make the documentation clearer for everyone.

Thanks @RunLLM. I was writing that to @smartinezai but I am very happy to hear you were not offended. I am also happy to hear I did not miss an explanation of how “online” and “offline” have specific meanings in LLM-land that differ in important ways from the common meaning of “online” and “offline.” I do now feel less :ewe:-ish about my having misunderstood.

“online” usually means “connected to the internet”

“offline” usually means “(allegedly) after the meeting (but really never, at least not if i can get out of here fast enough and avoid your pesky questions that are annoying me and everyone)”

but not here! :skull:

You’re absolutely right—“online” and “offline” in vLLM (and many LLM frameworks) refer to the interface used (API server vs. direct Python), not internet connectivity. This is a common source of confusion, and your realization is shared by many users. The vLLM docs now clarify this distinction.

If you have more questions or want clarification on any other vLLM concepts, feel free to ask!
