Run multiple models

youlearnit · July 25, 2025, 11:55pm

How do I run multiple models?
How do I define the other models in the vLLM server command?

RunLLM · July 25, 2025, 11:55pm

vLLM does not support serving multiple models in a single server process. To run multiple models, launch a separate vLLM server instance for each one, each on its own port and (optionally) its own GPU, and put an external router or load balancer (e.g., Nginx) in front to direct each request to the right server based on the requested model. There is no command-line option for defining multiple models in one vllm serve command; this is confirmed in the official documentation and issues.

For example, to serve two models, you would run:

CUDA_VISIBLE_DEVICES=0 vllm serve modelA --port 8000
CUDA_VISIBLE_DEVICES=1 vllm serve modelB --port 8001

Then, use a load balancer to route requests to the correct port. For more details and Nginx configuration, see the linked sources. Would you like more detail on setting up the load balancer or routing logic?
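
The routing step described above can be sketched as a small lookup from model name to backend. This is a minimal illustration, not a vLLM feature: modelA, modelB, and ports 8000 and 8001 are the placeholders from the commands above, while localhost and the helper name backend_url are assumptions made for the example. In production this logic would more likely live in a reverse proxy such as Nginx.

```python
# Hypothetical routing table: one vLLM server per model,
# using the placeholder names and ports from the example commands.
BACKENDS = {
    "modelA": "http://localhost:8000",
    "modelB": "http://localhost:8001",
}


def backend_url(request_body: dict, path: str = "/v1/chat/completions") -> str:
    """Return the URL of the vLLM instance that serves the requested model."""
    model = request_body.get("model")
    if model not in BACKENDS:
        raise ValueError(f"no backend configured for model {model!r}")
    return BACKENDS[model] + path


# The "model" field of an OpenAI-style request body selects the backend.
print(backend_url({"model": "modelB", "messages": []}))
```

A router built this way forwards the unchanged request body to the returned URL, so clients only need to know one address and the model name.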

youlearnit · July 25, 2025, 11:58pm

When using tensor parallelism and running multiple vLLM servers that each serve a different model, do I have to set the memory usage for each one, e.g., to 50%?

RunLLM · July 25, 2025, 11:58pm

Yes. When multiple vLLM servers (each serving a different model) share the same GPU(s), with or without tensor parallelism, set the --gpu-memory-utilization parameter on each instance so that the values sum to no more than 1.0 (100%). For example, with two servers you might set --gpu-memory-utilization=0.5 on each. The setting applies per instance: it limits each server to its allocated fraction of GPU memory, which helps prevent out-of-memory errors and keeps multi-model, multi-server deployments stable. This is confirmed in the documentation and issues.
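
As a rough sketch of the setup above, the snippet below builds one vllm serve command per model and refuses to proceed if the memory fractions add up to more than a chosen budget. The helper serve_commands, the 0.95 budget, the 0.45 fractions, and the model names are assumptions for illustration; the flags --port, --tensor-parallel-size, and --gpu-memory-utilization are existing vLLM options.

```python
import shlex
from typing import Dict, List

# Assumed budget: keep the summed fractions below 1.0 to leave headroom.
MAX_TOTAL_FRACTION = 0.95


def serve_commands(models: Dict[str, float], tensor_parallel_size: int,
                   first_port: int = 8000) -> List[str]:
    """Build one `vllm serve` command line per model sharing the same GPUs."""
    total = sum(models.values())
    if total > MAX_TOTAL_FRACTION:
        raise ValueError(f"fractions sum to {total:.2f}, above {MAX_TOTAL_FRACTION}")
    commands = []
    for offset, (name, fraction) in enumerate(models.items()):
        argv = [
            "vllm", "serve", name,
            "--port", str(first_port + offset),
            "--tensor-parallel-size", str(tensor_parallel_size),
            "--gpu-memory-utilization", str(fraction),
        ]
        commands.append(shlex.join(argv))
    return commands


# Two placeholder models sharing two GPUs, each capped at 45% of GPU memory.
for command in serve_commands({"modelA": 0.45, "modelB": 0.45}, tensor_parallel_size=2):
    print(command)
```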

Would you like more detail on how to calculate or set these values in practice?

youlearnit · July 26, 2025, 12:00am

Yes, how do I calculate the memory?

RunLLM · July 26, 2025, 12:00am

There is no exact formula for the --gpu-memory-utilization each vLLM instance needs, because vLLM profiles memory usage empirically at startup, accounting for model weights, KV cache, CUDA graphs, and other runtime allocations. The recommended approach is to estimate the size of the model weights plus the expected KV cache and add a margin for overhead. In practice, the only reliable method is to start vLLM with a conservative value (e.g., 0.4 or 0.5), check for OOM errors, and adjust as needed. The vLLM team notes that memory usage can exceed the set limit due to fragmentation, CUDA graphs, and other factors, so always leave some headroom and avoid letting the values for all instances sum to exactly 1.0.
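
The estimate described above (weights plus expected KV cache plus a margin) can be sketched as follows. This is a back-of-the-envelope calculation, not how vLLM computes anything internally. The function names, the 2 GiB overhead, the even split across tensor-parallel ranks, and the example model shape (7 billion parameters, 32 layers, 32 KV heads, head dimension 128, 16-bit values) are all assumptions, and the result is only a starting point for tuning.

```python
GIB = 1024 ** 3


def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    # Keys and values (factor 2) are stored for every layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def estimate_utilization(num_params: float, num_layers: int, num_kv_heads: int,
                         head_dim: int, cached_tokens: int, gpu_memory_bytes: int,
                         tensor_parallel_size: int = 1, dtype_bytes: int = 2,
                         overhead_bytes: int = 2 * GIB) -> float:
    """Rough per-GPU memory fraction: weights + KV cache + fixed overhead."""
    weights = num_params * dtype_bytes
    kv_cache = cached_tokens * kv_cache_bytes_per_token(
        num_layers, num_kv_heads, head_dim, dtype_bytes)
    # Assumption: weights and KV cache shard evenly across tensor-parallel
    # ranks, while the overhead margin is paid on every GPU.
    per_gpu = (weights + kv_cache) / tensor_parallel_size + overhead_bytes
    return per_gpu / gpu_memory_bytes


# Hypothetical 7B-parameter model on one 80 GiB GPU, caching 32768 tokens in total.
fraction = estimate_utilization(7e9, 32, 32, 128, cached_tokens=32768,
                                gpu_memory_bytes=80 * GIB)
print(f"suggested starting value: {fraction:.2f}")
```

Because fragmentation and CUDA graphs are not modeled, round the result up and verify it by starting the server and watching for OOM errors.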

Would you like a step-by-step example of how to estimate and tune this value for your setup?

KubedOut · July 28, 2025, 9:33am

I would like to know why the /v1/models endpoint is listed among the API endpoints if it’s not possible to serve multiple models.

RunLLM · July 28, 2025, 9:33am

The /v1/models endpoint exists to comply with the OpenAI API specification, which includes a model-listing endpoint. In vLLM, this endpoint returns the single model (or the base model plus any LoRA adapters) currently served by that instance, not multiple independent models. It does not indicate support for serving multiple models in one server process; this is confirmed in the documentation and issues.
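
To see what a given instance reports, the model list can be read with a few lines of standard-library Python. This is a minimal sketch: the helper names model_ids and fetch_model_ids are made up for the example, the sample payload is illustrative and only mimics the OpenAI-style list shape (real responses carry more fields per entry), and fetch_model_ids needs a running server to return anything.

```python
import json
from urllib.request import urlopen


def model_ids(payload: dict) -> list:
    """Extract model IDs from an OpenAI-style model list response."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_model_ids(base_url: str) -> list:
    """Query GET /v1/models on one server (requires a running instance)."""
    with urlopen(base_url + "/v1/models") as response:
        return model_ids(json.load(response))


# Illustrative payload only, shaped like an OpenAI-style list response.
sample = {"object": "list", "data": [{"id": "modelA", "object": "model"}]}
print(model_ids(sample))
```

With one server per model, calling fetch_model_ids against each port is also a simple way to build the model-to-backend mapping an external router needs.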

Would you like more detail on how this endpoint behaves with LoRA adapters or custom model names?
