How to run multiple models?
How to define the other models in the vllm server command?
vLLM does not support serving multiple models in a single server process, and there is no command-line option to define more than one model in the vLLM server command; this is confirmed in the official documentation and GitHub issues. To run multiple models, launch a separate vLLM server instance for each one, each on a different port and (optionally) a different GPU, and put an external router or load balancer (e.g., Nginx) in front to direct each request to the appropriate server based on the model requested.
For example, to serve two models, you would run:
CUDA_VISIBLE_DEVICES=0 vllm serve modelA --port 8000
CUDA_VISIBLE_DEVICES=1 vllm serve modelB --port 8001
Then, use a load balancer to route requests to the correct port. For more details and Nginx configuration, see the linked sources. Would you like more detail on setting up the load balancer or routing logic?
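The routing step can be sketched in Python as a lookup from the requested model name to the server that hosts it. This is a minimal sketch and not part of vLLM: the `BACKENDS` table and the `resolve_backend` helper are hypothetical names, and the hosts and ports simply mirror the two example commands. It assumes the OpenAI-style request body carries a `model` field.

```python
import json

# Hypothetical routing table: model name -> base URL of the vLLM server hosting it.
# Hosts and ports mirror the two example commands; adjust them to your deployment.
BACKENDS = {
    "modelA": "http://localhost:8000",
    "modelB": "http://localhost:8001",
}


def resolve_backend(request_body: str) -> str:
    """Return the base URL of the server that should handle this request."""
    payload = json.loads(request_body)
    model = payload.get("model")
    if model not in BACKENDS:
        # Unknown or missing model name: refuse rather than guess a backend.
        raise KeyError(f"no backend registered for model {model!r}")
    return BACKENDS[model]


if __name__ == "__main__":
    body = json.dumps({"model": "modelB", "prompt": "Hello"})
    print(resolve_backend(body))
```

A real router would then forward the unchanged request body to the returned base URL; a reverse proxy such as Nginx can do the same job without custom code.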
Sources:
- vLLM FAQ
- GitHub Issue: How to deploy multiple models in openai api server and specify different gpu for each model?
- GitHub Issue: How to serve embedding model and LLM at the same time
- GitHub Issue: Can vllm serving clients by using multiple model instances?
- Nginx Load Balancer Example
- GitHub Issue: Multiple Models on Same Port
With tensor parallelism, when running multiple vLLM servers that each serve a different model, do I have to set the memory usage for each one, e.g. 50%?
Yes. When multiple vLLM servers (each serving a different model) share the same GPU(s), with or without tensor parallelism, set --gpu-memory-utilization for each instance so that the values on any one GPU add up to no more than 1.0 (100%). For example, with two servers you might set --gpu-memory-utilization=0.5 for each. The setting is a per-instance fraction of GPU memory: it caps what that server will claim and helps prevent out-of-memory errors when a second server starts on the same device. This is confirmed in the documentation and issues.
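A quick sanity check of the planned values can be sketched as follows. The `check_gpu_budget` helper and its `headroom` default are hypothetical and only illustrate the arithmetic: the fractions assigned to instances sharing one GPU are summed and compared against 1.0 minus a safety margin.

```python
def check_gpu_budget(fractions, headroom=0.1):
    """Return True if the instances sharing one GPU fit within the budget.

    fractions: the --gpu-memory-utilization value of every instance on that GPU.
    headroom:  assumed safety margin (illustrative default), as a fraction
               of total GPU memory.
    """
    for f in fractions:
        if not 0.0 < f <= 1.0:
            raise ValueError(f"utilization must be in (0, 1], got {f}")
    # The sum of all per-instance fractions must leave the margin free.
    return sum(fractions) <= 1.0 - headroom


if __name__ == "__main__":
    print(check_gpu_budget([0.4, 0.4]))
    print(check_gpu_budget([0.5, 0.5]))
```

With the default margin, two instances at 0.4 pass the check while two at 0.5 do not; passing `headroom=0.0` reproduces the plain "sum must not exceed 1.0" rule.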
Would you like more detail on how to calculate or set these values in practice?
Sources:
- vLLM Engine Args: --gpu-memory-utilization
- GitHub Issue: Engine V1 When loading two models into the same GPU the second model requires more memory allocation than the first
- GitHub Issue: Have you considered sharing KV cache of a single GPU across multiple models?
- GitHub Issue: Support multiple models per GPU
Yes, how do I calculate the memory?
There is no exact formula to calculate the required --gpu-memory-utilization for each vLLM instance, because vLLM profiles memory usage empirically at startup, accounting for model weights, KV cache, CUDA graphs, and other runtime allocations. The recommended approach is to estimate the sum of model weights and expected KV cache, then add a margin for overhead, but the only reliable way is to start vLLM with a conservative value (e.g., 0.4 or 0.5), check for OOM errors, and adjust as needed. The vLLM team notes that memory usage can exceed the set limit due to fragmentation, CUDA graphs, and other factors, so always leave some headroom and avoid setting the sum of all instances to exactly 1.0.
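The estimate described above (weights plus expected KV cache plus a margin) can be sketched as a back-of-the-envelope calculation. This is not a vLLM API: `estimate_utilization` and all of its parameters are hypothetical, the KV cache and overhead figures are guesses you supply, and the sketch assumes weights are split evenly across tensor-parallel ranks. Treat the result as a starting value to tune, not a guarantee.

```python
def estimate_utilization(params_billion, bytes_per_param, kv_cache_gb,
                         overhead_gb, gpu_total_gb, tensor_parallel_size=1):
    """Rough per-GPU starting value for --gpu-memory-utilization.

    params_billion:  model size in billions of parameters.
    bytes_per_param: 2 for fp16/bf16, 1 for 8-bit weights, and so on.
    kv_cache_gb:     KV cache you want available on each GPU (your guess).
    overhead_gb:     margin for CUDA graphs, fragmentation, etc. (your guess).
    gpu_total_gb:    total memory of one GPU.
    """
    if gpu_total_gb <= 0 or tensor_parallel_size < 1:
        raise ValueError("gpu_total_gb must be > 0 and tensor_parallel_size >= 1")
    # 1e9 parameters * bytes per parameter = decimal GB; assumed to be
    # sharded evenly across the tensor-parallel ranks.
    weights_gb = params_billion * bytes_per_param / tensor_parallel_size
    needed_gb = weights_gb + kv_cache_gb + overhead_gb
    return needed_gb / gpu_total_gb


if __name__ == "__main__":
    # Illustrative numbers only: a 7B model in 16-bit precision on an 80 GB GPU.
    print(estimate_utilization(7, 2, kv_cache_gb=4, overhead_gb=2, gpu_total_gb=80))
```

If the result for all instances on a GPU adds up to more than about 1.0, the models will not fit together at those KV cache sizes, and one of the inputs has to shrink.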
Would you like a step-by-step example of how to estimate and tune this value for your setup?
I would like to know why the /v1/models endpoint is listed among the API endpoints if it’s not possible to serve multiple models.
The /v1/models endpoint exists for compatibility with the OpenAI API, which includes a model-listing endpoint. In vLLM, it returns the single model (or the base model plus any LoRA adapters) currently served by that instance, not multiple independent models. Its presence does not indicate support for serving multiple models in one server process; this is confirmed in the documentation and issues.
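To see what a given instance reports, the response can be read with a few lines of Python. This is a sketch under assumptions: `model_ids` and `fetch_model_ids` are hypothetical helper names, the `SAMPLE` payload is an illustrative stand-in that follows the OpenAI list format (an `object` of `list` with a `data` array of entries carrying an `id`), and a real vLLM response will contain additional fields.

```python
import json
from urllib.request import urlopen


def model_ids(models_response: dict) -> list:
    """Extract the model names from an OpenAI-style /v1/models response."""
    return [entry["id"] for entry in models_response.get("data", [])]


def fetch_model_ids(base_url: str) -> list:
    """Query a running server, e.g. fetch_model_ids("http://localhost:8000")."""
    with urlopen(f"{base_url}/v1/models") as resp:
        return model_ids(json.load(resp))


# Illustrative payload only; not captured from a real server.
SAMPLE = {"object": "list", "data": [{"id": "modelA", "object": "model"}]}

if __name__ == "__main__":
    print(model_ids(SAMPLE))
```

Calling `fetch_model_ids` against each port of a multi-server setup is one way to build the model-to-server table a router needs.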
Would you like more detail on how this endpoint behaves with LoRA adapters or custom model names?