@RunLLM I want to know how to serve 2 models using Docker. I tried creating two different Docker Compose files for the 2 models and running them, but it didn't work. Could you help out with this?
vLLM does not support serving multiple models from a single API server or port. The recommended approach is to run a separate vLLM container for each model (each on its own port), then put a reverse proxy such as Nginx in front of them to route requests to the correct container based on the model or endpoint. This setup is documented in the official vLLM Nginx deployment guide and confirmed in the FAQ and GitHub issues: running multiple containers is the standard solution, not a limitation of Docker Compose itself.
For example, you can launch two vLLM containers (each on a different port), then configure Nginx as a load balancer or router in front of them. See the code snippet below for a basic Nginx config and Docker Compose approach. Would you like a detailed step-by-step guide or a sample docker-compose.yml and Nginx config?
```nginx
# Each upstream points at one vLLM container (Compose service name : port).
upstream model1 {
    server vllm0:8000;
}

upstream model2 {
    server vllm1:8000;
}

server {
    listen 80;

    # The trailing slash on proxy_pass strips the /model1/ prefix,
    # so /model1/v1/completions reaches vLLM as /v1/completions.
    location /model1/ {
        proxy_pass http://model1/;
    }

    location /model2/ {
        proxy_pass http://model2/;
    }
}
```
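As a sketch, a docker-compose.yml to pair with that Nginx config could look like the following. The service names match the upstreams above, while the image tag, model names, and cache path are illustrative assumptions you would replace for your setup:

```yaml
services:
  vllm0:
    image: vllm/vllm-openai:latest          # example tag; pin a version in practice
    command: ["--model", "org/model-1", "--port", "8000"]   # placeholder model name
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface       # reuse downloaded weights
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  vllm1:
    image: vllm/vllm-openai:latest
    command: ["--model", "org/model-2", "--port", "8000"]   # placeholder model name
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/conf.d/default.conf:ro      # the Nginx config above
    depends_on:
      - vllm0
      - vllm1
```

With this running, a request to `http://localhost/model1/v1/...` would reach the first container and `http://localhost/model2/v1/...` the second, assuming the `proxy_pass` lines strip the path prefix.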
Hint: Mention @RunLLM in the post for followups.
@RunLLM The problem is not the IP or port, it's the GPU memory available. For example, I run model-1 with 50% GPU utilization. I still have the other 50% to run model-2, but it is not able to use it.
This is a known limitation: vLLM’s --gpu-memory-utilization parameter does not always allow two containers to each use 50% of GPU memory, because the allocation is based on the currently available (free) VRAM at container startup, not the total GPU memory. If model-1 starts first and uses 50%, model-2 will only be able to use 50% of the remaining free memory, not 50% of the total, which often leads to out-of-memory errors or failed startups. This behavior is confirmed in multiple GitHub issues and is not a Docker or Compose problem, but a vLLM memory accounting issue (see here, here, and here).
The workaround is to start the containers sequentially (not simultaneously), and set --gpu-memory-utilization to a value that fits within the actual free VRAM at the time each container starts. Sometimes, you may need to set lower values (e.g., 0.3 and 0.3) and leave some buffer. Would you like more detail or example Docker Compose files for this setup?
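If both models must share one GPU, the sequential-start workaround can be sketched in a single Compose file by giving model-1 a healthcheck and making model-2 wait on it, so model-2 only starts after model-1 has finished claiming its memory. The model names and the 0.45 fractions are placeholders to tune for your GPU, and whether the fraction is applied against total or currently free VRAM can vary by vLLM version:

```yaml
services:
  vllm0:
    image: vllm/vllm-openai:latest
    command: ["--model", "org/model-1", "--port", "8000",
              "--gpu-memory-utilization", "0.45"]   # leave headroom for model-2
    healthcheck:
      # vLLM's /health endpoint returns 200 once the engine is initialized
      test: ["CMD", "python3", "-c",
             "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"]
      interval: 10s
      retries: 60            # model loading can take several minutes
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  vllm1:
    image: vllm/vllm-openai:latest
    command: ["--model", "org/model-2", "--port", "8000",
              "--gpu-memory-utilization", "0.45"]
    depends_on:
      vllm0:
        condition: service_healthy   # start only after model-1's memory is claimed
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

If model-2 still fails to start, lower both fractions (e.g. 0.3 and 0.3) to leave a buffer, as noted above.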
Sources:
- GitHub Issue: available VRAM calculation bug in V1
- GitHub Issue: Engine V1 memory allocation
- GitHub Issue: GPU Memory Accounting Issue with Multiple vLLM Instances