@RunLLM I want to know how to serve 2 models using Docker. I tried creating two different Docker Compose files for the 2 models and running them, but it didn't work. Could you help out with this?
vLLM does not support serving multiple models from a single API server or port. The recommended approach is to run a separate vLLM container for each model (each on its own port), then put a reverse proxy such as Nginx in front of them to route requests to the correct container based on the model or endpoint. This setup is documented in the official vLLM Nginx deployment guide and confirmed in the FAQ and GitHub issues: running multiple containers is the standard solution, not a limitation of Docker Compose itself.
For example, you can launch two vLLM containers (each on a different port), then configure Nginx as a load balancer or router in front of them. See the code snippet below for a basic Nginx config and Docker Compose approach. Would you like a detailed step-by-step guide or a sample docker-compose.yml and Nginx config?
```nginx
# Each upstream points at one vLLM container (Compose service name : port).
upstream model1 {
    server vllm0:8000;
}

upstream model2 {
    server vllm1:8000;
}

server {
    listen 80;

    # The trailing slash on proxy_pass strips the /model1/ prefix,
    # so /model1/v1/completions reaches vLLM as /v1/completions.
    location /model1/ {
        proxy_pass http://model1/;
    }

    location /model2/ {
        proxy_pass http://model2/;
    }
}
```
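As a sketch, a docker-compose.yml to pair with that Nginx config could look like the following. The service names match the upstreams above, while the image tag, model names, and cache path are illustrative assumptions you would replace for your setup:

```yaml
services:
  vllm0:
    image: vllm/vllm-openai:latest          # example tag; pin a version in practice
    command: ["--model", "org/model-1", "--port", "8000"]   # placeholder model name
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface       # reuse downloaded weights
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  vllm1:
    image: vllm/vllm-openai:latest
    command: ["--model", "org/model-2", "--port", "8000"]   # placeholder model name
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/conf.d/default.conf:ro      # the Nginx config above
    depends_on:
      - vllm0
      - vllm1
```

With this running, a request to `http://localhost/model1/v1/...` would reach the first container and `http://localhost/model2/v1/...` the second, assuming the `proxy_pass` lines strip the path prefix.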
Hint: Mention @RunLLM in the post for followups.
@RunLLM The problem is not the IP or port, it's the GPU memory available. For example, I run model-1 with 50% GPU utilization. I still have the other 50% to run model-2, but it is not able to use it.
This is a known limitation: vLLM’s --gpu-memory-utilization parameter does not always allow two containers to each use 50% of GPU memory, because the allocation is based on the currently available (free) VRAM at container startup, not the total GPU memory. If model-1 starts first and uses 50%, model-2 will only be able to use 50% of the remaining free memory, not 50% of the total, which often leads to out-of-memory errors or failed startups. This behavior is confirmed in multiple GitHub issues and is not a Docker or Compose problem, but a vLLM memory accounting issue (see here, here, and here).
The workaround is to start the containers sequentially (not simultaneously), and set --gpu-memory-utilization to a value that fits within the actual free VRAM at the time each container starts. Sometimes, you may need to set lower values (e.g., 0.3 and 0.3) and leave some buffer. Would you like more detail or example Docker Compose files for this setup?
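If both models must share one GPU, the sequential-start workaround can be sketched in a single Compose file by giving model-1 a healthcheck and making model-2 wait on it, so model-2 only starts after model-1 has finished claiming its memory. The model names and the 0.45 fractions are placeholders to tune for your GPU, and whether the fraction is applied against total or currently free VRAM can vary by vLLM version:

```yaml
services:
  vllm0:
    image: vllm/vllm-openai:latest
    command: ["--model", "org/model-1", "--port", "8000",
              "--gpu-memory-utilization", "0.45"]   # leave headroom for model-2
    healthcheck:
      # vLLM's /health endpoint returns 200 once the engine is initialized
      test: ["CMD", "python3", "-c",
             "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"]
      interval: 10s
      retries: 60            # model loading can take several minutes
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  vllm1:
    image: vllm/vllm-openai:latest
    command: ["--model", "org/model-2", "--port", "8000",
              "--gpu-memory-utilization", "0.45"]
    depends_on:
      vllm0:
        condition: service_healthy   # start only after model-1's memory is claimed
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

If model-2 still fails to start, lower both fractions (e.g. 0.3 and 0.3) to leave a buffer, as noted above.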
Sources:
- GitHub Issue: available VRAM calculation bug in V1
- GitHub Issue: Engine V1 memory allocation
- GitHub Issue: GPU Memory Accounting Issue with Multiple vLLM Instances