2 vllm containers on a single GPU

RunLLM · May 13, 2025, 10:07am

This is a known issue with vLLM: when multiple vLLM instances run on the same GPU, the second instance often fails to start because of incorrect GPU memory accounting. vLLM’s memory profiling counts memory already used by other vLLM instances, so the memory available to the second instance is underestimated. This leads to errors such as “CUDA error: invalid argument” or a negative or insufficient KV cache size, even when nvidia-smi shows free VRAM. Multiple reports confirm this behavior, and it remains unresolved in recent vLLM versions (see here, here, and here).
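To make the accounting problem concrete, here is a minimal arithmetic sketch in Python. It is a simplified model of the behavior described above, not vLLM's actual profiling code: the function name `kv_cache_budget_gib`, the `count_other` switch, and all the numbers (a 24 GiB GPU, 0.45 utilization, 7 GiB for weights plus activations) are assumptions chosen for illustration.

```python
def kv_cache_budget_gib(total_gib, utilization, own_usage_gib,
                        other_usage_gib=0.0, count_other=True):
    """Simplified model of how much memory is left for the KV cache.

    budget   = total GPU memory * --gpu-memory-utilization
    measured = memory the profiler attributes to this instance

    With count_other=True the profiler also counts memory held by other
    processes on the GPU, which is the behavior described in the post.
    """
    budget = total_gib * utilization
    measured = own_usage_gib
    if count_other:
        measured += other_usage_gib
    return budget - measured


TOTAL_GIB = 24.0    # hypothetical GPU size
UTILIZATION = 0.45  # hypothetical --gpu-memory-utilization for each instance
OWN_GIB = 7.0       # hypothetical weights + activation memory per instance

# First instance: nothing else is running on the GPU yet.
first = kv_cache_budget_gib(TOTAL_GIB, UTILIZATION, OWN_GIB)

# Second instance: the first one now holds its full budget.
first_footprint = TOTAL_GIB * UTILIZATION
second = kv_cache_budget_gib(TOTAL_GIB, UTILIZATION, OWN_GIB,
                             other_usage_gib=first_footprint)

# What the second instance would get if only its own usage were counted.
second_isolated = kv_cache_budget_gib(TOTAL_GIB, UTILIZATION, OWN_GIB,
                                      other_usage_gib=first_footprint,
                                      count_other=False)

print(f"first instance KV budget:  {first:.1f} GiB")
print(f"second instance KV budget: {second:.1f} GiB")
print(f"second, own usage only:    {second_isolated:.1f} GiB")
```

Under these assumed numbers the first instance is left with a positive KV cache budget, while the second ends up with a negative one even though roughly 13 GiB of the GPU is still physically free, which matches the "negative KV cache size despite free VRAM" symptom.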

According to the official documentation and user reports, vLLM V1 does not let multiple instances reliably share a single GPU, even with the --gpu-memory-utilization flag, because each instance tries to account for all memory on the GPU, not just its own allocation. There is no supported workaround for running two vLLM containers on the same GPU; the only reliable solution is to run a single vLLM instance per GPU. Would you like more detail on the technical reasons or possible (unsupported) workarounds?
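For the one-instance-per-GPU setup, a minimal sketch of a launch plan might look like the following. It only builds the commands and does not start anything. The helper `build_launch_plan`, the model names, the ports, and the 0.9 utilization value are assumptions for illustration; `CUDA_VISIBLE_DEVICES` is the standard CUDA environment variable for restricting a process to specific GPUs.

```python
def build_launch_plan(models, gpu_ids, base_port=8000,
                      gpu_memory_utilization=0.9):
    """Pair each model with its own GPU and its own port.

    Returns a list of (env, argv) tuples. Nothing is executed here; the
    caller decides how to start each process or container.
    """
    if len(models) > len(gpu_ids):
        raise ValueError("each vLLM instance needs its own GPU")

    plan = []
    for index, (model, gpu_id) in enumerate(zip(models, gpu_ids)):
        # Restrict this instance to a single physical GPU.
        env = {"CUDA_VISIBLE_DEVICES": str(gpu_id)}
        argv = [
            "vllm", "serve", model,
            "--port", str(base_port + index),
            "--gpu-memory-utilization", str(gpu_memory_utilization),
        ]
        plan.append((env, argv))
    return plan


# Hypothetical model names, used only to show the shape of the output.
plan = build_launch_plan(["org/model-a", "org/model-b"], gpu_ids=[0, 1])
for env, argv in plan:
    print(env, " ".join(argv))
```

Each entry could then be started with something like `subprocess.Popen(argv, env={**os.environ, **env})`, or translated into one container per GPU.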

