2 vLLM containers on a single GPU

This is a known issue with vLLM: when multiple vLLM instances run on the same GPU, the second instance often fails to start because of incorrect GPU memory accounting. vLLM’s memory profiling counts memory already used by other vLLM instances, so the memory available to the second instance is underestimated. This leads to errors such as “CUDA error: invalid argument” or a negative or insufficient KV cache size, even when nvidia-smi shows free VRAM. The behavior is confirmed in multiple reports and remains unresolved in vLLM, including recent versions.
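To make the accounting problem concrete, here is a minimal arithmetic sketch. It is a simplified model of the behavior described above, not vLLM's actual profiling code: the function name `kv_cache_budget_gib` and all the memory figures are hypothetical, chosen only to show how the budget can turn negative while the GPU still has free VRAM.

```python
def kv_cache_budget_gib(total_gib, gpu_memory_utilization, profiled_usage_gib):
    """Simplified model (an assumption, not vLLM's real formula):
    budget = total memory * utilization fraction - usage seen while profiling."""
    return total_gib * gpu_memory_utilization - profiled_usage_gib


TOTAL_GIB = 24.0            # hypothetical 24 GiB GPU
UTILIZATION = 0.45          # value passed as --gpu-memory-utilization
OWN_USAGE_GIB = 8.0         # hypothetical weights + activations of this instance
OTHER_INSTANCE_GIB = 10.8   # hypothetical memory already held by the first instance

# Profiling that counts only this instance's memory leaves room for a KV cache.
isolated = kv_cache_budget_gib(TOTAL_GIB, UTILIZATION, OWN_USAGE_GIB)

# Profiling that also counts the first instance's memory pushes the budget
# below zero, even though the GPU physically has free memory left.
shared = kv_cache_budget_gib(
    TOTAL_GIB, UTILIZATION, OWN_USAGE_GIB + OTHER_INSTANCE_GIB
)
physically_free = TOTAL_GIB - OWN_USAGE_GIB - OTHER_INSTANCE_GIB

print(f"isolated budget: {isolated:.1f} GiB")
print(f"shared budget:   {shared:.1f} GiB")
print(f"physically free: {physically_free:.1f} GiB")
```

In this toy model the second instance sees a negative KV cache budget although several GiB remain unallocated, which matches the symptom of a failed start while nvidia-smi still reports free VRAM.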

The official documentation and user reports state that, despite the --gpu-memory-utilization flag, vLLM V1 does not allow multiple instances to reliably share a single GPU, because each instance tries to account for all memory, not just its own allocation. There is no supported workaround for running two vLLM containers on the same GPU; the only reliable solution is to run a single vLLM instance per GPU. Would you like more detail on the technical reasons or possible (unsupported) workarounds?
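For the one-instance-per-GPU setup, a rough sketch of how the launches could be planned is shown below. The helper `plan_launches`, the model names, and the port numbers are hypothetical; the sketch only builds the environment and command line for each instance (using `CUDA_VISIBLE_DEVICES` and `vllm serve ... --port`) and does not start anything.

```python
def plan_launches(models, gpu_ids, base_port=8000):
    """Pair each model with its own GPU and return one (env, argv) per instance."""
    if len(models) > len(gpu_ids):
        raise ValueError("more models than GPUs: each instance needs its own GPU")
    plans = []
    for i, (model, gpu) in enumerate(zip(models, gpu_ids)):
        # Restrict the process to a single GPU so instances never share a device.
        env = {"CUDA_VISIBLE_DEVICES": str(gpu)}
        argv = ["vllm", "serve", model, "--port", str(base_port + i)]
        plans.append((env, argv))
    return plans


# Hypothetical model names, used purely for illustration.
plans = plan_launches(["my-org/model-a", "my-org/model-b"], gpu_ids=[0, 1])
for env, argv in plans:
    print(env, " ".join(argv))
```

With containers, the same pinning would typically be done by exposing a single device to each container (for example through Docker's `--gpus` option) rather than through the environment variable.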
