Does vLLM support deploying multiple Docker instances on one GPU?

I have an NVIDIA H100 and want to deploy multiple small models on this GPU.
I deployed qwen2.5-omni-7b with gpu_memory_utilization = 0.5.
When I then try to deploy another model, qwen3-4b, with gpu_memory_utilization = 0.4, this error occurs:

ValueError: No available memory for the cache blocks. Try increasing gpu_memory_utilization when initializing the engine.

If I stop qwen2.5-omni-7b and deploy only qwen3-4b, gpu_memory_utilization = 0.25 is enough.

This error occurs because vLLM’s gpu_memory_utilization parameter is a per-instance limit computed against the GPU’s total memory, but the startup memory check does not account for memory already allocated by other vLLM instances on the same GPU. As a result, when you start one model and then another, the second instance sees less free memory than its budget assumes, leading to the “No available memory for the cache blocks” error, even though the sum of the gpu_memory_utilization values is less than 1.0. This is a known issue; the workarounds are to stagger the startup of the services (e.g., wait ~30 seconds between launches) or to increase the gpu_memory_utilization of the second instance so its budget covers the memory the first instance already holds. See the discussion and workaround in this issue.
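To make the arithmetic concrete, here is a back-of-the-envelope sketch (a simplified model, not vLLM’s actual accounting code) of the behavior described above. The 80 GB H100 capacity and the per-model weight sizes are assumptions for illustration:

```python
# Simplified model: each instance budgets gpu_memory_utilization against
# TOTAL GPU memory, but memory already held by other processes is
# subtracted from that budget before KV-cache blocks are allocated.
TOTAL_GB = 80.0  # H100 (assumption)

def kv_cache_headroom(gpu_memory_utilization, already_used_gb, weights_gb):
    """Memory left for KV-cache blocks under this simplified model (GB)."""
    budget = gpu_memory_utilization * TOTAL_GB
    return budget - already_used_gb - weights_gb

# Instance 1: qwen2.5-omni-7b, ~15 GB of weights (assumed), empty GPU.
first = kv_cache_headroom(0.5, already_used_gb=0.0, weights_gb=15.0)
print(f"instance 1 KV-cache headroom: {first:.0f} GB")   # 25 GB -> fine

# Instance 2: qwen3-4b, ~8 GB of weights (assumed), but instance 1
# already holds its full 0.5 * 80 = 40 GB budget.
second = kv_cache_headroom(0.4, already_used_gb=40.0, weights_gb=8.0)
print(f"instance 2 KV-cache headroom: {second:.0f} GB")  # negative -> error

# qwen3-4b alone on the GPU: 0.25 leaves plenty of headroom.
alone = kv_cache_headroom(0.25, already_used_gb=0.0, weights_gb=8.0)
print(f"qwen3-4b alone: {alone:.0f} GB")                 # 12 GB -> fine
```

The negative headroom for the second instance is exactly the “No available memory for the cache blocks” situation, and it disappears when the first instance is stopped.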

If you stop the first model, the second can start with a lower gpu_memory_utilization because all GPU memory is available again. When running multiple vLLM instances on one GPU, you may need to set a higher gpu_memory_utilization for the later-started instance, or experiment with staggered startups, as described above. Would you like more detail on how to tune these parameters or on the underlying memory-allocation behavior?
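Under the same simplified model, you can estimate what gpu_memory_utilization the later-started instance needs: its budget must cover everything already allocated on the GPU plus its own weights and desired KV cache. The sizes below are illustrative assumptions, not measured values:

```python
# Hedged sketch: pick a gpu_memory_utilization for the next instance so
# its budget covers memory other instances already hold, plus its own
# weights and a target KV-cache size. All sizes are assumptions.
TOTAL_GB = 80.0  # H100 (assumption)

def adjusted_utilization(already_used_gb, weights_gb, kv_cache_gb):
    """Fraction of total GPU memory the next instance should request."""
    needed = already_used_gb + weights_gb + kv_cache_gb
    return round(needed / TOTAL_GB, 2)

# First instance holds its 40 GB budget; qwen3-4b needs ~8 GB of weights
# (assumed) and we want ~12 GB of KV cache:
util = adjusted_utilization(already_used_gb=40.0, weights_gb=8.0, kv_cache_gb=12.0)
print(util)  # 0.75
```

In this scenario you would launch the second server with `--gpu-memory-utilization 0.75` rather than 0.4; check actual usage with `nvidia-smi` before choosing the number, since real weight and activation sizes vary.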
