Hello, I wanted to ask for guidance on deploying multiple vLLM Engine services on a single GPU.
Objective:
I want to deploy 2 vLLM Engines, each of which would use about 19GB of GPU memory, on an NVIDIA A100.
What I have tried:
Deploying 2 vLLM Engines in the same container, which has 1 CUDA GPU, by tweaking the gpu_memory_utilization parameter (roughly as sketched below). → Both services can be used normally. However, I would like to containerise each instance separately.
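For reference, the launch looked roughly like this (model names and fractions are placeholders; I tuned the actual values by hand):

```bash
# both engines run inside the same container and see the same CUDA GPU;
# each one is capped to its own fraction of the card's memory
vllm serve my-model-a --port 8000 --gpu-memory-utilization 0.45 &
vllm serve my-model-b --port 8001 --gpu-memory-utilization 0.45 &
```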
So then I tried using NVIDIA MIG with the 4g.20gb profile, which would give each instance 20GB of GPU memory. However, I kept running into a crash telling me that vLLM would need at least ~80 SMs for each instance. While the full GPU has 98 SMs, the MIG profile I'm using only exposes 56.
My questions:
I am unfamiliar with the hardware, so what is the role of SMs for the vLLM Engine? What would be the consequences of deploying 2 vLLM Engines manually on the same GPU, with all 98 SMs shared between the two instances?
Is there any other recommended way to deploy 2 engines on the same GPU, besides the methods I have already tried?
To deploy two vLLM Engines on a single NVIDIA A100 GPU, you can use the --gpu-memory-utilization parameter to limit each instance's memory usage, as you have done. This approach works because both vLLM instances share the GPU's Streaming Multiprocessors (SMs), while each is limited to its allocated memory fraction. With NVIDIA MIG and the 4g.20gb profile, however, each MIG slice exposes only 56 SMs (out of 98), which is below the roughly 80 SMs vLLM reported needing for your model, hence the crash you observed. The SM count matters because vLLM relies on high parallelism for performance, and too few SMs can prevent model initialization or degrade throughput. For more, see this discussion and this issue.
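To make the memory fractions concrete (assuming a 40GB A100, which the 4g.20gb profile suggests): 19GB per engine is 19/40 ≈ 0.475 of the card, so setting --gpu-memory-utilization to roughly 0.45 per instance fits both engines while leaving a small margin for the CUDA context and activations.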
If you deploy both vLLM Engines on the same GPU without MIG, they will share all 98 SMs, which can lead to resource contention but generally works if you control memory usage. The main consequence is potential performance degradation if both engines are heavily loaded, as they compete for compute resources. There is currently no official vLLM support for hard SM partitioning or dynamic sharing beyond what MIG or CUDA provides. The recommended approach is to use --gpu-memory-utilization for each container and launch them sequentially (not simultaneously) to ensure correct memory allocation, as described in this issue. Alternatively, you can explore elastic KV cache memory pooling (experimental), but this requires additional setup and is not yet mainstream (PR 22706).
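As a minimal sketch of that approach (model names, fractions, and the image tag are placeholders; the official vllm/vllm-openai image forwards its arguments to the OpenAI-compatible server), both containers can share the single GPU and be started one after the other:

```bash
# start the first engine and wait until its /health endpoint responds,
# so the second engine sees an accurate picture of free GPU memory
docker run -d --name vllm-a --gpus all --ipc=host -p 8000:8000 \
  vllm/vllm-openai:latest --model my-model-a --gpu-memory-utilization 0.45

until curl -sf http://localhost:8000/health > /dev/null; do sleep 5; done

# second engine: same physical GPU, different host port
docker run -d --name vllm-b --gpus all --ipc=host -p 8001:8000 \
  vllm/vllm-openai:latest --model my-model-b --gpu-memory-utilization 0.45
```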