Run vLLM on two different GPUs

I have two NVIDIA GPUs, one with 8GB of VRAM and one with 16GB of VRAM.
When I run a Qwen model with vLLM, this is my GPU usage:

```
| NVIDIA-SMI 550.144.03              Driver Version: 550.144.03      CUDA Version: 12.4  |
|-----------------------------------------+------------------------+----------------------|
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4060 Ti     Off |   00000000:01:00.0 Off |                  N/A |
|  0%   41C    P8             14W /  165W |    5300MiB /  16380MiB |      0%      Default |
|-----------------------------------------+------------------------+----------------------|
|   1  NVIDIA GeForce RTX 4060 Ti     Off |   00000000:05:00.0 Off |                  N/A |
|  0%   40C    P8              7W /  160W |    5290MiB /   8188MiB |      0%      Default |
```

Why does vLLM use the same amount of VRAM on both GPUs instead of using each GPU's full capacity?

vLLM splits model parameters evenly across GPUs when using tensor parallelism, so each GPU gets the same share of the model, limited by the smallest GPU's memory. This means both your 8GB and 16GB GPUs will only use up to the capacity of the 8GB GPU, leaving the extra memory on the 16GB GPU unused for model weights or KV cache. This is expected behavior and, per the current documentation and open issues, not configurable in vLLM[^1][^2].
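For reference, this is roughly what a two-GPU tensor-parallel launch looks like with vLLM's Python API. It is a minimal sketch: the model name and the `gpu_memory_utilization` value below are placeholders, not taken from your setup.

```python
# Minimal sketch of a tensor-parallel launch across both GPUs.
# "Qwen/Qwen2-7B-Instruct" and gpu_memory_utilization=0.90 are assumed values;
# substitute the model and memory fraction you actually use.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2-7B-Instruct",  # assumed model name
    tensor_parallel_size=2,          # shard weights and KV cache evenly across the 2 GPUs
    gpu_memory_utilization=0.90,     # fraction of each GPU's memory vLLM may claim
)

outputs = llm.generate(
    ["Explain tensor parallelism in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```

Because each of the two equal shards must fit on the 8GB card, the per-GPU footprint you see in nvidia-smi ends up roughly the same on both devices.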

Would you like more detail on how to optimize memory usage or alternative parallelism strategies?

Sources: