Run vLLM on two different GPUs

I have two NVIDIA GPUs, one with 8GB of VRAM and one with 16GB of VRAM.
When I run a Qwen model with vLLM, this is my GPU usage:

```
| NVIDIA-SMI 550.144.03              Driver Version: 550.144.03      CUDA Version: 12.4  |
|-----------------------------------------+------------------------+----------------------|
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4060 Ti     Off |   00000000:01:00.0 Off |                  N/A |
|  0%   41C    P8             14W /  165W |    5300MiB /  16380MiB |      0%      Default |
|-----------------------------------------+------------------------+----------------------|
|   1  NVIDIA GeForce RTX 4060 Ti     Off |   00000000:05:00.0 Off |                  N/A |
|  0%   40C    P8              7W /  160W |    5290MiB /   8188MiB |      0%      Default |
```

Why does vLLM use the same amount of VRAM on both GPUs instead of using each GPU's full capacity?

vLLM splits model parameters evenly across GPUs when using tensor parallelism, so each GPU gets the same share of the model, limited by the smallest GPU's memory. This means both your 8GB and 16GB GPUs will only use up to the capacity of the 8GB GPU, leaving the extra memory on the 16GB GPU unused for model weights or KV cache. This is expected behavior and, per the current documentation and open issues, not configurable in vLLM[^1][^2].
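For reference, this is roughly what a two-GPU tensor-parallel launch looks like with vLLM's Python API. It is a minimal sketch: the model name and the `gpu_memory_utilization` value below are placeholders, not taken from your setup.

```python
# Minimal sketch of a tensor-parallel launch across both GPUs.
# "Qwen/Qwen2-7B-Instruct" and gpu_memory_utilization=0.90 are assumed values;
# substitute the model and memory fraction you actually use.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2-7B-Instruct",  # assumed model name
    tensor_parallel_size=2,          # shard weights and KV cache evenly across the 2 GPUs
    gpu_memory_utilization=0.90,     # fraction of each GPU's memory vLLM may claim
)

outputs = llm.generate(
    ["Explain tensor parallelism in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```

Because each of the two equal shards must fit on the 8GB card, the per-GPU footprint you see in nvidia-smi ends up roughly the same on both devices.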

Would you like more detail on how to optimize memory usage or alternative parallelism strategies?

Sources: