vLLM's tensor parallelism splits model parameters evenly across GPUs, so every GPU must hold an identical shard, and the shard size is capped by the smallest GPU's memory. With an 8GB and a 16GB GPU, each GPU can therefore only use up to 8GB for model weights and KV cache, leaving the extra 8GB on the larger card idle. This is expected behavior and not currently configurable in vLLM, per its documentation and open issues[^1][^2].
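The memory arithmetic can be sketched as follows (illustrative only; these helper functions are not part of vLLM's API, they just model the even-sharding constraint described above):

```python
def usable_memory_per_gpu(gpu_mem_gb):
    """Under even tensor-parallel sharding, each GPU's usable budget
    is capped by the smallest GPU in the group."""
    return min(gpu_mem_gb)

def total_usable_memory(gpu_mem_gb):
    """Total memory the engine can actually use:
    number of GPUs times the smallest GPU's capacity."""
    return len(gpu_mem_gb) * min(gpu_mem_gb)

gpus_gb = [8, 16]  # the heterogeneous pair from the question
print(usable_memory_per_gpu(gpus_gb))  # 8  -> each GPU limited to 8 GB
print(total_usable_memory(gpus_gb))    # 16 -> of 24 GB physically present
```

So of the 24GB physically installed, only 16GB is usable, and 8GB on the larger card is stranded.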
Would you like more detail on how to optimize memory usage or alternative parallelism strategies?
Sources: