vLLM's tensor parallelism splits model parameters evenly across GPUs, so every GPU must hold an identical shard, and the shard size is capped by the smallest GPU's memory. With an 8GB and a 16GB GPU, each GPU can therefore only use up to 8GB for model weights and KV cache, leaving the extra 8GB on the larger card idle. This is expected behavior and not currently configurable in vLLM, per its documentation and open issues[^1][^2].
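The memory arithmetic can be sketched as follows (illustrative only; these helper functions are not part of vLLM's API, they just model the even-sharding constraint described above):

```python
def usable_memory_per_gpu(gpu_mem_gb):
    """Under even tensor-parallel sharding, each GPU's usable budget
    is capped by the smallest GPU in the group."""
    return min(gpu_mem_gb)

def total_usable_memory(gpu_mem_gb):
    """Total memory the engine can actually use:
    number of GPUs times the smallest GPU's capacity."""
    return len(gpu_mem_gb) * min(gpu_mem_gb)

gpus_gb = [8, 16]  # the heterogeneous pair from the question
print(usable_memory_per_gpu(gpus_gb))  # 8  -> each GPU limited to 8 GB
print(total_usable_memory(gpus_gb))    # 16 -> of 24 GB physically present
```

So of the 24GB physically installed, only 16GB is usable, and 8GB on the larger card is stranded.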
Would you like more detail on how to optimize memory usage or alternative parallelism strategies?
Sources: