Compare 2x 5090 32GB (each with the same number of CUDA cores) running tensor-parallel = 2 against a mixed pair of 1x 5090 plus an RTX Pro 4500, which also has 32GB but only half the CUDA cores. How will that weaker card affect inference speed? Will it drop 50%?
It seems likely that pairing a weaker GPU (such as an RTX Pro 4500, with half the CUDA cores of a 5090) with a 5090 in a tensor-parallel setup will significantly reduce inference speed, bottlenecking the whole system to the pace of the slower card. The drop could approach 50%: tensor parallelism splits each layer across GPUs and synchronizes all ranks with an all-reduce at every step, so the faster GPU must wait for the slower one to finish before the step can complete. This is not explicitly quantified in the data sources, but the principle is well established in distributed-inference literature and is implied in vLLM discussions of inter-GPU communication and parallelism overheads (Parallelism and Scaling - vLLM, github issue #367, github issue #689).
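A back-of-envelope sketch of why the slowest GPU sets the pace. The timings below are hypothetical illustrative numbers, not measurements of any real card; only the `max()` structure reflects how the all-reduce barrier works:

```python
# Toy model of one tensor-parallel step: every rank computes its shard,
# then all ranks meet at an all-reduce, so the step takes as long as the
# slowest rank's compute plus the communication itself.
def tp_step_time(per_gpu_compute_ms, comm_ms):
    """Return total step time in ms for one synchronized TP step."""
    return max(per_gpu_compute_ms) + comm_ms

# Hypothetical numbers: two matched cards vs. one card half as fast.
matched = tp_step_time([10.0, 10.0], comm_ms=1.0)   # e.g. 2x 5090
mixed   = tp_step_time([10.0, 20.0], comm_ms=1.0)   # e.g. 5090 + half-speed card

print(f"matched pair: {matched} ms/step, mixed pair: {mixed} ms/step")
print(f"slowdown: {mixed / matched:.2f}x")
```

With these illustrative numbers the mixed pair takes roughly twice as long per step, i.e. close to a 50% throughput loss, which matches the intuition above.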
Would you like a more detailed breakdown of how tensor parallelism works and why the slowest GPU sets the pace?
It seems likely that PCIe 5.0 x16 provides sufficient bandwidth for tensor parallel = 2 on 2x 5090s, especially at moderate context lengths like 4080. However, vLLM sources note that NVLink is preferred for minimizing inter-GPU communication overhead, and PCIe, even at Gen5 speeds, can still introduce some latency, particularly as context length or batch size grows. Communication overhead is lower at small tensor-parallel sizes (like 2), so PCIe 5.0 x16 should generally be adequate, though not as optimal as NVLink (Parallelism and Scaling - vLLM, github issue #732).
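A rough estimate of per-token all-reduce traffic during decode can show why TP=2 usually fits within PCIe 5.0 x16. The model shape (hidden size, layer count) and the two-all-reduces-per-layer assumption are hypothetical placeholders for a large transformer, and ~63 GB/s is an approximate usable figure for PCIe 5.0 x16 in one direction:

```python
# Back-of-envelope all-reduce volume per generated token at TP=2,
# assuming a 70B-class transformer shape with FP16 activations.
hidden_size = 8192           # activation width (hypothetical)
num_layers = 80              # transformer layers (hypothetical)
allreduces_per_layer = 2     # one after attention, one after the MLP
bytes_per_elem = 2           # FP16

bytes_per_token = hidden_size * bytes_per_elem * allreduces_per_layer * num_layers

pcie5_x16_gbps = 63.0        # ~usable one-way bandwidth, PCIe 5.0 x16 (approx.)
comm_ms_per_token = bytes_per_token / (pcie5_x16_gbps * 1e9) * 1e3

print(f"{bytes_per_token / 2**20:.1f} MiB of all-reduce traffic per token")
print(f"~{comm_ms_per_token:.3f} ms of link time per token at PCIe 5.0 x16")
```

Under these assumptions the per-token link time is a small fraction of a millisecond, which is why TP=2 over Gen5 is usually workable; the overhead grows with batch size and sequence length (prefill moves activations for every prompt token at once), which is where NVLink's headroom pays off.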
Would you like more detail on how PCIe bandwidth impacts tensor parallel performance at different context lengths or batch sizes?