Compare 2x 5090 32GB (each with the same number of CUDA cores) running tensor-parallel = 2 against a mixed pair of 1x 5090 plus an RTX Pro 4500, which also has 32GB but only half the CUDA cores. How will that weaker card affect inference speed? Will it drop 50%?
It seems likely that pairing a weaker GPU (such as an RTX Pro 4500, with half the CUDA cores of a 5090) with a 5090 in a tensor-parallel setup will significantly reduce inference speed, bottlenecking the whole system to the pace of the slower card. The drop could approach 50%: tensor parallelism splits each layer across GPUs and synchronizes all ranks with an all-reduce at every step, so the faster GPU must wait for the slower one to finish before the step can complete. This is not explicitly quantified in the data sources, but the principle is well established in distributed-inference literature and is implied in vLLM discussions of inter-GPU communication and parallelism overheads (Parallelism and Scaling - vLLM, github issue #367, github issue #689).
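A back-of-envelope sketch of why the slowest GPU sets the pace. The timings below are hypothetical illustrative numbers, not measurements of any real card; only the `max()` structure reflects how the all-reduce barrier works:

```python
# Toy model of one tensor-parallel step: every rank computes its shard,
# then all ranks meet at an all-reduce, so the step takes as long as the
# slowest rank's compute plus the communication itself.
def tp_step_time(per_gpu_compute_ms, comm_ms):
    """Return total step time in ms for one synchronized TP step."""
    return max(per_gpu_compute_ms) + comm_ms

# Hypothetical numbers: two matched cards vs. one card half as fast.
matched = tp_step_time([10.0, 10.0], comm_ms=1.0)   # e.g. 2x 5090
mixed   = tp_step_time([10.0, 20.0], comm_ms=1.0)   # e.g. 5090 + half-speed card

print(f"matched pair: {matched} ms/step, mixed pair: {mixed} ms/step")
print(f"slowdown: {mixed / matched:.2f}x")
```

With these illustrative numbers the mixed pair takes roughly twice as long per step, i.e. close to a 50% throughput loss, which matches the intuition above.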
Would you like a more detailed breakdown of how tensor parallelism works and why the slowest GPU sets the pace?
It seems likely that PCIe 5.0 x16 provides sufficient bandwidth for tensor parallel = 2 on 2x 5090s, especially at moderate context lengths like 4080. However, vLLM sources note that NVLink is preferred for minimizing inter-GPU communication overhead, and PCIe, even at Gen5 speeds, can still introduce some latency, particularly as context length or batch size grows. Communication overhead is lower at small tensor-parallel sizes (like 2), so PCIe 5.0 x16 should generally be adequate, though not as optimal as NVLink (Parallelism and Scaling - vLLM, github issue #732).
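A rough estimate of per-token all-reduce traffic during decode can show why TP=2 usually fits within PCIe 5.0 x16. The model shape (hidden size, layer count) and the two-all-reduces-per-layer assumption are hypothetical placeholders for a large transformer, and ~63 GB/s is an approximate usable figure for PCIe 5.0 x16 in one direction:

```python
# Back-of-envelope all-reduce volume per generated token at TP=2,
# assuming a 70B-class transformer shape with FP16 activations.
hidden_size = 8192           # activation width (hypothetical)
num_layers = 80              # transformer layers (hypothetical)
allreduces_per_layer = 2     # one after attention, one after the MLP
bytes_per_elem = 2           # FP16

bytes_per_token = hidden_size * bytes_per_elem * allreduces_per_layer * num_layers

pcie5_x16_gbps = 63.0        # ~usable one-way bandwidth, PCIe 5.0 x16 (approx.)
comm_ms_per_token = bytes_per_token / (pcie5_x16_gbps * 1e9) * 1e3

print(f"{bytes_per_token / 2**20:.1f} MiB of all-reduce traffic per token")
print(f"~{comm_ms_per_token:.3f} ms of link time per token at PCIe 5.0 x16")
```

Under these assumptions the per-token link time is a small fraction of a millisecond, which is why TP=2 over Gen5 is usually workable; the overhead grows with batch size and sequence length (prefill moves activations for every prompt token at once), which is where NVLink's headroom pays off.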
Would you like more detail on how PCIe bandwidth impacts tensor parallel performance at different context lengths or batch sizes?