Hello guys,
For a model that fits on a single GPU, will Tensor Parallelism (TP) reduce or increase response time?
I googled it and most answers say TP will slow down inference because of communication overhead. But I tested Qwen3-32B-AWQ and found that TP across two 3090 GPUs is faster than a single GPU: ~56 tokens/s vs ~40 tokens/s (single request, no concurrency).
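For reference, my setup can be reproduced roughly like this. I'm assuming vLLM as the serving engine here (the exact engine and flags may differ on your side):

```shell
# Hypothetical reproduction sketch, assuming vLLM.
# Baseline: single GPU
CUDA_VISIBLE_DEVICES=0 vllm serve Qwen/Qwen3-32B-AWQ \
    --quantization awq

# Tensor parallel across two GPUs
CUDA_VISIBLE_DEVICES=0,1 vllm serve Qwen/Qwen3-32B-AWQ \
    --quantization awq \
    --tensor-parallel-size 2
```

Then I measure single-request generation throughput against each server.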
Am I doing anything wrong?
I understand it depends on many factors, such as model structure, the gain from parallel computation, and communication overhead. But I don't have the chance to test more models. Is there a general conclusion?
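One back-of-envelope way I tried to frame my numbers: single-request decoding is memory-bandwidth bound, since each generated token requires streaming all the weights from GPU memory once. TP splits the weights across GPUs, so aggregate bandwidth roughly scales with GPU count, minus all-reduce overhead. A rough sketch (the ~18 GB weight size for 4-bit AWQ and the 3090's ~936 GB/s bandwidth are my own rough assumptions):

```python
# Rough roofline estimate for single-request decode throughput.
# Assumptions (not measured): Qwen3-32B at 4-bit AWQ ~= 18 GB of weights,
# RTX 3090 peak memory bandwidth ~= 936 GB/s.
WEIGHTS_GB = 18.0
BW_3090_GBS = 936.0

def max_tokens_per_s(n_gpus: int, efficiency: float = 1.0) -> float:
    """Upper bound on decode tokens/s: each token streams all weights once.

    With tensor parallelism the weights are sharded, so aggregate bandwidth
    scales with n_gpus; communication overhead would show up as a lower
    effective `efficiency` factor.
    """
    return n_gpus * BW_3090_GBS * efficiency / WEIGHTS_GB

print(max_tokens_per_s(1))  # ceiling for one GPU, ~52 tok/s
print(max_tokens_per_s(2))  # ceiling for two GPUs, ~104 tok/s
```

Under this model my measured 40 tok/s (single GPU) sits below the ~52 tok/s bandwidth ceiling, and 56 tok/s with TP=2 is well below the doubled ceiling, so a speedup despite communication overhead at least looks plausible.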
If you have any opinions, please share. Thanks :)