[Help] What is the impact of Tensor Parallelism on inference speed?

Hello everyone,
For a model that fits entirely on one GPU, will Tensor Parallelism (TP) reduce or increase response time?

I googled it, and most answers say TP slows down inference because of communication overhead. But when I tested Qwen3-32B-AWQ, TP across two RTX 3090 GPUs was faster than a single GPU: ~56 tokens/s vs. 40 tokens/s (single request).
Am I doing anything wrong?
I understand many factors are involved, such as model structure, the gain from parallel computation, and communication overhead, but I don't have the chance to test more models. Is there a general conclusion?

If you have any opinions, please share them. Thanks :)


Generally, if the model fits entirely in a single GPU's memory, Tensor Parallelism (TP) will usually increase inference latency because of inter-GPU communication overhead; this is most noticeable for small models or small batches. But for some larger models (such as Qwen3-32B-AWQ), TP can still improve throughput, as your tokens/s measurement shows. The likely reason is that a single GPU is not fully utilized: splitting the work across two GPUs exploits the hardware better, and the gain outweighs the communication cost. Actual behavior depends on model architecture, GPU interconnect bandwidth, concurrency, and other factors, so there is no universal answer. Official documentation and community discussions both note that TP's clear wins come when the model is too large for one GPU, or when you need more room for the KV cache; otherwise the communication cost can cancel the parallel gains or even make inference slower (see the official optimization docs and community discussions 1 and 2).
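One way to see why TP can help even for a single request: decode is typically memory-bandwidth bound, since every generated token requires reading all the model weights once. Splitting the weights across two GPUs roughly halves that read time, at the cost of all-reduces each layer. Here is a back-of-envelope sketch; all the constants (weight size, bandwidth, layer count, all-reduce cost) are rough assumptions, not measurements:

```python
# Illustrative decode-latency model: single-request decoding is dominated
# by streaming the model weights from GPU memory once per generated token.

WEIGHT_BYTES = 19e9   # assumed: ~19 GB of 4-bit AWQ weights for a 32B model
GPU_BW = 936e9        # assumed: RTX 3090 memory bandwidth, ~936 GB/s
NUM_LAYERS = 64       # assumed transformer layer count
ALLREDUCE_S = 30e-6   # assumed per-all-reduce cost over PCIe (rough guess)

def decode_latency(tp: int) -> float:
    """Seconds per token: weight-read time split across `tp` GPUs, plus
    two all-reduces per layer (attention + MLP) when tp > 1."""
    weight_read = WEIGHT_BYTES / (GPU_BW * tp)
    comm = 2 * NUM_LAYERS * ALLREDUCE_S if tp > 1 else 0.0
    return weight_read + comm

for tp in (1, 2):
    lat = decode_latency(tp)
    print(f"TP={tp}: {lat * 1e3:.1f} ms/token, ~{1 / lat:.0f} tokens/s")
```

Under these assumptions TP=2 comes out faster than TP=1, matching your observation; with a faster single GPU or a slower interconnect, the communication term can flip the result the other way.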

Your result is not unusual: TP does improve throughput in some cases, but it is not a general rule. The usual advice is: single GPU for small models or low concurrency; TP for large models, high concurrency, or when a single GPU's memory is insufficient. Beyond that, you need to benchmark on your actual hardware and models. Would you like a more detailed explanation of the underlying principles with case studies?
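For reference, a comparison like yours can be reproduced with vLLM's CLI (assuming vLLM is installed and the model name matches your local setup):

```shell
# Serve Qwen3-32B-AWQ across two GPUs with tensor parallelism.
# For the single-GPU baseline, set --tensor-parallel-size 1.
vllm serve Qwen/Qwen3-32B-AWQ \
    --tensor-parallel-size 2 \
    --quantization awq
```

Benchmark both configurations with the same single-request workload to get a fair tokens/s comparison.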


