[Help] What is the impact of Tensor Parallelism on inference speed?

Hello everyone,
For a model that fits entirely on one GPU, will Tensor Parallelism (TP) reduce or increase response time?

I googled it, and most answers say TP slows down inference because of communication overhead. But when I tested Qwen3-32B-AWQ, TP across two RTX 3090 GPUs was faster than a single GPU: ~56 tokens/s vs. 40 tokens/s (single request).
Am I doing anything wrong?
I understand many factors are involved, such as model structure, the gain from parallel computation, and communication overhead, but I don't have the chance to test more models. Is there a general conclusion?

If you have any opinions, please share them. Thanks :)


Generally, if the model fits entirely in a single GPU's memory, Tensor Parallelism (TP) will usually increase inference latency because of inter-GPU communication overhead; this is most noticeable for small models or small batches. But for some larger models (such as Qwen3-32B-AWQ), TP can still improve throughput, as your tokens/s measurement shows. The likely reason is that a single GPU is not fully utilized: splitting the work across two GPUs exploits the hardware better, and the gain outweighs the communication cost. Actual behavior depends on model architecture, GPU interconnect bandwidth, concurrency, and other factors, so there is no universal answer. Official documentation and community discussions both note that TP's clear wins come when the model is too large for one GPU, or when you need more room for the KV cache; otherwise the communication cost can cancel the parallel gains or even make inference slower (see the official optimization docs and community discussions 1 and 2).
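One way to see why TP can help even for a single request: decode is typically memory-bandwidth bound, since every generated token requires reading all the model weights once. Splitting the weights across two GPUs roughly halves that read time, at the cost of all-reduces each layer. Here is a back-of-envelope sketch; all the constants (weight size, bandwidth, layer count, all-reduce cost) are rough assumptions, not measurements:

```python
# Illustrative decode-latency model: single-request decoding is dominated
# by streaming the model weights from GPU memory once per generated token.

WEIGHT_BYTES = 19e9   # assumed: ~19 GB of 4-bit AWQ weights for a 32B model
GPU_BW = 936e9        # assumed: RTX 3090 memory bandwidth, ~936 GB/s
NUM_LAYERS = 64       # assumed transformer layer count
ALLREDUCE_S = 30e-6   # assumed per-all-reduce cost over PCIe (rough guess)

def decode_latency(tp: int) -> float:
    """Seconds per token: weight-read time split across `tp` GPUs, plus
    two all-reduces per layer (attention + MLP) when tp > 1."""
    weight_read = WEIGHT_BYTES / (GPU_BW * tp)
    comm = 2 * NUM_LAYERS * ALLREDUCE_S if tp > 1 else 0.0
    return weight_read + comm

for tp in (1, 2):
    lat = decode_latency(tp)
    print(f"TP={tp}: {lat * 1e3:.1f} ms/token, ~{1 / lat:.0f} tokens/s")
```

Under these assumptions TP=2 comes out faster than TP=1, matching your observation; with a faster single GPU or a slower interconnect, the communication term can flip the result the other way.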

Your result is not unusual: TP does improve throughput in some cases, but it is not a general rule. The usual advice is: single GPU for small models or low concurrency; TP for large models, high concurrency, or when a single GPU's memory is insufficient. Beyond that, you need to benchmark on your actual hardware and models. Would you like a more detailed explanation of the underlying principles with case studies?
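For reference, a comparison like yours can be reproduced with vLLM's CLI (assuming vLLM is installed and the model name matches your local setup):

```shell
# Serve Qwen3-32B-AWQ across two GPUs with tensor parallelism.
# For the single-GPU baseline, set --tensor-parallel-size 1.
vllm serve Qwen/Qwen3-32B-AWQ \
    --tensor-parallel-size 2 \
    --quantization awq
```

Benchmark both configurations with the same single-request workload to get a fair tokens/s comparison.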


