In distributed inference, communication accounts for a large fraction of the overall latency. Is there any plan to support computation/communication overlapping as an optimization?
Not at the moment, actually. That is part of the reason HBM and all-in-one wafer-scale chips exist.
There are some overlap techniques in other frameworks, such as deepspeed/domino.
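The idea behind these techniques is to hide communication latency by issuing the transfer for one chunk of data while computing the next. Real frameworks do this with async collectives on separate streams (e.g. PyTorch's `torch.distributed.all_reduce(..., async_op=True)` returns a work handle you wait on later); the snippet below is only a minimal stdlib sketch of the same pipelining pattern, using a background thread as a stand-in for the network and hypothetical `compute`/`communicate` helpers.

```python
import threading
import time

def overlapped_pipeline(chunks, compute, communicate):
    """Overlap communication of chunk i with computation of chunk i+1."""
    results = [None] * len(chunks)
    threads = []
    for i, chunk in enumerate(chunks):
        out = compute(chunk)  # compute the current chunk
        # Kick off its transfer in the background, then immediately
        # move on to computing the next chunk.
        t = threading.Thread(target=communicate, args=(i, out, results))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()  # drain all in-flight transfers at the end
    return results

# Hypothetical stand-ins: compute doubles values; "communication" sleeps
# to mimic network latency, then deposits the result.
def compute(chunk):
    return [2 * x for x in chunk]

def communicate(i, out, results):
    time.sleep(0.05)  # simulated transfer time
    results[i] = out

chunks = [[1, 2], [3, 4], [5, 6]]
start = time.perf_counter()
res = overlapped_pipeline(chunks, compute, communicate)
elapsed = time.perf_counter() - start
# Because transfers overlap with compute (and with each other), total
# wall time stays near one transfer (~0.05 s) rather than three (~0.15 s).
```

In a real GPU setting the same shape appears as: launch the collective for layer *i*'s gradients/activations on a communication stream, run layer *i+1*'s kernels on the compute stream, and synchronize only when the communicated data is actually needed.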