In distributed inference, communication accounts for a large fraction of the overall latency. Is there any plan to support computation/communication overlapping as an optimization?
Not at the moment, actually. That is part of the reason HBM and all-in-one wafer-scale chips exist.
There are some overlap techniques in other frameworks, such as deepspeed/domino.
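The idea behind these techniques is to hide communication latency by issuing the transfer for one chunk of data while computing the next. Real frameworks do this with async collectives on separate streams (e.g. PyTorch's `torch.distributed.all_reduce(..., async_op=True)` returns a work handle you wait on later); the snippet below is only a minimal stdlib sketch of the same pipelining pattern, using a background thread as a stand-in for the network and hypothetical `compute`/`communicate` helpers.

```python
import threading
import time

def overlapped_pipeline(chunks, compute, communicate):
    """Overlap communication of chunk i with computation of chunk i+1."""
    results = [None] * len(chunks)
    threads = []
    for i, chunk in enumerate(chunks):
        out = compute(chunk)  # compute the current chunk
        # Kick off its transfer in the background, then immediately
        # move on to computing the next chunk.
        t = threading.Thread(target=communicate, args=(i, out, results))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()  # drain all in-flight transfers at the end
    return results

# Hypothetical stand-ins: compute doubles values; "communication" sleeps
# to mimic network latency, then deposits the result.
def compute(chunk):
    return [2 * x for x in chunk]

def communicate(i, out, results):
    time.sleep(0.05)  # simulated transfer time
    results[i] = out

chunks = [[1, 2], [3, 4], [5, 6]]
start = time.perf_counter()
res = overlapped_pipeline(chunks, compute, communicate)
elapsed = time.perf_counter() - start
# Because transfers overlap with compute (and with each other), total
# wall time stays near one transfer (~0.05 s) rather than three (~0.15 s).
```

In a real GPU setting the same shape appears as: launch the collective for layer *i*'s gradients/activations on a communication stream, run layer *i+1*'s kernels on the compute stream, and synchronize only when the communicated data is actually needed.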