Can Ascend officially draft documentation on the vLLM-Ascend adaptation for graph mode?

@kinnn8899 Thanks for the feedback. The graph mode (aka compile mode) is still a work in progress.

It will be implemented using the torchair [1] backend via a custom torch.compile backend key [2].

Please let us know if you have more questions.

[1] GitHub - Ascend/torchair
[2] [platform] support custom torch.compile backend key by wangxiyuan · Pull Request #11318 · vllm-project/vllm · GitHub
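
For context on how [2] plugs in: PyTorch lets a platform register a named compile backend, which torch.compile can then select by key. Below is a generic sketch using only stock PyTorch API; the "npu_demo" key and the pass-through compiler function are illustrative examples, not the actual backend registered by vLLM or torchair.

```python
import torch
from torch._dynamo import register_backend

# Hypothetical backend key "npu_demo"; a real backend (e.g. torchair)
# would lower the graph to NPU kernels instead of falling back to eager.
@register_backend(name="npu_demo")
def npu_demo_backend(gm: torch.fx.GraphModule, example_inputs):
    return gm.forward  # pass-through: run the captured graph eagerly

# torch.compile can now look the backend up by its registered key
model = torch.compile(torch.nn.Linear(4, 4), backend="npu_demo")
model(torch.randn(2, 4))
```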

TBH, I want to know this:
After the framework has adapted graph mode (torchair), does the model side still need to be adapted? If so, what does the model side need to do?

@kinnn8899 No, no further action is needed once it's supported; torchair has a reduce-overhead mode that speeds up models seamlessly:
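
A minimal sketch of what that usage could look like, assuming torchair's `CompilerConfig`/`get_npu_backend` API from [1]; the `mode` field and its "reduce-overhead" value are assumptions and may differ in released versions:

```python
import torch
import torchair

model = torch.nn.Linear(1024, 1024)  # stand-in for any supported model

config = torchair.CompilerConfig()
config.mode = "reduce-overhead"  # assumed knob for the aclgraph-style mode

npu_backend = torchair.get_npu_backend(compiler_config=config)
compiled_model = torch.compile(model, backend=npu_backend)  # model code unchanged
```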

But it is still a work in progress; we are also cooperating with the torch_npu team to complete it.

The roadmap is also not finalized yet (maybe 2025 Q2 or Q3?); we will provide more documentation once it's supported.

Basically: graph mode provides two core acceleration capabilities, kernel fusion and framework-overhead reduction, and all of them will be delivered via torch.compile.
Regarding reduce-overhead, which is equivalent to cudagraph functionality: the Ascend counterpart, named aclgraph, will be officially released by Q2 at the latest (see the CUDA comparison sketch below).
Regarding automatic fusion: multiple teams are attempting different implementation approaches, and once they mature we will introduce and integrate them.
Torchair, as the graph-mode bridge between torch and Ascend, will provide different user experiences in the future through various config_mode options.
If the inductor-based automatic fusion capability matures in the future, we will also consider directly providing an inductor-npu backend.
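
For reference, this is the behavior aclgraph mirrors: on CUDA, torch.compile's built-in reduce-overhead mode captures execution into CUDA Graphs so repeated calls replay with minimal framework overhead. This snippet uses only stock PyTorch API:

```python
import torch

# mode="reduce-overhead" wraps execution in CUDA Graphs on NVIDIA GPUs;
# aclgraph is positioned as the NPU analogue of this behavior.
model = torch.nn.Linear(1024, 1024).cuda()
compiled = torch.compile(model, mode="reduce-overhead")

x = torch.randn(8, 1024, device="cuda")
with torch.no_grad():
    out = compiled(x)  # first call compiles and captures; later calls replay
```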

If you only want the framework overhead-reduction benefit, nothing but the original torch.compile work is needed.
If you want the kernel auto-fusion benefit, different fusion backends may require different external work, as sketched below.
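
To make "external work" concrete, here is a toy, hypothetical sketch of the kind of FX-graph inspection a fusion backend performs; a real NPU fusion pass would rewrite matched patterns into fused kernels rather than just report them:

```python
import torch

# Toy illustration only: detects add -> relu chains that a real fusion
# backend could rewrite into a single fused kernel; this version merely
# reports them and falls back to eager execution.
def toy_fusion_backend(gm: torch.fx.GraphModule, example_inputs):
    for node in gm.graph.nodes:
        if node.target is torch.relu and any(
            isinstance(arg, torch.fx.Node) and arg.target is torch.add
            for arg in node.args
        ):
            print(f"fusable add+relu pattern at node {node.name}")
    return gm.forward

def f(x, y):
    return torch.relu(torch.add(x, y))

compiled = torch.compile(f, backend=toy_fusion_backend)
compiled(torch.randn(4), torch.randn(4))
```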

Will memory usage increase (as with CUDA graphs) when running models (such as DeepSeek) with torchair's reduce-overhead mode?

A question from the vLLM Ascend weekly meeting.