Can Ascend officially draft documentation on the vLLM-Ascend adaptation for graph mode?

@kinnn8899 Thanks for the feedback. The graph mode (aka compile mode) is still a work in progress.

It will be implemented using the torchair [1] backend via a custom torch.compile backend key [2].

Please let us know if you have more questions.

[1] GitHub - Ascend/torchair
[2] [platform] support custom torch.compile backend key by wangxiyuan · Pull Request #11318 · vllm-project/vllm · GitHub
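
For context on how [2] plugs in: PyTorch lets a platform register a named compile backend, which torch.compile can then select by key. Below is a generic sketch using only stock PyTorch API; the "npu_demo" key and the pass-through compiler function are illustrative examples, not the actual backend registered by vLLM or torchair.

```python
import torch
from torch._dynamo import register_backend

# Hypothetical backend key "npu_demo"; a real backend (e.g. torchair)
# would lower the graph to NPU kernels instead of falling back to eager.
@register_backend(name="npu_demo")
def npu_demo_backend(gm: torch.fx.GraphModule, example_inputs):
    return gm.forward  # pass-through: run the captured graph eagerly

# torch.compile can now look the backend up by its registered key
model = torch.compile(torch.nn.Linear(4, 4), backend="npu_demo")
model(torch.randn(2, 4))
```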

TBH, I want to know this:
After the framework has adapted graph mode (torchair), does the model side still need to be adapted? If so, what does the model side need to do?

@kinnn8899 No, no further action is needed once it's supported; torchair has a reduce-overhead mode that speeds up models seamlessly:
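
A minimal sketch of what that usage could look like, assuming torchair's `CompilerConfig`/`get_npu_backend` API from [1]; the `mode` field and its "reduce-overhead" value are assumptions and may differ in released versions:

```python
import torch
import torchair

model = torch.nn.Linear(1024, 1024)  # stand-in for any supported model

config = torchair.CompilerConfig()
config.mode = "reduce-overhead"  # assumed knob for the aclgraph-style mode

npu_backend = torchair.get_npu_backend(compiler_config=config)
compiled_model = torch.compile(model, backend=npu_backend)  # model code unchanged
```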

But it is still a work in progress; we are also cooperating with the torch_npu team to complete it.

The roadmap is also not finalized yet (maybe 2025 Q2 or Q3?); we will provide more documentation once it's supported.

Basically: graph mode provides two core acceleration capabilities, kernel fusion and framework-overhead reduction, and all of them will be delivered via torch.compile.
Regarding reduce-overhead, which is equivalent to cudagraph functionality: the Ascend counterpart, named aclgraph, will be officially released by Q2 at the latest (see the CUDA comparison sketch below).
Regarding automatic fusion: multiple teams are attempting different implementation approaches, and once they mature we will introduce and integrate them.
Torchair, as the graph-mode bridge between torch and Ascend, will provide different user experiences in the future through various config_mode options.
If the inductor-based automatic fusion capability matures in the future, we will also consider directly providing an inductor-npu backend.
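
For reference, this is the behavior aclgraph mirrors: on CUDA, torch.compile's built-in reduce-overhead mode captures execution into CUDA Graphs so repeated calls replay with minimal framework overhead. This snippet uses only stock PyTorch API:

```python
import torch

# mode="reduce-overhead" wraps execution in CUDA Graphs on NVIDIA GPUs;
# aclgraph is positioned as the NPU analogue of this behavior.
model = torch.nn.Linear(1024, 1024).cuda()
compiled = torch.compile(model, mode="reduce-overhead")

x = torch.randn(8, 1024, device="cuda")
with torch.no_grad():
    out = compiled(x)  # first call compiles and captures; later calls replay
```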

If you only want the framework overhead-reduction benefit, nothing but the original torch.compile work is needed.
If you want the kernel auto-fusion benefit, different fusion backends may require different external work, as sketched below.
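
To make "external work" concrete, here is a toy, hypothetical sketch of the kind of FX-graph inspection a fusion backend performs; a real NPU fusion pass would rewrite matched patterns into fused kernels rather than just report them:

```python
import torch

# Toy illustration only: detects add -> relu chains that a real fusion
# backend could rewrite into a single fused kernel; this version merely
# reports them and falls back to eager execution.
def toy_fusion_backend(gm: torch.fx.GraphModule, example_inputs):
    for node in gm.graph.nodes:
        if node.target is torch.relu and any(
            isinstance(arg, torch.fx.Node) and arg.target is torch.add
            for arg in node.args
        ):
            print(f"fusable add+relu pattern at node {node.name}")
    return gm.forward

def f(x, y):
    return torch.relu(torch.add(x, y))

compiled = torch.compile(f, backend=toy_fusion_backend)
compiled(torch.randn(4), torch.randn(4))
```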

Will memory usage increase (as with CUDA graphs) when running models (such as DeepSeek) with torchair's reduce-overhead mode?

A question from the vLLM Ascend weekly meeting.