When does vLLM use Triton?

Easyer · June 12, 2025, 4:44am

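Under what circumstances does vLLM use Triton? While deploying with prefill/decode (PD) disaggregation, I noticed that execution entered a Triton branch.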

RunLLM · June 12, 2025, 4:45am

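vLLM uses Triton in some of its attention implementations (for example the triton_attn and triton_mla backends) to improve inference performance. If you see execution "entering the Triton branch", vLLM has most likely selected a Triton-based attention backend automatically, which is normal. Triton here means triton-lang, the language and compiler for writing high-performance GPU kernels, not NVIDIA Triton Inference Server. According to the vllm.v1.attention.backends.triton_attn documentation and the triton_mla documentation, some of vLLM's attention backends are built on triton-lang.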

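If you mean integration with NVIDIA Triton Inference Server, vLLM can also be deployed as a backend for it, but that is a separate matter from the Triton implementation of the attention kernels; see the official Triton integration documentation. Would you like a more detailed explanation of the difference?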
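To make the distinction concrete, here is a minimal sketch that reports which of the two "Triton" packages is importable in the current environment. It assumes the usual top-level module names: triton for triton-lang and tritonclient for the Triton Inference Server client. The helper describe_package is hypothetical and not part of vLLM.

```python
import importlib.util
from importlib import metadata


def describe_package(name: str) -> str:
    """Report whether a top-level package is importable, with its version if known."""
    if importlib.util.find_spec(name) is None:
        return f"{name}: not installed"
    try:
        version = metadata.version(name)
    except metadata.PackageNotFoundError:
        # Importable, but no distribution metadata under this exact name.
        version = "unknown version"
    return f"{name}: installed ({version})"


# Assumption: "triton" is the triton-lang compiler package, and
# "tritonclient" is the client library for NVIDIA Triton Inference Server.
for pkg in ("triton", "tritonclient"):
    print(describe_package(pkg))
```

If only triton shows up as installed, the "Triton branch" seen in vLLM logs almost certainly refers to triton-lang kernels rather than to the inference server.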


Easyer · June 12, 2025, 4:46am

How do I disable Triton so that execution does not enter the Triton branch?

RunLLM · June 12, 2025, 4:47am

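To turn off vLLM's Triton kernel implementations and avoid the Triton branch, you can set environment variables. For example, for the attention-related Triton implementation, setting VLLM_USE_TRITON_FLASH_ATTN=0 disables Triton flash attention, so that vLLM uses a PyTorch-based or other backend instead. The variable is described in the environment variables documentation and the ROCm installation documentation; because it appears mainly in the ROCm context, it may not change backend selection on other platforms.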

Example:

export VLLM_USE_TRITON_FLASH_ATTN=0

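To disable other Triton-related kernels, check the environment variables documentation for variables whose names begin with VLLM_USE_TRITON_. Would you like more detailed instructions?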
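As a quick sanity check, the sketch below lists which variables with that prefix are actually set in the current process. It only inspects the environment; it does not enumerate the variables vLLM supports, and the helper name triton_flags is hypothetical.

```python
import os


def triton_flags(environ=None, prefix="VLLM_USE_TRITON_"):
    """Return the environment variables whose names start with the given prefix."""
    env = os.environ if environ is None else environ
    return {key: value for key, value in sorted(env.items()) if key.startswith(prefix)}


# Print the flags visible to this process (empty if none are set).
for name, value in triton_flags().items():
    print(f"{name}={value}")
```

Running this in the same shell that launches vLLM confirms whether an export such as VLLM_USE_TRITON_FLASH_ATTN=0 is visible to the Python process.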


Easyer · June 12, 2025, 4:48am

How do I disable Triton in PyTorch? Is it done with an environment variable?

RunLLM · June 12, 2025, 4:48am

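Yes, PyTorch's Triton usage can be turned off with an environment variable. According to vLLM PR #10747, setting TORCH_COMPILE_DISABLE=True disables torch.compile, which avoids the Triton kernels that torch.compile would otherwise generate.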

Example:

export TORCH_COMPILE_DISABLE=True

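With this set, PyTorch will not automatically apply torch.compile's Triton-based optimizations. Triton kernels that a library launches directly are not affected by this setting. Would you like a more detailed explanation of how this works or when it applies?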
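When launching from a Python script instead of a shell, the same setting can be applied in-process, as sketched below. The value "True" is copied from the reply above; whether a particular PyTorch version expects a different spelling (such as 1) is not confirmed here. The helper disable_torch_compile is hypothetical, and the sketch assumes the variable should be set before torch is first imported.

```python
import os
import sys


def disable_torch_compile(value: str = "True") -> bool:
    """Set TORCH_COMPILE_DISABLE; return True if torch has not been imported yet."""
    os.environ["TORCH_COMPILE_DISABLE"] = value
    # If torch is already loaded, it may have read the environment earlier,
    # so the new value is not guaranteed to take effect in this process.
    return "torch" not in sys.modules


set_in_time = disable_torch_compile()
if not set_in_time:
    print("warning: torch was imported before TORCH_COMPILE_DISABLE was set")
```

Place the call at the very top of the entry-point script, ahead of any import of torch or vllm.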

Sources:

vLLM PR #10747

Easyer · June 12, 2025, 4:49am

So what does the PYTORCH_NO_TRITON variable control?
