How can I set up custom Triton ops in vLLM?

It seems likely that simply defining your custom Triton kernel and calling it in the model's forward method is not enough for vLLM to properly register and use the op, especially with torch.compile and vLLM's custom-op infrastructure. vLLM expects custom ops to be registered following PyTorch's guidelines: declare a schema, provide a meta (fake) implementation, and integrate the op with the compilation system. If you want the op to be used during graph compilation, you should also register it as a subclass of CustomOp and enable it in the CompilationConfig (see the custom_ops field). Otherwise, vLLM may fall back to tracing through the entire function, which causes the long compile times or hangs you observed. See the contributing guide on adding or changing kernels and the custom-op documentation for details.
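As a rough illustration, here is a minimal sketch of the PyTorch-level registration, assuming PyTorch 2.4+ (`torch.library.custom_op` / `register_fake`). The kernel name `add_kernel`, the op name `my_ops::fused_add`, and the block size are hypothetical placeholders for your real kernel; the fake implementation is the meta function torch.compile uses to trace the op without executing it:

```python
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Simple element-wise add; stands in for your real fused kernel.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


@torch.library.custom_op("my_ops::fused_add", mutates_args=())
def fused_add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Eager/CUDA implementation: launches the Triton kernel.
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out


@fused_add.register_fake
def _(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Meta ("fake") implementation: only describes the output shape/dtype so
    # torch.compile can trace through the op without running the kernel.
    return torch.empty_like(x)
```

Once registered this way, the op is callable as `torch.ops.my_ops.fused_add(x, y)` and appears as a single node in the traced graph instead of being traced through.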
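To plug that op into vLLM's own custom-op machinery, you would typically wrap it in a CustomOp subclass and enable it through CompilationConfig. The sketch below builds on the registration above; the class name `FusedAdd` and the registered name `"fused_add"` are made up, and the exact import paths and decorator signature can differ between vLLM versions, so check the version you are running:

```python
import torch
from vllm.model_executor.custom_op import CustomOp


@CustomOp.register("fused_add")
class FusedAdd(CustomOp):
    """Dispatches between a native PyTorch path and the Triton-backed op."""

    def forward_native(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # Fallback path used when the custom op is disabled or unsupported.
        return x + y

    def forward_cuda(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # Calls the torch.library-registered op from the previous snippet.
        return torch.ops.my_ops.fused_add(x, y)
```

Enabling the op during graph compilation then goes through the custom_ops field of CompilationConfig, where a "+" prefix enables an op and "-" disables it, for example:

```python
from vllm import LLM
from vllm.config import CompilationConfig

# Example only: the model and op names are placeholders.
llm = LLM(
    model="facebook/opt-125m",
    compilation_config=CompilationConfig(custom_ops=["+fused_add"]),
)
```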

Would you like a fuller step-by-step breakdown of how to properly register and use a custom Triton op in vLLM, with complete code snippets?

Hint: Mention @RunLLM in the post for followups.

Yes, please. I'd like a step-by-step breakdown of how to properly register and use a custom Triton op in vLLM, including code snippets.