How can I set up custom Triton ops in vLLM?

I attempted to replace the Qwen3 FFN with my own fused Triton kernel. To do this, I created a new file model_executor/custom_ops/qwen3_fused_ffn.py that defines both the kernel qwen3_swiglu_fused_kernel and its API wrapper qwen3_swiglu_fused, then modified the forward method of the Qwen3MLP class in model_executor/models/qwen3.py to call qwen3_swiglu_fused. However, when I launch vllm serve, the process hangs right after printing "Compiling a graph for general shape takes 59.56s".

How should I use my custom triton ops?

It seems likely that simply defining your custom Triton kernel and calling it in the model's forward method is not enough for vLLM to use the op correctly, especially under torch.compile and vLLM's custom op infrastructure. vLLM expects custom ops to be registered following PyTorch's guidelines: register a schema for the op, provide a meta ("fake") implementation so shapes can be inferred during tracing, and integrate the op with the compilation system. If you want the op to be used during graph compilation, also register it as a subclass of CustomOp and enable it in the CompilationConfig (see the custom_ops field). Without these steps, torch.compile may attempt to trace through the raw Triton launch, which can cause the long compile times or hangs you observed. See the contributing guide on adding or changing kernels and the custom op documentation for details.

Would you like a step-by-step breakdown of how to properly register and use a custom Triton op in vLLM, including code snippets?


Yes, please. I need a step-by-step breakdown of how to properly register and use a custom Triton op in vLLM, including code snippets.