I have some custom Triton ops that perform well in enforce-eager mode. However, when I enable CUDA graph mode, those ops get split into several pieces by the compiler, which hurts performance. How can I keep those Triton ops intact while still enabling graph optimization for the other ops?
To prevent your custom Triton ops from being fused or replaced under CUDA Graph (cudagraph) mode, use the `custom_ops` and `splitting_ops` options in vLLM's `CompilationConfig`. Set `custom_ops` to enable your specific ops (e.g., `custom_ops=["none", "+my_triton_op"]`), and add your custom op names to `splitting_ops` so that vLLM splits the graph at those ops. The split ops stay outside the captured cudagraphs and execute in their original form, while the rest of the model still benefits from cudagraph optimization. See the documentation for details on these fields and usage examples in the config API (docs: `custom_ops`, `splitting_ops`).
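As a rough sketch, the configuration could look like the following. The op name `my_triton_op` and the model name are placeholders, not values from this thread; the exact string you pass must match the name your op is registered under in vLLM:

```python
from vllm import LLM
from vllm.config import CompilationConfig

# Hypothetical op name -- replace "my_triton_op" with the name your
# custom op is actually registered under.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    compilation_config=CompilationConfig(
        # "none" disables all custom ops by default, then
        # "+my_triton_op" re-enables just your op.
        custom_ops=["none", "+my_triton_op"],
        # Split the compiled graph at this op so it stays outside
        # the captured cudagraphs and runs in its original form.
        splitting_ops=["my_triton_op"],
    ),
)
```

The same settings can also be passed on the command line via `--compilation-config` as a JSON string when launching `vllm serve`.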
Would you like a step-by-step example or code snippet for this configuration?