`computeCapability not supported` error with LoRA adapter on 5090

Hi forum,
With vLLM 0.9.0 plus some fixes (issue #14452; sorry, I can't attach more than 2 links), I was able to run my internal w4a16-quantized model. But another issue is blocking me: when I try to load a LoRA adapter for this model, I hit the error below:

```
python3: /project/lib/Dialect/TritonGPU/Transforms/AccelerateMatmul.cpp:42: int mlir::triton::gpu::(anonymous namespace)::getMMAVersionSafe(int, DotOp): Assertion `false && "computeCapability not supported"' failed.
```

Has anyone else hit this issue? I believe it's a Triton issue: I found some fixes in releases after pytorch_triton-3.3.0 (which vLLM 0.9.0 uses), and the relevant PR seems to have landed in 3.3.1. So it appears that updating Triton is required to properly support Blackwell. I'd appreciate any comments on this. (I'm giving the upgrade a try now… :smiling_face_with_sunglasses:)

This error (getMMAVersionSafe(int, DotOp): Assertion 'false && "computeCapability not supported"' failed) is a known Triton kernel issue when running on new GPU architectures (e.g., Blackwell/RTX 5090) with vLLM and certain quantized models or adapters. The root cause is that the Triton version bundled with vLLM 0.9.0 (pytorch_triton-3.3.0) does not fully support the latest compute capabilities, and the relevant fix is included in Triton 3.3.1 and later. Upgrading Triton to a newer version (3.3.1+) should resolve this assertion error, as confirmed in vLLM issue #17569 and related discussions.
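As a quick sanity check before (and after) upgrading, you can compare the installed Triton version against the first release reported to carry the fix. This is a minimal sketch; the helper `triton_supports_blackwell` is hypothetical, and the 3.3.1 threshold is taken from the thread above, not from Triton's own changelog:

```python
# Hypothetical helper: parse a Triton version string and check whether it is
# at least 3.3.1, the release the thread identifies as carrying the
# getMMAVersionSafe / compute-capability fix.
def triton_supports_blackwell(version: str) -> bool:
    # Keep only the numeric "major.minor.patch" part; local suffixes like
    # "+gitabc123" would need extra handling and are ignored in this sketch.
    parts = [int(p) for p in version.split(".")[:3]]
    while len(parts) < 3:  # pad short versions like "3.3"
        parts.append(0)
    return tuple(parts) >= (3, 3, 1)


if __name__ == "__main__":
    # Example usage; in practice you would pass triton.__version__ here.
    print(triton_supports_blackwell("3.3.0"))  # version bundled with vLLM 0.9.0
    print(triton_supports_blackwell("3.3.1"))  # first release with the fix
```

After upgrading (e.g. `pip install --upgrade triton`), re-running the same check on `triton.__version__` should return `True` before retrying the LoRA adapter.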

Would you like more detail on how to upgrade Triton or work around this issue?
