Hi forum,
With vLLM 0.9.0 plus some fixes (issue #14452; sorry, I can't attach more than 2 links), I was able to run my internal w4a16-quantized model. But another issue is blocking me: when I try to use an adapter with this model, I hit the error below:
python3: /project/lib/Dialect/TritonGPU/Transforms/AccelerateMatmul.cpp:42: int mlir::triton::gpu::(anonymous namespace)::getMMAVersionSafe(int, DotOp): Assertion `false && "computeCapability not supported"' failed.
Has anybody hit the same issue? I believe it's a Triton issue: I found some fixes applied in releases after pytorch_triton-3.3.0 (which vLLM 0.9.0 uses), and it seems the relevant PR landed in 3.3.1. So, to better support Blackwell, updating Triton apparently would be required. I'd appreciate any comments on this. (I'm giving it a try now...)
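For context, the assertion in AccelerateMatmul.cpp fires when Triton's MMA-version lookup doesn't recognize the GPU's compute capability. Here's a minimal Python sketch of that kind of check; the mapping below is purely hypothetical for illustration, not Triton's actual table:

```python
def get_mma_version(compute_capability: int) -> int:
    """Sketch of a capability -> MMA-version lookup.

    The mapping here is a placeholder, NOT Triton's real table. The point
    is the failure mode: an older release that doesn't know a newer
    capability (e.g. Blackwell) hits the "not supported" assertion
    instead of returning a version.
    """
    known = {80: 2, 90: 3}  # hypothetical entries for illustration only
    if compute_capability not in known:
        raise AssertionError("computeCapability not supported")
    return known[compute_capability]
```

If that's what's happening, a Triton build that adds Blackwell's capability to the table (as the 3.3.1 fixes appear to do) should make the assertion go away.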