Why does the Marlin CUDA kernel not seem fast?

@cyk2018 could you say which GPU you are using? Marlin is designed to work best on Ampere GPUs (such as the A100), since that is the floor of its hardware support. The benefit of Marlin is that it runs across many hardware generations and many quantization formats, but it takes a performance hit for that generality. So when running on Hopper or Blackwell, it simply won't be able to fully utilize the hardware. We have other kernels for weight-only quantization, such as Machete for Hopper, so it would be good to know your hardware.
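
If it helps, here is a minimal sketch for reporting your GPU and its compute capability. It assumes PyTorch is installed with CUDA support. `arch_name` is a hypothetical helper written for this reply, not part of any library, and its mapping only covers the common architecture families.

```python
def arch_name(major: int, minor: int) -> str:
    """Map a CUDA compute capability to an architecture family name.

    Hypothetical helper for illustration; covers common families only.
    """
    if major == 7:
        return "Turing" if minor == 5 else "Volta"
    if major == 8:
        # sm_89 is Ada Lovelace; the other 8.x parts are Ampere.
        return "Ada" if minor == 9 else "Ampere"
    if major == 9:
        return "Hopper"
    if major in (10, 12):
        return "Blackwell"
    return "unknown"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        # Query the first visible device.
        major, minor = torch.cuda.get_device_capability(0)
        name = torch.cuda.get_device_name(0)
        print(f"{name}: sm_{major}{minor} ({arch_name(major, minor)})")
    else:
        print("No CUDA device visible to PyTorch")
```

Posting the printed line, along with the model and quantization format you are running, should be enough to tell whether Marlin is a good fit for your hardware.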