Why does the Marlin CUDA kernel not run fast?

Hello, everyone. When I run the Qwen3-32B-AWQ model with vLLM, the profiling results show that the Marlin kernel is slow. I think this operator is memory bound, but when I calculate the memory bandwidth utilization rate, I find it is below about 60%. Why does this happen, and how can we fix it?

The Marlin kernel for AWQ/GPTQ quantized models in vLLM is indeed memory bandwidth bound, especially at large batch sizes or long context lengths. Profiling and benchmarks show that Marlin achieves high speedups at small to moderate batch sizes, but its efficiency drops (to ~60% or less of peak bandwidth) as batch size increases, due to memory access patterns and hardware limitations. This is a known limitation, and recent Marlin updates have improved large-batch performance, but the kernel is still fundamentally limited by GPU memory bandwidth, not compute. There is ongoing work to further optimize this, but no simple fix exists: performance is already close to the hardware's memory ceiling for these workloads. See the detailed discussion and benchmarks in vllm#2497 and vllm#11286.
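To make "memory bandwidth bound" concrete, here is a rough roofline-style sketch of how you could estimate achieved bandwidth for one 4-bit weight-only quantized GEMM from profiler timings. The shapes, group size, and kernel time below are illustrative assumptions, not vLLM internals or measured Qwen3-32B numbers:

```python
# Rough roofline-style estimate of achieved DRAM bandwidth for a 4-bit
# weight-only quantized GEMM. All shapes and timings are assumptions
# for illustration, not actual vLLM/Marlin internals.

def gemm_bytes_moved(m, n, k, weight_bits=4, act_bytes=2, group_size=128):
    """Approximate DRAM traffic (bytes) for one (M,K) x (K,N) quantized GEMM."""
    weights = k * n * weight_bits // 8           # packed low-bit weights
    scales = (k // group_size) * n * act_bytes   # per-group dequant scales
    activations = m * k * act_bytes              # fp16 input activations
    output = m * n * act_bytes                   # fp16 output
    return weights + scales + activations + output

def achieved_bw_gbs(bytes_moved, kernel_time_us):
    """Convert traffic + kernel time into GB/s."""
    return bytes_moved / (kernel_time_us * 1e-6) / 1e9

# Hypothetical layer shape, batch-1 decode (m=1 dominates token generation):
moved = gemm_bytes_moved(m=1, n=5120, k=5120)
# If the profiler reported ~10 us for this kernel (assumed number):
bw = achieved_bw_gbs(moved, kernel_time_us=10.0)
util = bw / 2039.0  # divide by the GPU's peak HBM bandwidth in GB/s
```

At m=1 the weight matrix dominates the traffic, which is why decode-time quantized GEMMs are bandwidth bound: compute per byte moved is tiny. Comparing `bw` against your GPU's spec-sheet peak gives the utilization figure being discussed in this thread.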

Would you like a step-by-step breakdown of the memory bottleneck and possible tuning strategies?

Hint: Mention @RunLLM in the post for followups.

@RunLLM My question is: why is the measured memory bandwidth utilization rate low, even though the operator is memory bound?

This is expected: even though the Marlin kernel is memory bound, actual memory bandwidth utilization is often well below 100% due to non-ideal memory access patterns, fragmentation, and hardware constraints. The kernel may not fully saturate the GPU’s theoretical bandwidth because of factors like uncoalesced accesses, cache misses, or suboptimal alignment, especially with quantized weights. This is a common limitation for memory-bound operators on GPUs, as discussed in vllm#11286.

Would you like more detail on how these factors limit memory bandwidth utilization in practice?


@cyk2018 could you say what GPU you are using? Marlin is designed to work best on Ampere GPUs (aka A100s) since that is the floor of hardware support. The benefit of Marlin is that it can run on so many hardware generations and so many quantization formats, so it takes a performance hit due to that generality. So when running on Hopper or Blackwell, it simply won’t be able to utilize the hardware fully. We have some other kernels for weight-only quant like Machete for Hopper, so it would be good to know your hardware.

Very happy to hear your reply. My GPU is an A800-80G, so I am confused why Marlin seems to reach only 30-40% of the theoretical memory bandwidth.
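A quick back-of-envelope check shows how a 30-40% figure can arise at batch-1 decode. The step time below is an assumed example, and the peak bandwidth is taken from the A100/A800 80GB SXM spec (~2039 GB/s; the PCIe variant is slightly lower):

```python
# Back-of-envelope memory utilization for batch-1 decode of a 4-bit model.
# PEAK_GBS and the 20 ms step time are assumptions for illustration.

PEAK_GBS = 2039.0  # A100/A800 80GB SXM HBM2e peak bandwidth (spec value)

def decode_utilization(n_params_billion, weight_bits, ms_per_token):
    """Fraction of peak bandwidth used if only weight traffic is counted."""
    weight_gb = n_params_billion * weight_bits / 8   # GB streamed per token
    achieved_gbs = weight_gb / (ms_per_token / 1000)
    return achieved_gbs / PEAK_GBS

# Qwen3-32B at 4 bits: ~16 GB of weights must stream through per token.
# If the measured decode step were 20 ms, utilization would be ~0.39.
util = decode_utilization(32, 4, 20.0)
```

If your measured per-token latency plugged into this gives 30-40%, the gap to peak comes from the overheads discussed above (sector efficiency, scales/activations traffic, kernel launch and attention time sharing the step), rather than from Marlin doing redundant work.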