Why does the Marlin CUDA kernel not run fast?

Hello, everyone. When I run the Qwen3-32B-AWQ model with vLLM, the profiling results show that the Marlin kernel is slow. I think this operator is memory bound, but when I calculate the memory bandwidth utilization rate, I find it is below about 60%. Why does this happen, and how can we fix it?

The Marlin kernel for AWQ/GPTQ quantized models in vLLM is indeed memory bandwidth bound, especially at large batch sizes or long context lengths. Profiling and benchmarks show that Marlin achieves high speedups at small to moderate batch sizes, but its efficiency drops (to ~60% or less of peak bandwidth) as batch size increases, due to memory access patterns and hardware limitations. This is a known limitation, and recent Marlin updates have improved large-batch performance, but the kernel is still fundamentally limited by GPU memory bandwidth, not compute. There is ongoing work to further optimize this, but no simple fix exists: performance is already close to the hardware's memory ceiling for these workloads. See the detailed discussion and benchmarks in vllm#2497 and vllm#11286.
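To make "memory bandwidth bound" concrete, here is a rough roofline-style sketch of how you could estimate achieved bandwidth for one 4-bit weight-only quantized GEMM from profiler timings. The shapes, group size, and kernel time below are illustrative assumptions, not vLLM internals or measured Qwen3-32B numbers:

```python
# Rough roofline-style estimate of achieved DRAM bandwidth for a 4-bit
# weight-only quantized GEMM. All shapes and timings are assumptions
# for illustration, not actual vLLM/Marlin internals.

def gemm_bytes_moved(m, n, k, weight_bits=4, act_bytes=2, group_size=128):
    """Approximate DRAM traffic (bytes) for one (M,K) x (K,N) quantized GEMM."""
    weights = k * n * weight_bits // 8           # packed low-bit weights
    scales = (k // group_size) * n * act_bytes   # per-group dequant scales
    activations = m * k * act_bytes              # fp16 input activations
    output = m * n * act_bytes                   # fp16 output
    return weights + scales + activations + output

def achieved_bw_gbs(bytes_moved, kernel_time_us):
    """Convert traffic + kernel time into GB/s."""
    return bytes_moved / (kernel_time_us * 1e-6) / 1e9

# Hypothetical layer shape, batch-1 decode (m=1 dominates token generation):
moved = gemm_bytes_moved(m=1, n=5120, k=5120)
# If the profiler reported ~10 us for this kernel (assumed number):
bw = achieved_bw_gbs(moved, kernel_time_us=10.0)
util = bw / 2039.0  # divide by the GPU's peak HBM bandwidth in GB/s
```

At m=1 the weight matrix dominates the traffic, which is why decode-time quantized GEMMs are bandwidth bound: compute per byte moved is tiny. Comparing `bw` against your GPU's spec-sheet peak gives the utilization figure being discussed in this thread.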

Would you like a step-by-step breakdown of the memory bottleneck and possible tuning strategies?

Hint: Mention @RunLLM in the post for followups.

@RunLLM My question is: why is the measured memory bandwidth utilization rate low, even though the operator is memory bound?

This is expected: even though the Marlin kernel is memory bound, actual memory bandwidth utilization is often well below 100% due to non-ideal memory access patterns, fragmentation, and hardware constraints. The kernel may not fully saturate the GPU’s theoretical bandwidth because of factors like uncoalesced accesses, cache misses, or suboptimal alignment, especially with quantized weights. This is a common limitation for memory-bound operators on GPUs, as discussed in vllm#11286.

Would you like more detail on how these factors limit memory bandwidth utilization in practice?


@cyk2018 could you say what GPU you are using? Marlin is designed to work best on Ampere GPUs (aka A100s) since that is the floor of hardware support. The benefit of Marlin is that it can run on so many hardware generations and so many quantization formats, so it takes a performance hit due to that generality. So when running on Hopper or Blackwell, it simply won’t be able to utilize the hardware fully. We have some other kernels for weight-only quant like Machete for Hopper, so it would be good to know your hardware.

Very happy to hear your reply. My GPU is an A800-80G, so I am confused why Marlin seems to reach only 30-40% of the theoretical memory bandwidth.
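A quick back-of-envelope check shows how a 30-40% figure can arise at batch-1 decode. The step time below is an assumed example, and the peak bandwidth is taken from the A100/A800 80GB SXM spec (~2039 GB/s; the PCIe variant is slightly lower):

```python
# Back-of-envelope memory utilization for batch-1 decode of a 4-bit model.
# PEAK_GBS and the 20 ms step time are assumptions for illustration.

PEAK_GBS = 2039.0  # A100/A800 80GB SXM HBM2e peak bandwidth (spec value)

def decode_utilization(n_params_billion, weight_bits, ms_per_token):
    """Fraction of peak bandwidth used if only weight traffic is counted."""
    weight_gb = n_params_billion * weight_bits / 8   # GB streamed per token
    achieved_gbs = weight_gb / (ms_per_token / 1000)
    return achieved_gbs / PEAK_GBS

# Qwen3-32B at 4 bits: ~16 GB of weights must stream through per token.
# If the measured decode step were 20 ms, utilization would be ~0.39.
util = decode_utilization(32, 4, 20.0)
```

If your measured per-token latency plugged into this gives 30-40%, the gap to peak comes from the overheads discussed above (sector efficiency, scales/activations traffic, kernel launch and attention time sharing the step), rather than from Marlin doing redundant work.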