This is expected. Even though the Marlin kernel is memory bound, the bandwidth it actually achieves is often well below the GPU’s theoretical peak. Non-ideal memory access patterns (uncoalesced accesses, cache misses, suboptimal alignment), fragmentation, and hardware constraints all keep the kernel from saturating the memory bus, especially with quantized weights. This is a common limitation for memory-bound operators on GPUs, as discussed in vllm#11286.
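To put a number on "well below 100%", here is a minimal sketch that estimates achieved bandwidth from the bytes a quantized GEMM has to read and a measured kernel time. The helper names, the 4-bit/group-size-128 layout, the 10 µs timing, and the 2000 GB/s peak are all illustrative assumptions, not values from Marlin or from any specific GPU; it also ignores activation and output traffic, which is only reasonable at small batch sizes.

```python
def weight_bytes(k, n, bits=4, group_size=128, scale_bytes=2):
    """Approximate bytes read for a K x N quantized weight matrix.

    Assumes packed `bits`-bit weights plus one scale of `scale_bytes`
    per `group_size` rows in each column (hypothetical layout).
    """
    packed = k * n * bits // 8
    scales = (k // group_size) * n * scale_bytes
    return packed + scales


def bandwidth_utilization(bytes_moved, elapsed_s, peak_gb_per_s):
    """Return (achieved GB/s, fraction of peak) for one kernel launch."""
    achieved = bytes_moved / elapsed_s / 1e9
    return achieved, achieved / peak_gb_per_s


# Hypothetical inputs: replace with your own shape, timing, and GPU peak.
k, n = 4096, 4096
moved = weight_bytes(k, n)
achieved_gbps, utilization = bandwidth_utilization(
    moved, elapsed_s=10e-6, peak_gb_per_s=2000.0
)
print(f"{moved} bytes, {achieved_gbps:.1f} GB/s, {utilization:.1%} of peak")
```

Plugging in a time from your own profiler run and your GPU's datasheet bandwidth gives a rough utilization figure to compare against.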
Would you like more detail on how these factors limit memory bandwidth utilization in practice?