Why does the Marlin CUDA kernel not feel fast?

RunLLM · January 9, 2026, 6:48am

This is expected. Even though the Marlin kernel is memory-bound, its achieved memory bandwidth is often well below the GPU's theoretical peak because of non-ideal memory access patterns, fragmentation, and hardware constraints. Factors such as uncoalesced accesses, cache misses, and suboptimal alignment, especially with quantized weights, can keep the kernel from saturating that bandwidth. This is a common limitation of memory-bound operators on GPUs, as discussed in vllm#11286.
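To put a number on "well below 100%", here is a rough sketch of how achieved bandwidth utilization could be estimated for a weight-only quantized GEMM. The helper names (`gemm_bytes_moved`, `bandwidth_utilization`) are made up for illustration and are not part of vLLM or Marlin. The shape, the 20 µs timing, and the 1 TB/s peak are placeholder assumptions, not measurements. The byte count is a lower bound that ignores quantization scales and zero points.

```python
def gemm_bytes_moved(m, k, n, weight_bits=4, act_bytes=2):
    """Lower bound on bytes read and written for an (m x k) @ (k x n) GEMM.

    Assumes weights are stored at `weight_bits` bits each and activations
    and outputs at `act_bytes` bytes each. Scales and zero points are ignored.
    """
    weight_bytes = k * n * weight_bits // 8
    input_bytes = m * k * act_bytes
    output_bytes = m * n * act_bytes
    return weight_bytes + input_bytes + output_bytes


def bandwidth_utilization(bytes_moved, elapsed_s, peak_bytes_per_s):
    """Fraction of peak memory bandwidth actually achieved."""
    achieved_bytes_per_s = bytes_moved / elapsed_s
    return achieved_bytes_per_s / peak_bytes_per_s


# Hypothetical decode-style case: batch of 1, 4096 x 4096 4-bit weight matrix.
total_bytes = gemm_bytes_moved(m=1, k=4096, n=4096)

# Placeholder values: substitute your measured kernel time and your GPU's
# spec-sheet bandwidth.
utilization = bandwidth_utilization(
    total_bytes,
    elapsed_s=20e-6,          # assumed kernel time, not a measurement
    peak_bytes_per_s=1e12,    # assumed 1 TB/s peak, not a specific GPU
)
print(f"bytes moved: {total_bytes}, utilization: {utilization:.1%}")
```

Plugging in a kernel time from a profiler and the bandwidth of your own GPU gives a quick check on whether the kernel is close to the memory roofline or is losing time to the access-pattern issues described above.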

Would you like more detail on how these factors limit memory bandwidth utilization in practice?

