Context: The following issue concerns the FP8 blockwise matmul CUTLASS kernel. The matmul takes a weight tensor (ndim: 2) and a weight_scale tensor (ndim: 2). However, I found the following difference in constraints between sm90 and sm100.
I would like to understand why there are two different types of constraints. For example, the DeepSeekV3 model uses kv_a_proj_with_mqa, which does not satisfy the sm90 criteria but does satisfy the sm100 criteria. Does that mean we cannot run DeepSeekV3 on sm90?
The difference in the blockwise FP8 matmul constraints between SM90 and SM100 comes from how each architecture’s kernel expects the scale tensor shape to align with the weight tensor. On SM90, the requirement is ceil(weight.shape[0] / weight_scale.shape[0]) == 128 && ceil(weight.shape[1] / weight_scale.shape[1]) == 128: dividing each weight dimension by the corresponding scale dimension must come out to exactly 128, i.e. the scale tensor must tile the weight tensor in 128x128 blocks. On SM100, the check is ceil(weight.shape[0] / 128) == weight_scale.shape[0] && ceil(weight.shape[1] / 128) == weight_scale.shape[1]: the scale tensor shape must equal the number of 128-sized blocks along each dimension, without relying on the same “group” broadcasting logic as SM90. The two checks agree when every weight dimension is a multiple of 128, but they diverge when a dimension is not: a dimension of 576 has ceil(576 / 128) == 5 scale entries, which satisfies the SM100 check, while the SM90 check computes ceil(576 / 5) == 116 != 128 and fails. This difference is due to kernel implementation details and how each architecture’s kernel handles blockwise scaling and broadcasting (see code and comments).
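As a rough illustration, here is a minimal Python sketch of the two shape checks as described above. The function names, the use of math.ceil, and the block-size argument are illustrative, not the actual kernel code:

```python
import math

def satisfies_sm90(weight_shape, scale_shape, block=128):
    # SM90: dividing each weight dimension by the corresponding scale
    # dimension must yield exactly the block size (128).
    return (math.ceil(weight_shape[0] / scale_shape[0]) == block and
            math.ceil(weight_shape[1] / scale_shape[1]) == block)

def satisfies_sm100(weight_shape, scale_shape, block=128):
    # SM100: the scale tensor must have exactly one entry per 128-wide
    # block of the weight tensor in each dimension.
    return (math.ceil(weight_shape[0] / block) == scale_shape[0] and
            math.ceil(weight_shape[1] / block) == scale_shape[1])
```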
As a result, if a layer in a model like DeepSeekV3 has weight/scale shapes that satisfy the SM100 constraint but not the SM90 one, that layer cannot run through the blockwise FP8 kernel on SM90; the kernel raises a shape error (see error example). This is a hard limitation of the current kernel implementations, not just a documentation artifact. For such models you must use SM100 (Blackwell) or adjust the quantization/scaling scheme to match SM90’s requirements (see also).
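For illustration, reusing the helper functions from the sketch above and assuming kv_a_proj_with_mqa has a (576, 7168) weight with a per-128-block scale of shape (5, 56) — these shapes are an assumption based on typical DeepSeekV3 configurations, not taken from the issue:

```python
import math

weight_shape = (576, 7168)  # assumed kv_a_proj_with_mqa weight shape
scale_shape = (math.ceil(576 / 128), math.ceil(7168 / 128))  # (5, 56)

print(satisfies_sm90(weight_shape, scale_shape))   # False: ceil(576 / 5) == 116, not 128
print(satisfies_sm100(weight_shape, scale_shape))  # True: ceil(576 / 128) == 5 and ceil(7168 / 128) == 56
```

Because 576 is not a multiple of 128, the SM90 check can never be satisfied by a per-128-block scale tensor for this layer, which is exactly the situation described above.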