Context: The following issue concerns the FP8 blockwise matmul CUTLASS kernel. The matmul takes a weight tensor (ndim: 2) and a weight_scale tensor (ndim: 2). However, I found the following difference in constraints between sm90 and sm100.
I would like to understand why there are two different types of constraints. For example, the DeepSeekV3 model uses kv_a_proj_with_mqa, which does not satisfy the sm90 criteria but does satisfy the sm100 criteria. Does that mean we cannot run DeepSeekV3 on sm90?
The difference in the blockwise FP8 matmul constraints between SM90 and SM100 comes from how each architecture’s kernel expects the scale tensor shape to align with the weight tensor. On SM90, the requirement is ceil(weight.shape[0] / weight_scale.shape[0]) == 128 && ceil(weight.shape[1] / weight_scale.shape[1]) == 128: dividing each weight dimension by the corresponding scale dimension must come out to exactly 128, i.e. the scale tensor must tile the weight tensor in 128x128 blocks. On SM100, the check is ceil(weight.shape[0] / 128) == weight_scale.shape[0] && ceil(weight.shape[1] / 128) == weight_scale.shape[1]: the scale tensor shape must equal the number of 128-sized blocks along each dimension, without relying on the same “group” broadcasting logic as SM90. The two checks agree when every weight dimension is a multiple of 128, but they diverge when a dimension is not: a dimension of 576 has ceil(576 / 128) == 5 scale entries, which satisfies the SM100 check, while the SM90 check computes ceil(576 / 5) == 116 != 128 and fails. This difference is due to kernel implementation details and how each architecture’s kernel handles blockwise scaling and broadcasting (see code and comments).
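As a rough illustration, here is a minimal Python sketch of the two shape checks as described above. The function names, the use of math.ceil, and the block-size argument are illustrative, not the actual kernel code:

```python
import math

def satisfies_sm90(weight_shape, scale_shape, block=128):
    # SM90: dividing each weight dimension by the corresponding scale
    # dimension must yield exactly the block size (128).
    return (math.ceil(weight_shape[0] / scale_shape[0]) == block and
            math.ceil(weight_shape[1] / scale_shape[1]) == block)

def satisfies_sm100(weight_shape, scale_shape, block=128):
    # SM100: the scale tensor must have exactly one entry per 128-wide
    # block of the weight tensor in each dimension.
    return (math.ceil(weight_shape[0] / block) == scale_shape[0] and
            math.ceil(weight_shape[1] / block) == scale_shape[1])
```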
As a result, if a layer in a model like DeepSeekV3 has weight/scale shapes that satisfy the SM100 constraint but not the SM90 one, that layer cannot run through the blockwise FP8 kernel on SM90; the kernel raises a shape error (see error example). This is a hard limitation of the current kernel implementations, not just a documentation artifact. For such models you must use SM100 (Blackwell) or adjust the quantization/scaling scheme to match SM90’s requirements (see also).
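For illustration, reusing the helper functions from the sketch above and assuming kv_a_proj_with_mqa has a (576, 7168) weight with a per-128-block scale of shape (5, 56) — these shapes are an assumption based on typical DeepSeekV3 configurations, not taken from the issue:

```python
import math

weight_shape = (576, 7168)  # assumed kv_a_proj_with_mqa weight shape
scale_shape = (math.ceil(576 / 128), math.ceil(7168 / 128))  # (5, 56)

print(satisfies_sm90(weight_shape, scale_shape))   # False: ceil(576 / 5) == 116, not 128
print(satisfies_sm100(weight_shape, scale_shape))  # True: ceil(576 / 128) == 5 and ceil(7168 / 128) == 56
```

Because 576 is not a multiple of 128, the SM90 check can never be satisfied by a per-128-block scale tensor for this layer, which is exactly the situation described above.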