The FP8 GEMM implementation for SM100 (Blackwell) in vLLM appears to use the TN (transpose-normal) layout because the underlying CUTLASS kernel definitions for FP8 on this architecture are optimized for k-major operands: with TN, the reduction dimension K is the contiguous axis of both input matrices. This is reflected in the kernel configurations and static assertions in the code (e.g. in scaled_mm_sm100_fp8_dispatch.cuh and machete_mainloop.cuh), which require the input tensors to be in specific layouts so that memory accesses match the operand layouts expected by the FP8 tensor-core instructions. Using TN therefore maximizes memory-access efficiency and keeps the inputs compatible with the hardware's matrix-multiplication instructions, as discussed in the kernel and mainloop code comments for the SM100 FP8 kernels.
Would you like a more detailed step-by-step explanation of the kernel design and memory layout rationale?