Why does CUTLASS FP8 GEMM on SM100 use the TN layout?

I noticed that the FP8 GEMM for SM100 is implemented with the TN layout, and I am curious about the reason.

And by the way, I am also curious how the CUTLASS GEMM configuration is chosen for different problem sizes, as in vllm/csrc/quantization/cutlass_w8a8/c3x/scaled_mm_sm100_fp8_dispatch.cuh at a1213fae5f8fe624c4b954768d23a86e384abdb1 · vllm-project/vllm · GitHub

Thanks for your reply.

It seems likely that the FP8 GEMM implementation for SM100 (Blackwell) in vLLM uses the TN (transpose-normal) layout because the underlying tensor-core instructions and the CUTLASS kernel definitions for FP8 are built around it. In this convention, A is row-major (M × K) and B is column-major (K × N), so both operands are contiguous along K, i.e. k-major. The static assertions in the kernel configuration require the input tensors to be in these layouts so that global- and shared-memory access patterns match what the FP8 matrix-multiplication instructions expect, which maximizes performance and hardware compatibility. This is reflected in the kernel and mainloop code comments and in the configuration files for the SM100 FP8 kernels (scaled_mm_sm100_fp8_dispatch.cuh, machete_mainloop.cuh).
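To make the layout argument concrete, here is a tiny CPU-side sketch (plain C++, purely illustrative, not CUTLASS code) of what the TN convention implies: with A row-major and B column-major, the innermost k-loop reads both operands with unit stride, mirroring the k-major operand access the FP8 tensor-core path is built around.

```cpp
#include <array>

// Reference GEMM in the "TN" convention:
//   A is M x K, row-major    -> element (m, k) at a[m * K + k], unit stride in k
//   B is K x N, column-major -> element (k, n) at b[n * K + k], unit stride in k
// Both operands are k-major, so the innermost loop walks contiguous memory
// for A and B alike.
template <int M, int N, int K>
std::array<float, M * N> gemm_tn(const std::array<float, M * K>& a,
                                 const std::array<float, K * N>& b) {
  std::array<float, M * N> c{};
  for (int m = 0; m < M; ++m) {
    for (int n = 0; n < N; ++n) {
      float acc = 0.0f;
      for (int k = 0; k < K; ++k) {
        acc += a[m * K + k] * b[n * K + k];  // both reads are stride-1 in k
      }
      c[m * N + n] = acc;
    }
  }
  return c;
}
```

If A or B were stored the other way (mn-major), one of those two inner reads would stride by the leading dimension instead, which is the access pattern the TN requirement avoids.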

Would you like a more detailed step-by-step explanation of the kernel design and memory layout rationale?
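On the second question, the dispatch file follows a common pattern: since N and K are fixed by the weight matrix in LLM serving, the heuristics branch mainly on the runtime M (token count) and select a pre-instantiated kernel configuration per bucket. A minimal sketch of that pattern, with entirely hypothetical names and thresholds (the real buckets and tile shapes live in scaled_mm_sm100_fp8_dispatch.cuh):

```cpp
#include <cstdint>
#include <string>

// Hypothetical kernel configuration record; in vLLM each bucket maps to a
// concrete CUTLASS kernel instantiation (tile shape, cluster shape, schedule).
struct GemmConfig {
  int tile_m;
  int tile_n;
  std::string name;
};

// Illustrative M-based bucketing, not vLLM's actual thresholds:
// small M favors smaller tiles (latency-bound decode), large M favors
// bigger tiles for occupancy and throughput (prefill).
inline GemmConfig select_config(int64_t m) {
  if (m <= 16)  return {64, 64, "small_m"};
  if (m <= 256) return {128, 128, "medium_m"};
  return {256, 128, "large_m"};
}
```

The thresholds are typically found empirically by sweeping representative (M, N, K) shapes and picking the fastest instantiation per bucket.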
