The FP8 GEMM implementation for SM100 (Blackwell) in vLLM appears to use the TN (transpose-normal) layout because the underlying CUTLASS kernel definitions for FP8 on this architecture are optimized for k-major operands: with TN, the reduction dimension K is the contiguous axis of both input matrices. This is reflected in the kernel configurations and static assertions in the code (e.g. in scaled_mm_sm100_fp8_dispatch.cuh and machete_mainloop.cuh), which require the input tensors to be in specific layouts so that memory accesses match the operand layouts expected by the FP8 tensor-core instructions. Using TN therefore maximizes memory-access efficiency and keeps the inputs compatible with the hardware's matrix-multiplication instructions, as discussed in the kernel and mainloop code comments for the SM100 FP8 kernels.
Would you like a more detailed step-by-step explanation of the kernel design and memory layout rationale?