Colfax on Blackwell GEMMs

CUTLASS Tutorial: Writing GEMM Kernels Using Tensor Memory For NVIDIA® Blackwell GPUs – Colfax Research

Dives deep into TMEM in particular, and into the trend over the last few NVIDIA generations of special-casing GEMMs in hardware:

Tensor Memory and UMMA do for MMA just what TMA did for copy, making it a single-threaded, asynchronous operation that does not consume registers. As a result, registers can primarily be used for other tasks like scheduling and fused epilogue operations.

Edit: the link no longer seems to be working! It was a great post though, so hopefully it comes back! Edit edit: it did!
