Dives deep into TMEM in particular, and the trend over the last few Nvidia generations of special-casing GEMMs in hardware:
Tensor Memory and UMMA do for MMA just what TMA did for copy, making it a single-threaded, asynchronous operation that does not consume registers. As a result, registers can primarily be used for other tasks like scheduling and fused epilogue operations.
Edit: the link no longer seems to be working! It was a great post though, so hopefully it comes back! Edit edit: it did!