Colfax on Blackwell GEMMs

CUTLASS Tutorial: Writing GEMM Kernels Using Tensor Memory For NVIDIA® Blackwell GPUs – Colfax Research

Dives deep into TMEM in particular, and into the trend over the last few Nvidia generations of special-casing GEMMs in hardware:

> Tensor Memory and UMMA do for MMA just what TMA did for copy, making it a single-threaded, asynchronous operation that does not consume registers. As a result, registers can primarily be used for other tasks like scheduling and fused epilogue operations.

~~Edit: the link no longer seems to be working! It was a great post though, so hopefully it comes back! Edit edit: it did!~~
