Just got round to reading the intro post to the (now improved) ThunderKittens kernel DSL.
https://hazyresearch.stanford.edu/blog/2024-05-12-tk
Many good nuggets on kernel writing in general and Hopper in particular.
But to us a “register” is a 16×16 tile of data. We think AI wants this — after all this time, it’s still just matrix multiplies, reductions, and reshapes. And we think the hardware wants this, too — small matrix multiplies are just begging for hardware support beyond just the systolic mma.
In fact, more broadly we believe we should really reorient our ideas of AI around what maps well onto the hardware. How big should a recurrent state be? As big as can fit onto an SM. How dense should the compute be? No less so than what the hardware demands.
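To make the tile-as-register idea concrete, here is a toy NumPy sketch of programming against 16×16 tiles as the basic unit, with the matmul/reduction vocabulary the quote mentions. This is not the actual ThunderKittens API; the function names (`tile_mma`, `tile_row_reduce`) are invented for illustration, and the real DSL runs on-GPU in CUDA:

```python
import numpy as np

TILE = 16  # the "register" is a whole 16x16 tile, not a scalar

def tile_mma(a, b, acc):
    """Small matrix-multiply-accumulate on 16x16 tiles (hypothetical helper)."""
    assert a.shape == b.shape == acc.shape == (TILE, TILE)
    return acc + a @ b

def tile_row_reduce(t):
    """Reduce each row of a 16x16 tile to a scalar (sum)."""
    return t.sum(axis=1)

# Build some tiles and express the computation tile-at-a-time.
a = np.eye(TILE)
b = np.arange(TILE * TILE, dtype=np.float64).reshape(TILE, TILE)
acc = np.zeros((TILE, TILE))

out = tile_mma(a, b, acc)    # identity @ b accumulated into zeros -> equals b
rows = tile_row_reduce(out)  # one scalar per row
print(np.allclose(out, b))   # True
```

The point of the abstraction is that the programmer never indexes individual elements: every operation consumes and produces whole tiles, which is exactly the granularity the tensor-core hardware wants.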