Ping-Pong Kernels on Hopper

Deep Dive on CUTLASS Ping-Pong GEMM Kernel | PyTorch

A useful deep dive on this performance technique. The TL;DR: using warp specialization, set up one producer group that loads data (using TMA) and two consumer groups executing MatMuls on the Tensor Cores. When a consumer group finishes, it executes the epilogue (e.g. copying the results elsewhere, though you could imagine doing other work on a CUDA core) while the other consumer group takes over the Tensor Cores. Hence, I presume, the name: Tensor Core usage ping-pongs between the two consumers!

The producer warp group focuses on data movement to fill the shared memory buffers (via TMA). The two other warp groups are dedicated consumers that process the math (MMA) portion on the Tensor Cores, then do any follow-up work and write their results back to global memory (epilogue).
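The payoff of this scheme is that one consumer's epilogue (on CUDA cores) overlaps the other consumer's MMA (on Tensor Cores), keeping the Tensor Cores busy back-to-back. To make that concrete, here is a tiny discrete-time model of the schedule — not real CUTLASS/CUDA code, and the tile counts and per-stage timings (`MMA_T`, `EPI_T`, `TILES`) are made-up illustrative values:

```python
MMA_T, EPI_T, TILES = 4, 3, 8  # hypothetical ticks per MMA / epilogue, tile count

def serial_time():
    # One consumer: MMA then epilogue for every tile, nothing overlaps.
    return TILES * (MMA_T + EPI_T)

def pingpong_time():
    # Two consumers alternate on the shared tensor core; each consumer's
    # epilogue runs off the tensor core, overlapping the other's MMA.
    tc_free = 0       # next tick the tensor core is free
    done = [0, 0]     # tick each consumer finishes its previous epilogue
    finish = 0
    for i in range(TILES):
        c = i % 2                      # consumers take tiles round-robin
        start = max(tc_free, done[c])  # wait for the core and our own epilogue
        tc_free = start + MMA_T        # MMA occupies the tensor core
        done[c] = tc_free + EPI_T      # epilogue overlaps the other's MMA
        finish = max(finish, done[c])
    return finish

print(serial_time(), pingpong_time())  # → 56 35
```

With `EPI_T < MMA_T`, the ping-pong schedule keeps the modeled tensor core saturated (8 × 4 MMA ticks plus one trailing epilogue = 35 ticks vs. 56 serial); if the epilogue were longer than an MMA, the core would start to stall waiting for consumers.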
