Keeping a GPU busy is a lot about tiling

File this under the “gross oversimplifications” category. The basic approach to keeping GPUs busy is to divide the work into tiles: smaller sub-problems that together make up the full result. For a GEMM you might break the output matrix into 128×128 or 128×64 tiles and let each CUDA thread block (CTA) own one tile. The GPU has many streaming multiprocessors (an A100 has 108) and every SM picks up one CTA at a time. If you want to know how many SMs your own card has you can call:

import torch

props = torch.cuda.get_device_properties(0)
print(f"SMs: {props.multi_processor_count}")

Tiles are launched in waves. A full wave is the moment when every SM is busy with exactly one CTA. If the total number of tiles isn’t a multiple of the SM count, the final wave is only partly full and some SMs sit idle; Nvidia calls that wave quantization. There is a similar problem at the edge of the matrix: if the dimensions aren’t multiples of the tile size the right-most or bottom-most tiles are partly empty, wasting threads (tile quantization). Sometimes a smaller tile size (for example 64 × 64) gives higher overall throughput because it leaves less unused space at the edges.
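
To make that concrete, here is the back-of-the-envelope wave arithmetic for an assumed 4096 × 4096 output on an A100-sized GPU (the shape and tile size are illustrative choices, not numbers from any specific workload):

import math

M, N = 4096, 4096            # assumed output shape
BLOCK_M, BLOCK_N = 128, 128  # assumed tile size
num_sms = 108                # A100

tiles = math.ceil(M / BLOCK_M) * math.ceil(N / BLOCK_N)  # 32 * 32 = 1024 tiles
full_waves, tail = divmod(tiles, num_sms)                # 9 full waves + a 52-tile tail
print(f"{tiles} tiles -> {full_waves} full waves, tail wave of {tail} CTAs")
print(f"tail-wave utilization: {tail / num_sms:.0%}")    # ~48% of SMs busy in the last wave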

The usual cure for poor wave utilization is a persistent kernel. Instead of launching one CTA per tile, you launch (roughly) one CTA per SM and have each CTA pull tiles from a global queue until the queue is empty. Because each CTA pulls a new tile as soon as it finishes the previous one, the SMs rarely go idle and the tail effect is reduced.
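
In Triton the scheduling skeleton looks roughly like the sketch below. This is a minimal illustration, not a real GEMM: each program handles many tiles round-robin instead of owning exactly one, and the kernel name and toy elementwise work are placeholders of my own.

import torch
import triton
import triton.language as tl

@triton.jit
def persistent_scale_kernel(x_ptr, out_ptr, n_elements, scale,
                            NUM_PROGRAMS: tl.constexpr, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    num_tiles = tl.cdiv(n_elements, BLOCK)
    # Instead of one program per tile, each program walks the tile list
    # with a stride of NUM_PROGRAMS until the work runs out.
    for tile in range(pid, num_tiles, NUM_PROGRAMS):
        offs = tile * BLOCK + tl.arange(0, BLOCK)
        mask = offs < n_elements
        x = tl.load(x_ptr + offs, mask=mask)
        tl.store(out_ptr + offs, x * scale, mask=mask)

x = torch.randn(1 << 20, device="cuda")
out = torch.empty_like(x)
num_sms = torch.cuda.get_device_properties(0).multi_processor_count
# Launch roughly one program per SM rather than one per tile.
persistent_scale_kernel[(num_sms,)](x, out, x.numel(), 2.0,
                                    NUM_PROGRAMS=num_sms, BLOCK=1024)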

Inside an SM the main performance levers for GEMMs are the Tensor Cores, which execute matrix multiply-accumulate (MMA) instructions efficiently. On Ampere you use WMMA instructions: one warp (32 threads) computes a 16 × 16 fragment at a time. Hopper introduces WGMMA instructions, where four warps acting as a warp-group (128 threads) execute a larger matrix multiply (for example a 64 × 64 tile for FP16/FP8). To issue WGMMA you must place the right-hand operand B in shared memory; A can sit in either registers or shared memory. The operation is asynchronous, so while a warp-group is processing one tile the same CTA can be pre-loading the next tile.
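
From PyTorch you never issue these instructions yourself; the practical lever is making sure the dtypes (and flags) let the underlying GEMM library take a tensor-core path. A small illustrative snippet:

import torch

# FP16 (or BF16) inputs dispatch to tensor-core MMA paths under the hood.
a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
c = a @ b

# For FP32 matmuls, allowing TF32 lets the tensor cores handle those too.
torch.backends.cuda.matmul.allow_tf32 = True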

Blackwell pushes the idea further. A pair of CTAs on neighbouring SMs can cooperate on a single, unified MMA, letting two SMs’ tensor cores work on an even larger tile together.

To make that possible Hopper introduced thread-block clusters and Blackwell extends them. When you launch a kernel you can group CTAs into clusters, and the scheduler guarantees to place them on SMs inside the same GPC (Graphics Processing Cluster), so they share a fast interconnect and can access each other’s shared memory. If the grid doesn’t divide cleanly into whole clusters you again lose efficiency on the tail (is this cluster quantization? stick with the trend, Nvidia!), so Blackwell has a Cluster Launch Control feature that can shrink the last cluster to better fit the remaining work.

Loading Data

All of this only works if the data is actually present in shared memory. The first optimization is making sure global memory access is coalesced. Global memory is served in 32-byte sectors grouped into 128-byte cache lines, so a single fetch from DRAM is much wider than one thread’s request. If the 32 threads of a warp read consecutive 4-byte words (addresses 0, 4, 8, …, 124), the memory controller coalesces them into a single 128-byte read. If the addresses are strided (e.g. hopping across rows), each 32-byte sector that gets fetched carries only one thread’s 4 useful bytes, so the same warp-level load needs many more transactions and takes longer. Getting this right is about ensuring the memory layout suits the kernel, and doing any transforms needed in shared memory before computing.
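
A quick and deliberately rough way to see the effect from PyTorch is to copy the same number of elements with and without a stride (sizes are arbitrary; this is an illustration, not a rigorous benchmark):

import torch

def time_copy(src, iters=100):
    # Crude timing helper using CUDA events.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        src.clone()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds

x = torch.randn(8192, 8192, device="cuda")
# Same element count both ways: 1024 full rows are contiguous (coalesced),
# the first 1024 columns of every row hop 8192 floats between elements.
print(f"rows    (contiguous): {time_copy(x[:1024, :]):.3f} ms")
print(f"columns (strided)   : {time_copy(x[:, :1024]):.3f} ms")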

On older GPUs the warp had to wait for the copy to complete. Ampere added cp.async plus non-blocking arrive/wait barriers, so a warp can initiate a copy from global to shared memory and immediately continue with arithmetic. Hopper adds the Tensor Memory Accelerator: with TMA, a single thread in the CTA can describe a multidimensional block to copy and the TMA hardware streams it into shared memory while the threads do something else. Blackwell goes one step further and can multicast a single TMA load into every SM of a cluster, which is helpful when multiple CTAs are about to reuse the same B tile.

In practice you hide latency by organizing the main loop so that it double-buffers: while the tensor cores work on tile k, the TMA or cp.async engine is fetching tile k + 1 into the other half of shared memory; then you swap buffers and repeat. As long as copy time and compute time overlap well, the tensor cores and the copy engines stay saturated.
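
In Triton you rarely write this pipeline by hand; the compiler builds it when num_stages > 1. The sketch below follows the standard Triton matmul tutorial in spirit (the block sizes, num_stages and num_warps are illustrative assumptions): the loop body just says “load, then dot”, and Triton pipelines the loads ahead of the tl.dot using asynchronous copy instructions such as cp.async where the hardware supports them.

import torch
import triton
import triton.language as tl

@triton.jit
def simple_matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                         stride_am, stride_ak, stride_bk, stride_bn,
                         stride_cm, stride_cn,
                         BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
                         BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        # With num_stages > 1 the compiler prefetches the next A/B tiles
        # into shared memory while the tensor cores work on this tl.dot.
        a = tl.load(a_ptrs, mask=(offs_m[:, None] < M) & (offs_k[None, :] + k < K), other=0.0)
        b = tl.load(b_ptrs, mask=(offs_k[:, None] + k < K) & (offs_n[None, :] < N), other=0.0)
        acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc, mask=(offs_m[:, None] < M) & (offs_n[None, :] < N))

def matmul(a, b):
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float32)
    grid = (triton.cdiv(M, 128), triton.cdiv(N, 128))
    simple_matmul_kernel[grid](a, b, c, M, N, K,
                               a.stride(0), a.stride(1),
                               b.stride(0), b.stride(1),
                               c.stride(0), c.stride(1),
                               BLOCK_M=128, BLOCK_N=128, BLOCK_K=32,
                               num_stages=3, num_warps=8)
    return c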

Choosing the right tile size

Choosing the right tile size (often expressed in Triton as BLOCK_M × BLOCK_N) is a balance between all of these constraints: enough threads to issue a warp-group MMA, tiles small enough that the matrix edges aren’t mostly padding, enough shared-memory space to double-buffer, and a grid size that fills whole waves or is run via a persistent kernel. Autotuning in Triton or CUTLASS can empirically test different options on the hardware, but it helps to have the right mental model about which sets of sizes they should consider. One good clue that you’re missing an option is a sudden drop in achieved TFLOP/s for a particular shape.
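
For reference, handing that search space to Triton looks roughly like this (the configs are illustrative guesses, not recommendations, and the commented decorator would sit on a kernel like the GEMM sketched above):

import triton

# Each candidate should already respect the constraints above: enough warps
# for the MMA shape, tiles small enough for ragged edges, room to pipeline.
configs = [
    triton.Config({"BLOCK_M": 128, "BLOCK_N": 128, "BLOCK_K": 32}, num_warps=8, num_stages=3),
    triton.Config({"BLOCK_M": 128, "BLOCK_N": 64,  "BLOCK_K": 32}, num_warps=4, num_stages=4),
    triton.Config({"BLOCK_M": 64,  "BLOCK_N": 64,  "BLOCK_K": 64}, num_warps=4, num_stages=4),
]

# @triton.autotune(configs=configs, key=["M", "N", "K"])
# @triton.jit
# def simple_matmul_kernel(...): ...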

AMD

AMD’s MI300X hardware takes a somewhat different route. The GPU is divided into chiplets, each with its own compute units and multiple schedulers that schedule wavefronts (AMD’s term for warps, 64 threads rather than 32) independently, so the hardware load-balances multiple kernels by itself. Matrix instructions run at the wavefront level; there is no cross-CU equivalent to WGMMA. Latency hiding relies on launching a large grid of workgroups and letting the hardware interleave them, rather than on explicitly scheduling async copies. On AMD the guidance is to mostly focus on high occupancy and coalesced memory access, whereas on NVIDIA there is value in crafting the copy–compute pipeline by hand or with the compiler.
