There are a lot of things folks do on GPUs (including, sometimes, graphics), so I have an approximately-correct taxonomy of operations to group them into (a quick sketch of each follows the list):
1. Dense compute: A matmul or a convolution.
2. Map: Elementwise/pointwise work on each value of a tensor.
3. Reduce: Process a tensor into fewer dimensions, like a sum.
4. Transforms: Change data structure. Easy ones like transpose, annoying ones like scatter/gather.
5. Synchronize / Communicate: Move data, or wait for it (copies, collectives, fences/barriers).
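To make the buckets concrete, here's a minimal PyTorch sketch (the tensor shapes and names are just for illustration), one line per category:

```python
import torch

x = torch.randn(64, 128)
w = torch.randn(128, 128)

y = x @ w                      # 1. dense compute: a matmul
y = torch.relu(y)              # 2. map: an elementwise op on every value
s = y.sum(dim=-1)              # 3. reduce: collapse a dimension
t = y.t().contiguous()         # 4. transform: transpose changes the layout
if torch.cuda.is_available():
    y = y.to("cuda")           # 5. synchronize/communicate: host-to-device copy
```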
At the moment people are pouring billions of dollars into hardware that primarily does 1. And, at the same time, many of the greatest minds of our generation are attempting to ensure that the hardware spends as much time doing 1 as possible.
The biggest barrier to doing a lot of dense compute is 5: moving things in and out of memory. Launching kernels, transferring data between host and device (or between devices), moving data between global memory and registers, and so on. It’s like there’s a box defined by Data × Footprint[1] × Time, and everyone is trying to keep it as full as possible.
This involves tradeoffs. You want to mul as many mats as you can, but you only have so much room to store accumulators. Fetching new data from memory also takes a while. You can keep many in-flight fetches around, but each one expands the kernel Footprint, lowering occupancy.
There are 3 tricks that we can use to help fill up the box by stitching different operations together:
- Epilogue fusion: take an elementwise op and fuse it onto the end of a dense op, so that when the MMA produces output, the elementwise op can be run while the output data is still in registers. A classic example: fuse the activation after the dense compute in a feed-forward net.
- Vertical fusion: take two subsequent operations and chain them together to avoid running a loop for one, writing it back, then running a loop for the other[2]. A classic example is Fused LayerNorm: normally you’d add the residual in one pass, then make another pass over the result to collect the mean and variance for the normalization. You can fuse the two to collect the stats as you add the residual (sketched just after this list).
- Horizontal fusion: doing different things over the same data, in parallel. The Q, K, and V projections in a transformer all need the exact same input, so they are good candidates to fuse horizontally.
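Here is a plain-NumPy sketch of that vertical-fusion idea (the function name and shapes are mine, and a real kernel would keep the tile in registers or shared memory rather than looping in Python): the residual add and the statistics collection happen in the same sweep over the hidden dimension, so the intermediate sum is never re-read from memory just to compute its mean and variance.

```python
import numpy as np

def fused_add_layernorm(x, residual, eps=1e-5):
    """Add a residual and LayerNorm the result, gathering stats during the add."""
    h = x.shape[-1]
    out = np.empty_like(x)
    s = np.zeros(x.shape[:-1])      # running sum
    sq = np.zeros(x.shape[:-1])     # running sum of squares
    for i in range(h):              # one sweep: write the sum AND collect stats
        y = x[..., i] + residual[..., i]
        out[..., i] = y
        s += y
        sq += y * y
    mean = s / h
    var = sq / h - mean ** 2
    return (out - mean[..., None]) / np.sqrt(var[..., None] + eps)

# Matches add-then-LayerNorm done as two separate passes:
x, r = np.random.randn(8, 512), np.random.randn(8, 512)
two_pass = x + r
two_pass = (two_pass - two_pass.mean(-1, keepdims=True)) / np.sqrt(two_pass.var(-1, keepdims=True) + 1e-5)
assert np.allclose(fused_add_layernorm(x, r), two_pass)
```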
You rely on the design of the hardware to enable some of this. For example, an epilogue fusion is beneficial because it’s one kernel launch instead of two, and because the work doesn’t need to be written back to global memory, but also because the epilogue can overlap with other work.
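As a concrete example, here's a minimal Triton sketch of a matmul with a fused ReLU epilogue (assuming Triton and a CUDA device are available, and assuming M, N, K are multiples of the block sizes so the boundary masking can be dropped): the activation is a single extra op on the accumulator before the store, rather than a second kernel that reads C back from global memory.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def matmul_relu_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                       stride_am, stride_ak, stride_bk, stride_bn,
                       stride_cm, stride_cn,
                       BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, tl.cdiv(K, BLOCK_K)):
        a = tl.load(a_ptrs)               # one tile of A
        b = tl.load(b_ptrs)               # one tile of B
        acc += tl.dot(a, b)               # MMA into the register accumulator
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk
    # The fused epilogue: the activation runs while `acc` is still in registers.
    acc = tl.maximum(acc, 0.0)
    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc)

# Launch sketch: one program per (BLOCK_M x BLOCK_N) output tile.
if torch.cuda.is_available():
    M = N = K = 256
    a = torch.randn(M, K, device="cuda", dtype=torch.float16)
    b = torch.randn(K, N, device="cuda", dtype=torch.float16)
    c = torch.empty(M, N, device="cuda", dtype=torch.float32)
    grid = (M // 64, N // 64)
    matmul_relu_kernel[grid](a, b, c, M, N, K,
                             a.stride(0), a.stride(1),
                             b.stride(0), b.stride(1),
                             c.stride(0), c.stride(1),
                             BLOCK_M=64, BLOCK_N=64, BLOCK_K=32)
    assert torch.allclose(c, torch.relu(a.float() @ b.float()), atol=1e-2, rtol=1e-2)
```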
It’s not always obvious how to put these fusions together. Flash Attention was such a breakthrough because it made this kind of fusion possible for attention. The naive attention block has a softmax in the middle: Softmax(QK^T / √d) · V. That softmax is a reduction op, which means it needs all of QK^T, a pretty large matrix, to be computed first. Tri Dao and colleagues realized that if you used online softmax you could calculate the softmax over subsets of the QK^T matrix and avoid materializing the whole thing. That let them fuse the QK^T matmul, the softmax, and the multiplication by V into a single kernel, working at the tile level.
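Here's a toy NumPy version of the online-softmax trick (the function name and chunk size are mine, and the scores are precomputed here only for clarity; the fused kernel produces each QK^T tile on the fly and never materializes the full matrix): keep a running max and a running denominator, and rescale everything accumulated so far whenever the max changes.

```python
import numpy as np

def online_softmax_times_v(scores, v, chunk=64):
    """Compute softmax(scores, axis=-1) @ v one chunk of columns at a time."""
    n = scores.shape[-1]
    running_max = np.full(scores.shape[:-1], -np.inf)   # max seen so far
    denom = np.zeros(scores.shape[:-1])                 # running sum of exp
    out = np.zeros(scores.shape[:-1] + (v.shape[-1],))  # running unnormalized output
    for start in range(0, n, chunk):
        s = scores[..., start:start + chunk]            # one tile of QK^T scores
        new_max = np.maximum(running_max, s.max(axis=-1))
        rescale = np.exp(running_max - new_max)         # fix up what was already summed
        p = np.exp(s - new_max[..., None])
        denom = denom * rescale + p.sum(axis=-1)
        out = out * rescale[..., None] + p @ v[start:start + chunk]
        running_max = new_max
    return out / denom[..., None]

# Matches the naive "materialize everything" version:
q, k, v = np.random.randn(4, 32), np.random.randn(256, 32), np.random.randn(256, 8)
scores = q @ k.T / np.sqrt(32)
p = np.exp(scores - scores.max(-1, keepdims=True))
naive = (p / p.sum(-1, keepdims=True)) @ v
assert np.allclose(online_softmax_times_v(scores, v), naive)
```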
A tile is the subsection of a matrix you’re working on at any given time. In a matmul, tiles from both input matrices are loaded and multiplied to produce an output tile. There’s a useful image of this in the Nvidia blog post on cuTile, Nvidia’s most recent entrant into the kernel-development landscape. To side-step concerns of plagiarism, I had nanobanana plagiarize it for me:

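In code, a tiled matmul looks something like this NumPy sketch (tile size and shapes are arbitrary, and assumed to divide evenly):

```python
import numpy as np

T = 64                                    # tile size (assumes every dim divides evenly)
A = np.random.randn(256, 128)
B = np.random.randn(128, 256)
C = np.zeros((256, 256))

for i in range(0, 256, T):                # which output tile we're producing
    for j in range(0, 256, T):
        acc = np.zeros((T, T))            # the accumulator that stays "in registers"
        for k in range(0, 128, T):        # walk along K, one tile from each input
            acc += A[i:i+T, k:k+T] @ B[k:k+T, j:j+T]
        C[i:i+T, j:j+T] = acc             # one write-back per output tile

assert np.allclose(C, A @ B)
```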
cuTile is built on a well-specified intermediate representation called TileIR. There’s an experimental backend for Triton that lowers to TileIR too. While Triton is block-oriented rather than tile-oriented, in practice what you mostly work on in a thread-block is… a tile. TileIR elevates the tile to a first-class concept.
You can see this by generating the same kernel against the regular backend and the TileIR backend. Triton’s intermediate representation (TTIR) uses pointer arithmetic: generating offsets, computing masks, loading from explicit addresses. Here’s a bit of an inner loop of a matmul. It groups up which data it wants, loads the tiles a and b by pointer, and computes the dot product:
```
%offs_m = tt.make_range {end = 128, start = 0} : tensor<128xi32>
%a_ptrs = tt.expand_dims %offs_m {axis = 1} : tensor<128xi32> -> tensor<128x1xi32>
%a_ptrs_1 = tt.splat %stride_am : i32 -> tensor<128x1xi32>
%a_ptrs_2 = arith.muli %a_ptrs, %a_ptrs_1 : tensor<128x1xi32>
...
scf.for %k = %c0 to %num_k step %c1 iter_args(%acc = %zero) -> tensor<128x128xf32>:
  %a = tt.load %a_ptrs, %mask : tensor<128x64x!tt.ptr<f16>>
  %b = tt.load %b_ptrs, %mask : tensor<64x128x!tt.ptr<f16>>
  %acc_new = tt.dot %a, %b, %acc : tensor<128x64xf16> * tensor<64x128xf16> -> tensor<128x128xf32>
```
TileIR on the other hand preserves the tile as a semantic object. This snippet is doing exactly the same thing, but the representation elides the pointer math and masking:
```
$33: Tile[float32,(128,128)] = typed_const(value=0)
accumulator.2: Tile[float32,(128,128)] = for k.1 in $36 (with accumulator.0 = $33) do
  $50: Tile[float16,(128,64)] = tile_load_token_ordered(array=A, index=($7, k.1), shape=(128,64))
  $64: Tile[float16,(64,128)] = tile_load_token_ordered(array=B, index=(k.1, $10), shape=(64,128))
  $72: Tile[float32,(128,128)] = tile_mma(x=$50, y=$64, acc=accumulator.0)
  continue $72
```
This is a nice IR (compact!), but from my perspective the most interesting part is that load function: tile_load_token_ordered.
The “time” dimension of the Data × Footprint × Time box is the hardest one to manage. Time questions separate performant kernels from slow ones: when to prefetch, how to overlap loads, and so on. Since the advent of warp specialization, the Triton compiler has been exploring pipelining options through heuristics and autotuning, and kernel engineers have been going straight to the hardware with explicit barrier API extensions like TLX[3] and Gluon[4].
TileIR goes a somewhat different route. It assumes an unordered memory model: the order your code is written in does not determine when data is actually available. Instead, each memory operation returns a token and you attach semantics to it: read-only, write-only, read-write and so on.
By being explicit about memory dependencies you give the compiler freedom to manage the Time dimension. Where accesses don’t overlap the compiler can freely reorder them. Where they do, the token chain tells the compiler exactly what depends on what. The kernel expresses intent; the compiler maps that to the hardware.
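To be clear, this is not TileIR’s actual API, just a toy Python sketch of the idea: each op carries explicit dependencies (the “tokens”), and a scheduler is free to emit anything whose dependencies are satisfied, regardless of the order the program was written in.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)
class Op:
    name: str
    deps: list = field(default_factory=list)   # ops this one must wait for ("tokens")

def schedule(ops):
    """Emit ops greedily: anything whose dependencies have already been emitted may
    run, so independent memory operations can be freely reordered or overlapped."""
    emitted, order = set(), []
    while len(order) < len(ops):
        for op in ops:
            if op not in emitted and all(dep in emitted for dep in op.deps):
                emitted.add(op)
                order.append(op)
    return [op.name for op in order]

load_a = Op("load A tile")                    # no token: free to move anywhere
load_b = Op("load B tile")                    # no token: can overlap with load_a
mma    = Op("mma", deps=[load_a, load_b])     # token chain: needs both tiles
store  = Op("store C tile", deps=[mma])       # token chain: needs the accumulator

# Program order puts the store first; the token chain, not source order, decides.
print(schedule([store, mma, load_b, load_a]))
# ['load B tile', 'load A tile', 'mma', 'store C tile']
```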
TileIR is (mostly) targeting Blackwell right now, and the experimental backend is still early. The open question is whether we can express this smoothly enough in the syntax of kernels to actually enable taking the same kernel across hardware, or whether we are just adding syntactic sugar that gets bypassed as soon as hardware-specific tuning starts.
That said, the idea feels pretty right? The tile is the unit with which we can express what we actually mean about memory, ordering, and fusion. The CUDA programming model was always about bounded-linearity within a massively parallel framework, and this loosens the bounds that little bit more.