Linear Layouts in Triton

[2505.23819] Linear Layouts: Robust Code Generation of Efficient Tensor Computation

Paper from the Triton folks at OpenAI on their solution to the layouts/data movement problem. Data often needs to be laid out in a specific way to maximize performance on a GPU. This includes the operand layouts certain instructions require, and also the avoidance of bank conflicts in shared memory. You might have data stored nicely in global memory, need to permute it on load, then permute it again for execution.

Part of the appeal of CuTe is that it expresses these layouts directly and provides a relatively simple algebra to transform them between these domains. This works, but the Triton approach is to try to hide this kind of complexity, particularly hardware-specific complexity, in the compiler.

While both CUTE and linear layouts aim to address the challenge of flexible task mapping on emerging architectures, they differ in several key aspects. First and foremost, CUTE is primarily designed for users to manually describe layouts, whereas linear layouts are integrated into a compiler. Second, the linear algebra framework of linear layouts enables compilers to generate efficient code for layout conversion and code lowering for many common operators, which is absent in CUTE. Third, swizzling is inherently defined within linear layouts, whereas in CUTE, it is treated as a separate step.

The clever insight is that you can represent any of these layouts as a binary matrix over F₂, which means arithmetic is just XOR (addition) and AND (multiplication). You can compose these binary matrices freely, and it’s also easy to swap in a new transform matrix for hardware that requires a different permutation.
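To make that concrete, here is a minimal Python sketch (my own, not Triton’s implementation) of a layout as an F₂ matrix: store the matrix as one integer bitmask per input bit (its columns), apply it by XORing together the columns selected by the input’s set bits, and compose two maps by feeding one matrix’s columns through the other.

```python
def apply(cols, x):
    """Multiply an F2 matrix (given as column bitmasks) by bit-vector x."""
    out = 0
    for i, col in enumerate(cols):
        if (x >> i) & 1:   # input bit i is set...
            out ^= col     # ...so XOR in column i (addition over F2)
    return out

def compose(b_cols, a_cols):
    """Columns of B*A over F2: feed each column of A through B."""
    return [apply(b_cols, col) for col in a_cols]

identity3 = [0b001, 0b010, 0b100]   # identity on 3 bits
swap02 = [0b100, 0b010, 0b001]      # permutation swapping bits 0 and 2

assert apply(identity3, 0b101) == 0b101
assert apply(swap02, 0b001) == 0b100
assert compose(swap02, swap02) == identity3  # swapping twice = identity
```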

To give a step-by-step example (as I’m not totally sure how well I grok this myself!), let’s say we are working on an MMA for a 16×8 tile:

  • We start with our data, say in row-major order: (0,0), (0,1), …, (0,7), (1,0), …. Each value is stored in its own register
  • We have 32 threads (one warp), each managing its own section of the tile: in this case 4 registers
  • So each value has a hardware location: a thread (0..31) and a register (0..3). You can imagine this as 7 bits: a thread ID (5 bits) and a register ID (2 bits)
  • Equivalently, we can imagine tracking the tensor location of each value: 4 bits for rows 0..15 and 3 bits for columns 0..7
  • We can have a map that translates between tensor location and hardware location: tensor location row 1, col 0 lives in thread 2, register 0. This map is a 7×7 binary matrix (sketched in code after this list)
  • We can define a matrix that transforms this hardware mapping into the one needed for our ldmatrix/tensor core call.
  • For example, we might need thread 0 to manage tensor values (0,0), (4,0), (8,0), (12,0)
  • If the mapping moves a value to a different register within the same thread, we can use a prmt (byte permute) instruction
  • If the mapping moves values between threads’ registers, we can use a warp shuffle like shfl.sync, which exchanges register values between threads without going through shared memory¹
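Here is the example above in the same sketch style. The bit encodings (column bits low, row bits high; register bits low, thread bits high) are my own choice, and the MMA-style layout is the hypothetical one from the list, not a real PTX fragment layout.

```python
def apply(cols, x):  # F2 matrix-vector product, as in the earlier sketch
    out = 0
    for i, col in enumerate(cols):
        if (x >> i) & 1:
            out ^= col
    return out

def tensor_bits(row, col):
    return (row << 3) | col        # col bits 0..2, row bits 3..6

def hw_loc(bits):
    return bits >> 2, bits & 0b11  # (thread, register)

# Row-major order dealt out 4 values per thread: the tensor-to-hardware
# map is just the identity matrix (unit-vector columns).
row_major = [1 << i for i in range(7)]
assert hw_loc(apply(row_major, tensor_bits(1, 0))) == (2, 0)  # thread 2, reg 0

# A hypothetical MMA-style layout where thread 0 holds (0,0), (4,0),
# (8,0), (12,0): row bits 2..3 now select the register, while the
# column bits and low row bits select the thread.
mma = [
    0b0000100, 0b0001000, 0b0010000,  # col bits 0..2 -> thread bits 0..2
    0b0100000, 0b1000000,             # row bits 0..1 -> thread bits 3..4
    0b0000001, 0b0000010,             # row bits 2..3 -> register bits 0..1
]
for row, reg in [(0, 0), (4, 1), (8, 2), (12, 3)]:
    assert hw_loc(apply(mma, tensor_bits(row, 0))) == (0, reg)
```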

Triton has layouts for standard block-level storage, and for MMAs and other operations. By multiplying through the required mappings, it can automatically work out how best to optimize data movement, versus the manual transforms you write in CuTe!
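As a rough illustration of that “multiplying through” step, reusing the helpers and the two layouts from the sketches above: invert the source layout over F₂ (Gaussian elimination), compose it with the target, and inspect the resulting conversion matrix. The lane-crossing test at the end is my own simplification of how a compiler might choose between register permutes, shuffles, and shared memory.

```python
def inverse(cols):
    """Invert an invertible n x n F2 matrix by Gaussian elimination."""
    n = len(cols)
    # Pack each row of the augmented matrix [M | I] into one integer:
    # low n bits hold the row of M, high n bits the identity row.
    rows = [sum(((cols[c] >> r) & 1) << c for c in range(n)) | (1 << (n + r))
            for r in range(n)]
    for c in range(n):
        p = next(r for r in range(c, n) if (rows[r] >> c) & 1)  # pivot row
        rows[c], rows[p] = rows[p], rows[c]
        for r in range(n):
            if r != c and (rows[r] >> c) & 1:
                rows[r] ^= rows[c]   # row addition over F2 is XOR
    # Unpack the right half back into column bitmasks.
    return [sum((((rows[r] >> n) >> c) & 1) << r for r in range(n))
            for c in range(n)]

# The hardware-to-hardware conversion taking row-major locations to
# MMA locations: the target map composed with the inverse of the source.
convert = compose(mma, inverse(row_major))

# If any output thread bit depends on anything other than the same
# input thread bit, values must cross lanes (shuffle, or shared memory
# in the worst case); otherwise register moves/prmt suffice.
THREAD_MASK = 0b1111100   # thread bits 2..6 of a hardware location
crosses_lanes = any((convert[i] & THREAD_MASK) != ((1 << i) & THREAD_MASK)
                    for i in range(7))
print(crosses_lanes)  # True: this conversion moves values across threads
```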

It also has versions of these mappings for different hardware, so for many operations only the layouts need to be swapped out when moving from Ampere to Hopper or Blackwell!

  1. Mostly: if there would be bank conflicts, it will spill to shared memory. ↩︎
