Cutie Fly

Some geometries transforming!

The FlashAttention 4 paper is out and is fascinating, you should read it! One of the things that Tri called out on Twitter was that the experience of using a Python-based language (CuteDSL) significantly improved the dev loop, not just for him, but for Claude:

CuTe’s layout algebra plus the quick iteration cycle of a Python DSL are a nice combination. Hence, it’s not too surprising that late last month, AMD dropped FlyDSL, which is, largely, CuteDSL for AMD. This is not a knock on FlyDSL! The project is very open about acknowledging CuTe and its provenance.

To help navigate, here is a handy translation guide:

  • CuteDSL: cute.make_layout.
    FlyDSL: flir.make_layout.
  • CuteDSL: cute.composition.
    FlyDSL: flir.composition.
  • CuteDSL: cute.zipped_divide.
    FlyDSL: flir.zipped_divide.
  • CuteDSL: cute.make_tiled_copy_tv.
    FlyDSL: flir.make_tiled_copy_tv.
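
To make the shared semantics concrete, here is a minimal pure-Python sketch of what a flat make_layout denotes in either DSL: a (shape, stride) pair that maps a 1-D index to an offset by unpacking the index colexicographically and dotting with the strides. This is my own illustration of the layout model, not the actual cute or flir implementation.

```python
def layout_fn(shape, stride):
    """Map a flat index to a memory offset, CuTe-layout style.

    The index is unpacked colexicographically (leftmost mode varies
    fastest) into a coordinate, which is then dotted with the strides.
    """
    def f(i):
        offset = 0
        for extent, step in zip(shape, stride):
            offset += (i % extent) * step
            i //= extent
        return offset
    return f

# A 4x2 layout with strides (2, 1): logically column-major iteration
# over a row-major 4x2 block of memory.
f = layout_fn((4, 2), (2, 1))
offsets = [f(i) for i in range(8)]  # [0, 2, 4, 6, 1, 3, 5, 7]
```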

FlyDSL also calls out Colfax’s paper from earlier this year: Categorical Foundations for CuTe Layouts. This paper, along with the Integer Set Relations one from Nvidia last year, really started to establish a mathematical formalization of what had been going on in CuTe layouts. This kind of foundation enables verifying the approaches taken in fresh implementations, like FlyDSL’s.

We can actually go see that, as the whole compiler is open source. This lets you compare the composition_impl in FlyDSL to the diagrammatic version in section 4.1.3 of the Colfax paper to understand why it works!1
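
Functionally, composition is just what it sounds like: (A ∘ B)(i) = A(B(i)), with B’s output read as an index into A’s domain. Here is a toy sketch of that (the clever part of a real composition_impl, which this deliberately skips, is computing a closed-form shape/stride for the result rather than chaining functions):

```python
def layout_fn(shape, stride):
    # flat index -> offset: unpack colexicographically, dot with strides
    def f(i):
        offset = 0
        for extent, step in zip(shape, stride):
            offset += (i % extent) * step
            i //= extent
        return offset
    return f

def compose(a, b):
    # (a o b)(i) = a(b(i)): b's output is read as an index into a's domain
    return lambda i: a(b(i))

# Select every other element of a 12-element vector by composing with
# a "gather" layout of shape (6,) and stride (2,).
vec = layout_fn((12,), (1,))
every_other = compose(vec, layout_fn((6,), (2,)))
picks = [every_other(i) for i in range(6)]  # [0, 2, 4, 6, 8, 10]
```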

Given the blistering pace of layout algebra, we shouldn’t be surprised that just a few days later, Cris Cecka of Nvidia dropped a beastly preprint: CuTe Layout Representation and Algebra:

Colfax Research [19] analyzes CUTE layouts and some operations on them in the context of category theory. In this paper, we intend to provide a more definitive and formal treatment of CUTE concepts and their applications.

Sometimes with this kind of thing it doesn’t matter who has the idea first; it’s often a specific implementation of that idea that ends up defining the standard. I read this paper as Cecka planting his flag and saying “y’all, this is the real CuTe”. And he cuts no corners.2

Cecka reframes layout algebra as a theory of loop transformations, showing that the objects you are transforming (Shapes and Strides) and the things you are transforming them with (Shapes and Strides) are the same.

One of the cleverest results of this is in Section 2.3.1. Cecka demonstrates that strides don’t have to be just… regular strides. If your stride is in fact a coordinate then each “step” in the stride moves in the coordinate dimension.
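
As a toy illustration of the Section 2.3.1 idea (my own sketch, not Cecka’s code): let each stride be a coordinate vector instead of a scalar, so a step in a mode adds that vector, and the layout maps a flat index to a logical coordinate rather than an address.

```python
def coord_layout_fn(shape, stride):
    """Like an ordinary layout, but each stride is a coordinate vector.

    Stepping in a mode adds multiples of that vector, so the layout
    returns a logical coordinate instead of a scalar offset.
    """
    ndim = len(stride[0])
    def f(i):
        coord = [0] * ndim
        for extent, step in zip(shape, stride):
            k = i % extent
            for j in range(ndim):
                coord[j] += k * step[j]
            i //= extent
        return tuple(coord)
    return f

# A 4x2 iteration space whose strides are the unit coordinate vectors:
# flat index i lands on logical position (i % 4, i // 4).
f = coord_layout_fn((4, 2), ((1, 0), (0, 1)))
pos = f(5)  # (1, 1)
```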

This is, for example, what you need for TMA on Hopper or Blackwell: you tell it the logical position in the tensor and it figures out the physical address internally, handling tiling, swizzling and bank conflict avoidance in hardware. If you stride over coordinates, you can use exactly the same layout algebra as for your computations.

Another example: if a Stride is a bitmask, you get something very like Triton’s LinearLayouts!3 That lets you compose layouts with swizzling, using the same composition operators again.
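
In bit-level terms (my sketch of the Linear Layouts idea, not Triton’s implementation): each input bit owns an output bitmask, and contributions combine with XOR, which is exactly addition in F₂. Swizzles then fall out as bases that mix bits:

```python
def linear_layout_fn(bases):
    """bases[b] is the output bitmask contributed by input bit b.

    Contributions combine with XOR -- addition over F2 -- which is the
    Linear Layouts trick: ordinary power-of-2 strided layouts and
    swizzles become the same kind of object.
    """
    def f(i):
        out = 0
        for b, basis in enumerate(bases):
            if (i >> b) & 1:
                out ^= basis
        return out
    return f

# Identity on 3 bits: each bit maps to itself.
ident = linear_layout_fn([1, 2, 4])

# A toy swizzle: the high bit also flips the low bit, permuting
# elements within the upper half (a bank-conflict-avoidance flavor).
swiz = linear_layout_fn([1, 2, 4 ^ 1])
swizzled = [swiz(i) for i in range(8)]  # [0, 1, 2, 3, 5, 4, 7, 6]
```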

The paper is full of these interesting, but also practical, results. Cecka gives guidance on optimizations like ‘avoid ranged slicing’ (e.g. a[2:4, 1:3]), as it mixes up tile size (an optimization knob) and thread ID (a runtime index)4, or using layouts to algebraically work out how to auto-vectorize loads and stores rather than hard-coding them5.
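
To get a feel for the vectorization point, here is a toy analogue I wrote (not CuTe’s actual max_common_vector, which reasons about the layouts algebraically rather than by probing them): the widest safe vector is the longest run both source and destination layouts agree is unit-stride.

```python
def layout_fn(shape, stride):
    # flat index -> offset: unpack colexicographically, dot with strides
    def f(i):
        offset = 0
        for extent, step in zip(shape, stride):
            offset += (i % extent) * step
            i //= extent
        return offset
    return f

def contiguous_run(f, size):
    # longest prefix on which f is unit-stride
    n = 1
    while n < size and f(n) == f(n - 1) + 1:
        n += 1
    return n

def max_common_vector(src, dst, size):
    # toy analogue: the widest vector both layouts agree is contiguous,
    # i.e. the largest chunk a vectorized copy could move at once
    return min(contiguous_run(src, size), contiguous_run(dst, size))

ident = layout_fn((16,), (1,))
transposed = layout_fn((4, 4), (4, 1))  # transposed view of a 4x4 tile
width = max_common_vector(ident, transposed, 16)  # 1: transposes kill vectorization
```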

There is something satisfying about paper on composition that itself pulls together ideas floating around CUTLASS internals, preprints, and alternative implementations, then shows they are all views of the same object. This will help projects like FlyDSL, Triton, and any number of other authoring libraries ground their management of one of the most painful aspects of kernel dev in a way that should make life easier, for everyone.

  1. I think! My understanding of category theory is similar to my understanding of Skibidi Toilet: I get the idea, but I have so many questions. ↩︎
  2. As an example, Cecka provides a wider generalization than the Colfax paper, demonstrating that CuTe layouts are not strictly closed under group composition: you can’t always compose layouts however you want. But! The failures correspond to real errors, which is the kind of restriction you actually want. ↩︎
  3. Actually, you do a bit better: being strictly in F₂ means Linear Layouts are limited to powers of 2, which it turns out is a bit limiting. ↩︎
  4. This makes it harder for compilers to separate static and dynamic elements. CuTe, and Fly, do this in two stages: zipped_divide to tile, then slice by a dynamic bid, allowing the compiler to optimize (e.g. constant-fold) the static tile parameter. ↩︎
  5. By composing the layout with the right-inverse of the other, apparently! Or by calling max_common_vector(src_layout, dst_layout). ↩︎
