Tag: ml

  • Cutie Fly

    The FlashAttention 4 paper is out and is fascinating, you should read it! One of the things that Tri called out on Twitter was that the experience of using a Python-based language (CuteDSL) significantly improved the dev loop, not just for him, but for Claude:

    CuTe’s layout algebra plus the quick iteration cycle of a Python DSL are a nice combination. Hence, it’s not too surprising that late last month, AMD dropped FlyDSL, which is, largely, CuTeDSL for AMD. This is not a knock on FlyDSL! The project is very open about acknowledging CuTe and its provenance.

    To help navigate, here is a handy translation guide:

    • CuTeDSL: cute.make_layout.
      FlyDSL: flir.make_layout.
    • CuTeDSL: cute.composition.
      FlyDSL: flir.composition.
    • CuTeDSL: cute.zipped_divide.
      FlyDSL: flir.zipped_divide.
    • CuTeDSL: cute.make_tiled_copy_tv.
      FlyDSL: flir.make_tiled_copy_tv.

    FlyDSL also calls out Colfax’s paper from earlier this year: Categorical Foundations for CuTe Layouts. This paper, along with the Integer Set Relations one from Nvidia last year, really started to establish a mathematical formalization of what had been going on in CuTe layouts. This kind of foundation enables verifying the approaches taken in fresh implementations, like FlyDSL’s.

    We can actually go and see that, as the whole compiler is open source. This lets you compare the composition_impl in FlyDSL to the diagrammatic version in section 4.1.3 of the Colfax paper to understand why it works!1

    Given the blistering pace of layout algebra, we shouldn’t be surprised that just a few days later, Cris Cecka of Nvidia dropped a beastly preprint: CuTe Layout Representation and Algebra:

    Colfax Research [19] analyzes CUTE layouts and some operations on them in the context of category theory. In this paper, we intend to provide a more definitive and formal treatment of CUTE concepts and their applications.

    Sometimes with this kind of thing it doesn’t matter who had the idea first: it’s often a specific implementation that ends up defining the standard. I read this paper as Cecka planting his flag and saying “y’all, this is the real CuTe”. And he cuts no corners.2

    Cecka reframes layout algebra as a theory of loop transformations, showing that the objects you are transforming (Shapes and Strides) and the things you are transforming them with (Shapes and Strides) are the same.

    One of the cleverest results of this is in Section 2.3.1. Cecka demonstrates that strides don’t have to be just… regular strides. If your stride is in fact a coordinate then each “step” in the stride moves in the coordinate dimension.
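
    To make that concrete, here is a toy pure-Python model (my own sketch, not CuTe’s actual implementation): a layout is a (shape, stride) pair mapping a flat index to an offset, and nothing stops the strides from being coordinates themselves.

```python
def layout_map(shape, stride, idx):
    """Toy CuTe-style layout: unpack the flat index into a
    column-major coordinate, then dot it with the strides."""
    offset = 0
    for s, d in zip(shape, stride):
        offset += (idx % s) * d
        idx //= s
    return offset

# A (4, 8) column-major layout with strides (1, 4):
assert [layout_map((4, 8), (1, 4), i) for i in range(6)] == [0, 1, 2, 3, 4, 5]

def layout_map_coord(shape, stride, idx):
    """Same machinery, but each stride is itself a coordinate, so a
    "step" moves in coordinate space instead of address space."""
    out = [0] * len(stride[0])
    for s, d in zip(shape, stride):
        c = idx % s
        out = [o + c * dk for o, dk in zip(out, d)]
        idx //= s
    return tuple(out)

# Strides ((1,0), (0,1)): the layout now yields a 2-D coordinate,
# the kind of logical position a TMA descriptor consumes.
assert layout_map_coord((4, 8), ((1, 0), (0, 1)), 9) == (1, 2)
```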

    This is, for example, what you need for TMA on Hopper or Blackwell: you tell it the logical position in the tensor and it figures out the physical address internally, handling tiling, swizzling and bank conflict avoidance in hardware. If you stride over coordinates, you can use exactly the same layout algebra as for your computations.

    Another example: if a Stride is a bitmask, you get something very like Triton’s LinearLayouts!3 That lets you compose layouts with swizzling, using the same composition operators again.
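
    A minimal sketch of that F₂ flavor (my own construction, not Triton’s actual code): each bit of the index selects a basis “stride”, and selections combine with XOR instead of addition.

```python
def xor_layout(strides, idx):
    """Linear-Layout-style map over F2: bit b of the index selects
    basis stride strides[b]; selections combine with XOR."""
    out, bit = 0, 0
    while idx:
        if idx & 1:
            out ^= strides[bit]
        idx >>= 1
        bit += 1
    return out

# A 3-bit swizzle: bit 2 of the input also flips bit 0 of the output,
# the XOR trick used for shared-memory bank-conflict avoidance.
swizzle = [0b001, 0b010, 0b101]
assert [xor_layout(swizzle, i) for i in range(8)] == [0, 1, 2, 3, 5, 4, 7, 6]
```

    Because each such map is linear over F₂, composing two of them yields another map of the same form, which is why swizzles slot into the same composition operators.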

    The paper is full of these interesting, but also practical, results. Cecka gives guidance on optimizations like avoiding ranged slicing (e.g. a[2:4, 1:3]), as it mixes up tile size (an optimization knob) and thread ID (a runtime index)4, or using layouts to algebraically work out how to auto-vectorize loads and stores rather than hard-coding them5.

    There is something satisfying about a paper on composition that itself pulls together ideas floating around CUTLASS internals, preprints, and alternative implementations, then shows they are all views of the same object. This will help projects like FlyDSL, Triton, and any number of other authoring libraries ground their management of one of the most painful aspects of kernel dev, in a way that should make life easier for everyone.

    1. I think! My understanding of category theory is similar to my understanding of Skibidi Toilet: I get the idea, but I have so many questions. ↩︎
    2. As an example, Cecka provides a wider generalization than the Colfax paper, demonstrating that CuTe layouts are not strictly closed under group composition: you can’t always compose layouts however you want. But! The failures correspond to real errors, which is the kind of restriction you actually want. ↩︎
    3. Actually, you do a bit better: being strictly in F₂ means Linear Layouts are limited to powers of 2, which it turns out is a bit limiting. ↩︎
    4. This makes it harder for compilers to separate static and dynamic elements. CuTe, and Fly, do this in two stages: zipped_divide to tile, then slice by a dynamic bid, allowing the compiler to optimize (e.g. constant-fold) the static tile parameter. ↩︎
    5. By composing the layout with the right-inverse of the other, apparently! Or calling max_common_vector(src_layout, dst_layout). ↩︎
  • Everything MoE

    There are two really good ways to learn the deep fundamentals of a field. One we could call the Carmack/Ilya method: get an expert to give you a list of the seminal papers, systematically work through them, and in the process develop a deep, grounded intuition. This seems to work. The second is: funny tweets.

    A case in point:

    Other than the fact you have to be in a very particular niche in order to understand all the acronyms in that tweet, the idea that everything is an MoE feels right? Pretty much every notable model release, and probably all the secret frontier models, are MoE.

    Like every other idea in deep learning this goes back to something Hinton did in the 90s, specifically the paper Adaptive Mixtures of Local Experts by Jacobs, Jordan, Nowlan and Hinton:

    If backpropagation is used to train a single, multilayer network to perform different subtasks on different occasions, there will generally be strong interference effects that lead to slow learning and poor generalization. If we know in advance that a set of training cases may be naturally divided into subsets that correspond to distinct subtasks, interference can be reduced by using a system composed of several different “expert” networks plus a gating network that decides which of the experts should be used for each training case. […] The idea behind such a system is that the gating network allocates a new case to one or a few experts, and, if the output is incorrect, the weight changes are localized to these experts (and the gating network).

    The idea is that if your data naturally clusters, then having separate networks avoids smearing understanding across the weights. A dataset with both German and English training data might produce a model that mixes up both languages. If we train two different experts and learn a gating network, we can get a clean “German-speaking” model, and a clean “English-speaking” model, in one.

    Also, like every other idea in deep learning, this was very clever, but painful to train. In particular, this was because the decision about which expert to choose was a bit of a cliff. If you choose the German expert when you needed the English expert then the German expert would get some loss, but the English expert would get none. This could lead to the awkward situation where the German expert performed better for both English and German: you ended up with a smaller, smeared model, and a dead expert.

    Noam Shazeer and co came to the rescue in 2017 with the excellently titled “Outrageously Large Neural Networks”. They introduced concepts that didn’t fundamentally change the approach, but did make it practical.

    The key trick was adding an auxiliary loss that penalized the model for using one expert over the others. By adding some noise to the gating decision they helped it be differentiable and ensure errors could flow back effectively. This gave the training process a much better chance of avoiding this kind of “winner-takes-all” collapse.
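
    In numpy-sketch form (illustrative, in the spirit of the paper rather than its exact parameterization), the noisy top-k gate looks something like:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_topk_gate(x, W_g, W_noise, k=2):
    """Shazeer-style noisy top-k gating (a sketch, not the paper's
    exact recipe). Noise perturbs the expert scores so near-ties get
    explored; only the top-k experts receive nonzero weight."""
    softplus = np.log1p(np.exp(x @ W_noise))
    logits = x @ W_g + rng.standard_normal(W_g.shape[1]) * softplus
    topk = np.argsort(logits)[-k:]                 # the k winning experts
    gates = np.zeros_like(logits)
    e = np.exp(logits[topk] - logits[topk].max())  # softmax over winners only
    gates[topk] = e / e.sum()
    return gates

d, n_experts = 8, 4
x = rng.standard_normal(d)
W_g = rng.standard_normal((d, n_experts))
W_noise = rng.standard_normal((d, n_experts))
g = noisy_topk_gate(x, W_g, W_noise)
assert np.isclose(g.sum(), 1.0) and (g > 0).sum() == 2
# A load-balancing auxiliary loss over batch-averaged gate values then
# discourages the winner-takes-all collapse described above.
```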

    Over time these methods were refined. In a contemporary MoE like DeepSeek v3, sigmoid-based routing removed the noise from the gating, and the auxiliary loss is dropped in favor of what they call bias updates: they just put their thumb on the scale during training if some experts aren’t getting enough samples, which seems to work great.
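
    A sketch of that thumb-on-the-scale (my own simplification; the real recipe differs in its details):

```python
import numpy as np

def route_with_bias(scores, bias, k=2):
    """Auxiliary-loss-free routing in the spirit of DeepSeek-V3 (a
    sketch). The bias steers WHICH experts get picked, but does not
    change the mixing weights themselves."""
    chosen = np.argsort(scores + bias)[-k:]
    weights = scores[chosen] / scores[chosen].sum()
    return chosen, weights

def update_bias(bias, counts, gamma=1e-3):
    # Thumb on the scale: nudge up under-used experts, down over-used.
    return bias + gamma * np.sign(counts.mean() - counts)

scores = np.array([0.9, 0.8, 0.1, 0.05])   # per-expert sigmoid affinities
bias = np.zeros(4)
counts = np.array([10, 10, 0, 0])          # experts 2 and 3 are starving
bias = update_bias(bias, counts)
assert bias[2] > 0 > bias[0]               # starved experts get a boost
```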

    All of that is about how we got MoEs to scale, but doesn’t really say… why? Intuitively, if you can train a model with X parameters, it seems like it would be better to have all of them doing something (a dense model), rather than some subset1?

    The main reason this has taken over the field is it is a way of decoupling capacity (how much can the network “know”) from compute (how much work does it do for each input).

    In a dense model when you add a new token to train you send it to all parts of the model: every bit of capacity touches it, each of which uses some compute to process. MoEs are a form of sparsity: a way of ignoring some of the parameters. They let you add capacity without adding compute2.
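
    With some made-up round numbers (not any real model’s config), the decoupling looks like this:

```python
# Illustrative numbers only, not any particular model's configuration.
d_model, d_ff = 4096, 14336
n_experts, top_k = 64, 4

ffn_params = 2 * d_model * d_ff    # one expert's FFN (two projections)
total = n_experts * ffn_params     # capacity: what the model can "know"
active = top_k * ffn_params        # compute: work done per token

# 64x the capacity of a single FFN, for only 4x the per-token compute.
assert total // ffn_params == 64 and active // ffn_params == 4
```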

    There are other ways of achieving the same result, but the MoE approach is very hardware friendly. You’re still mostly doing dense matmuls, just split between experts. In parallelism terms, Expert Parallelism is efficient because you’re only moving tokens between devices, not weights: it needs an all-to-all, but the data volumes are manageable.

    The tweet calls out NSA, Engram and mHC, all recent papers from DeepSeek. But underneath, it calls out the design pattern: make a few alternative compute or memory paths, then use a learned gate to pick (or mix) a subset of them, per token. You get sparsity at the routing level, decoupling formerly coupled aspects, while each path can remain fairly dense and hardware-friendly.

    Engram makes the argument that language models have to do two things: reasoning and looking stuff up. The reasoning works great with stacks of Transformers, but the looking-stuff-up part is approximated through computation rather than just… looking stuff up.

    This process essentially amounts to an expensive runtime reconstruction of a static lookup table, wasting valuable sequential depth on trivial operations that could otherwise be allocated to higher-level reasoning.

    Classically, Natural Language Processing used a lot of N-grams: representations of more than one token at a time, but language models pretty much dropped that in favor of a fixed vocabulary. DeepSeek is bringing it back. These extra embeddings are retrieved for subsets3 of the tokens in the context window, the resulting vectors are summed4, and the model then gates how much of the information to incorporate based on the current state.

    It’s the same move of decoupling compute and capacity. Here they are adding a bunch of extra storage parameters but letting the model learn whether or not to use them. Because the retrieval is based on tokens the table doesn’t have to live in VRAM but can be loaded with the input5 .
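
    In sketch form, the shape of the idea might look like this (every name and detail below is my guess at the mechanism, not the paper’s actual method):

```python
import numpy as np

rng = np.random.default_rng(0)

def ngram_embedding(tokens, table, n=2):
    """Hypothetical Engram-style lookup: hash each n-gram into a big
    static table and sum the retrieved vectors. (Names and details
    here are illustrative guesses, not the paper's method.)"""
    vecs = [table[hash(tuple(tokens[i:i + n])) % len(table)]
            for i in range(len(tokens) - n + 1)]
    return np.sum(vecs, axis=0)   # the paper uses a weighted sum

table = rng.standard_normal((1 << 16, 64))   # static: can live off-VRAM
tokens = [101, 7, 3542, 9]
e = ngram_embedding(tokens, table)

h = rng.standard_normal(64)                  # current hidden state
gate = 1 / (1 + np.exp(-(h @ rng.standard_normal(64))))  # learned scalar gate
mixed = h + gate * e                         # model decides how much to use
assert mixed.shape == (64,)
```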

    The second paper, Manifold-Constrained Hyper-Connections (mHC), is the most math-heavy of the recent releases, and it builds on one of the most cited papers in ML: ResNet.

    In the bad old days, the “Deep” in Deep Neural Nets didn’t really exist: you could theorize, but if you tried to train one you’d get into a place where the early layers received basically no useful loss signal. ResNets fixed this in the simplest way possible: as well as sending on the “output” of a layer, you sent through the input too. This gave an efficient highway for loss gradients to flow back, and enabled successfully training much, much deeper models.

    mHC builds on an observation that ResNets hard-code another compute/capacity tradeoff: the size of the residual channel. If you think of a layer of a transformer: it has an input of C tokens, and an output the same size. The residual connection works by summing the input tokens and the output tokens. That’s assigning as much information capacity to the residual channel as you do to the processing channel. E.g.

    • Layer 0 gets raw tokens, and outputs a sum of raw+contextualized tokens
    • Layer 1 gets layer 0 tokens and outputs a sum of layer0+contextualized tokens
    • Etc.
    • At the end you get a cake recipe

    But maybe that cake recipe would be better if Layer 2 had access not just to the layer0 tokens, but also to the raw tokens? We don’t really have a way to express that outside of adding extra skip connections. Hyper Connections widen the ResNet channel into multiple lanes, and mHC lets the model decide what to put in each: so you could have layer 1 putting layer0 context in one lane, and raw tokens in another lane6 . If MoE lets you take a bunch of parameters and selectively route tokens to a subset, then mHC lets you take a bunch of residual bandwidth and selectively mix the information flow from your module to a subset of it.
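
    A sketch of the multi-lane idea (my simplification: it omits mHC’s constraint and the papers’ exact read/write maps):

```python
import numpy as np

rng = np.random.default_rng(0)

def hyper_layer(H, f, W_mix):
    """Multi-lane residual sketch, simplified from Hyper-Connections.
    H is (n_lanes, d): several residual streams instead of one. A
    learned matrix routes the old lanes plus the layer's new output
    into the new lanes. (mHC adds a Sinkhorn-Knopp-style constraint
    on W_mix to stop this from exploding.)"""
    layer_in = H.mean(axis=0)        # what the layer itself reads
    y = f(layer_in)                  # the layer's new contribution
    candidates = np.vstack([H, y])   # (n_lanes + 1, d): old lanes + output
    return W_mix @ candidates        # (n_lanes, d): learned routing

n_lanes, d = 4, 16
H = rng.standard_normal((n_lanes, d))
W_mix = rng.standard_normal((n_lanes, n_lanes + 1)) * 0.1
H2 = hyper_layer(H, np.tanh, W_mix)
assert H2.shape == (n_lanes, d)

# A plain ResNet is the n_lanes=1 special case with W_mix = [[1, 1]].
```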

    Finally, Native Sparse Attention follows the classic DeepSeek move of throwing a bunch of engineering wins together. Instead of assuming the amount of attention compute for each token is the same, they scale it dynamically based on the content itself. They mix three outputs: a pooled version of the context window that gives a compressed representation, a MoE-style gated selection from the full context window7, and a classic sliding window attention.
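
    The combination step, with the three branch computations elided (names below are mine, not the paper’s):

```python
import numpy as np

rng = np.random.default_rng(0)

def nsa_combine(h, compressed, selected, window, W_gate):
    """Sketch of NSA's final mix: per-token sigmoid gates weight the
    three attention branches. Branch internals are elided."""
    g = 1 / (1 + np.exp(-(h @ W_gate)))   # (3,): one gate per branch
    return g[0] * compressed + g[1] * selected + g[2] * window

d = 16
h = rng.standard_normal(d)
compressed, selected, window = rng.standard_normal((3, d))
out = nsa_combine(h, compressed, selected, window, rng.standard_normal((d, 3)))
assert out.shape == (d,)
```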

    This is the pattern MoE exemplified:

    • look at what is constrained
    • add more of it, but make it conditional to avoid scaling other things at the same time

    It’s a thread that runs through an awful lot of the industry right now. Understanding it is useful when anticipating where things are going to go next.

    Or, you could have saved yourself a lot of time and just liked the tweet.

    1. MoEs do have some inference advantages: if you have a 100bn parameter model where just 20bn are active for a given token, you simply have to do less work than a 100bn param dense model. That’s a win for latency! But you still have to store all 100bn parameters, meaning you need quite a lot of memory kicking around. ↩︎
    2. More specifically, they make the ratio of added capacity to added compute very flexible: modern MoEs often have many experts and activate several at a time. ↩︎
    3. In this case DeepSeek uses 2-grams and 3-grams. ↩︎
    4. A weighted sum, technically. ↩︎
    5. In practice they inject the ngram embeddings at a couple of different points later in the model, where empirically there seemed to be enough context for the model to make useful mixing decisions ↩︎
    6. The specific clever thing the Deepseek folks added was a constraint to stop this from exploding, using the wonderfully named Sinkhorn-Knopp algorithm (apparently) ↩︎
    7. Based on those pooled tokens. Effectively its taking the “summarized” context window, and using runtime gating to decide which bits of the context window to add in full. ↩︎
  • Megacores

    Megacore - Systole as an ’80s metal album cover.

    What we do in machine learning owes a lot to the history of computer graphics. Folks like Kurt Akeley, one of the founders of SGI, identified that 3D graphics have a naturally pipelined structure. You have a high volume of similar operations, such as applying pixel-y soldier textures to a mesh of triangles, and pipelining them opens up the opportunity for a high degree of parallelism.

    Akeley was one of the drivers of OpenGL, which provided a standard interface to that pipeline, and later worked with Nvidia on Cg, a realtime shader language and compiler. Shader languages, as used in Pixar’s RenderMan and other non-realtime 3D use cases, introduced an approach where you could manage lighting programmatically by describing the transforms to each individual element. The shader would be run in parallel across all the geometry or pixels it was addressing.

    With CUDA, Ian Buck and others at Nvidia helped formalize what had been true in the hardware for a while: GPUs were massively parallel processing machines, not just polygon factories. CUDA was part of a move from the supercomputer approach of Single Instruction Multiple Data (SIMD) to Single Instruction Multiple Thread (SIMT). On a Cray or other vector oriented processor you had to pack the work into a vector. CUDA let programmers familiar with CPU threads think in those terms instead. Under the hood, the threads in a warp were executed in lockstep, but they could be masked off to allow for divergence. It was flexible, fast, and attracted the attention of the machine learning community. Because so much of ML is large matmuls, Nvidia bolted on Tensor Cores as specialized co-processors that handled blocks of matrix math efficiently. This combination of performant hardware and flexible software helped make Nvidia the most valuable company in the world, and drive up house prices across the Bay Area.

    But, it transpires, not everyone loved shoveling their margin to Jensen, and they looked for more cost-efficient ways to run ML workloads. The flexibility for threads to branch, pause or switch requires infrastructure and silicon. You need big register files per core, multiple levels of memory cache, and logic to manage swapping warps in and out.

    If you look at the “do the math” parts of a chip, a CPU probably only spends about 10% of silicon on that, with the rest managing the chaos of running an operating system: branch prediction, caching, data movement. A GPU, in contrast, is a wildly efficient machine, with maybe 30-40% of the silicon dedicated to mathing effectively.

    When Google looked at the problem of running inference at their scale back in the dark ages of 2016, they wanted to spend as much of their budget as possible doing the math, to keep the costs as low as they could. The chip they created, the Tensor Processing Unit (TPU), recently hit its 7th iteration, and SemiAnalysis published an extensive breakdown on it: TPU v7 Ironwood, quickly followed up with a deep dive on Amazon’s Trainium v3.

    Trainium3 takes a similar approach to Trainium2 and Google’s TPU and builds the chip out of a small number of large NeuronCores. This contrasts with GPU architectures like Nvidia and AMD’s, which instead uses a large number of smaller tensor cores. Large cores are typically better for GenAI workloads since they have less control overhead.

    Dylan and his team are touting these as the first chips to genuinely threaten Nvidia’s moat. The big frontier labs seem interested, with deals and investigation from Anthropic, OpenAI, Meta and others. As the piece repeatedly points out, if you want to understand the dominance of Nvidia you have to focus on the system, and not the microarchitecture. So, of course, I want to talk exclusively about the microarchitecture here.

    TPU, Trainium, as well as other custom approaches like Meta’s MTIA1 lean on an approach called Systolic Arrays. As a recap: Nvidia’s Streaming Multiprocessors (SMs), AMD’s compute units, and so on are cooperative multiprocessors. They access registers, talk to caches and handle the flow of data. Threads can stall if their data isn’t ready, and the hardware warp schedulers will swap in another piece of work to keep the chip humming.

    Systolic arrays are different. The name comes from systole, the phase where your heart pumps blood. In a systolic array, you load your data once and fire it through a grid of Processing Elements (PEs). Each element maths its math then passes the result to its neighbor on the next clock tick.
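
    Here is a cycle-level toy of a weight-stationary systolic array in numpy; a sketch for intuition, not any real chip’s dataflow:

```python
import numpy as np

def systolic_matmul(A, W):
    """Cycle-level toy of a weight-stationary systolic array computing
    A @ W. Each PE holds one weight; activations march right, partial
    sums march down, one hop per clock tick. No caches, no scheduler:
    if the skewed input isn't there on the right tick, the result is
    simply wrong, which is why the compiler must plan everything."""
    M, K = A.shape
    K2, N = W.shape
    assert K == K2
    a_reg = np.zeros((K, N))   # activation latched in each PE
    p_reg = np.zeros((K, N))   # partial sum latched in each PE
    C = np.zeros((M, N))
    for t in range(M + K + N):
        a_in = np.zeros((K, N))
        a_in[:, 1:] = a_reg[:, :-1]       # activations shift one PE right
        for k in range(K):                # skewed feed at the left edge:
            m = t - k                     # row k sees A[m, k] at tick m + k
            if 0 <= m < M:
                a_in[k, 0] = A[m, k]
        p_in = np.zeros((K, N))
        p_in[1:, :] = p_reg[:-1, :]       # partial sums shift one PE down
        p_reg = p_in + a_in * W           # every PE does one MAC per tick
        a_reg = a_in
        for n in range(N):                # results drain off the bottom row
            m = t - (K - 1) - n
            if 0 <= m < M:
                C[m, n] = p_reg[K - 1, n]
    return C

A = np.arange(12, dtype=float).reshape(3, 4)
W = np.arange(20, dtype=float).reshape(4, 5)
assert np.allclose(systolic_matmul(A, W), A @ W)
```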

    This was very much in line with the needs of the original TPU: load a set of model weights up, then pump user requests through as efficiently as possible. TPUv1 only supported int8: it was a low-bit, high-efficiency matmul machine. The data flow needed to be pre-determined: you set it up and make it go, which made it incredibly silicon efficient. You don’t need lots of caches or schedulers, and in fact the original TPU didn’t have any at all!

    The con, of course, is that you have to get it right! If the data isn’t there to pump in, the whole thing just waits. There is no falling back to another warp, no other threads. Not only that, but because systolic arrays are generally a lot bigger (say 256×256 vs a Tensor Core’s 16×16), you have fewer of them. While an Nvidia GPU might have more than 100 SMs, a Trainium v3 has 8 cores, and a TPU has just 2. Each core is a lot larger, and wasting it gets a lot more expensive.

    Presumably Jeff Dean just programmed these right the first time, but for the rest of Google (and later the world) they spent years building XLA (Accelerated Linear Algebra), a full-graph compiler. In GPU kernel programming the challenge is hiding memory latency and managing register pressure. On a TPU-type approach, there is one massive VMEM that fulfills a similar role to the registers, and no memory hierarchy, but you can’t rely on the hardware to swap between jobs. XLA needs to know exactly how the graph works so that it can schedule the right data at the right time.

    TPUs used a VLIW architecture: Very Long Instruction Words. Rather than a traditional instruction set with diverse instructions, VLIW lets you bundle Very Long packages of instructions into single units (kind of a silicon equivalent of German) which execute operations on each of the different units of the core at the same time. This was introduced in TPU v2, and it’s where the pressure on the compiler really multiplied.

    To draw a GPU analogy, think about something like Relu(AxB+C): you have a graph of operations: AxB -> Result, Result + C -> Result2, Relu(Result2). To optimize that you could use a CUDA graph to compile it into a single dispatch, cutting the CPU/GPU back-and-forth. One step further would be kernel fusion: keep all the intermediate results in registers and write one kernel that avoids the round trips to higher-tier memory. That lets you bundle up even more, but you need even higher confidence in the sizes involved to avoid running out of registers.
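
    To illustrate the fused end state, here is Relu(AxB+C) written as a single “kernel” whose intermediate lives in a local variable (the register analogue); illustrative Python, not how you’d write the real thing:

```python
import numpy as np

def relu_gemm_fused(A, B, C):
    """Relu(A @ B + C) in one pass: the accumulator lives in a local
    variable (a "register") and the intermediates never materialize
    as full arrays in memory."""
    M, K = A.shape
    _, N = B.shape
    out = np.empty((M, N))
    for i in range(M):
        for j in range(N):
            acc = C[i, j]                  # start from the bias term
            for k in range(K):
                acc += A[i, k] * B[k, j]   # matmul inner product
            out[i, j] = max(acc, 0.0)      # epilogue fused into the loop
    return out

rng = np.random.default_rng(0)
A, B, C = (rng.standard_normal(s) for s in ((4, 6), (6, 5), (4, 5)))
assert np.allclose(relu_gemm_fused(A, B, C), np.maximum(A @ B + C, 0.0))
```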

    VLIW is like parallel kernel fusion: a TPU v2 had 2 matrix units, 2 vector units, 2 scalar units and 2 memory load/store units2. To keep them busy every step, the compiler needs to plan far enough ahead to give each of them something useful to do. VLIW instructions bundle those ops, along with any constants needed, into a single instruction. Fusion goes from being an optimization to being a necessity. Once you get it right, though, you can spend more like 50-60% of your silicon on the part you care most about, and that translates into an excellent total cost of ownership.

    Does this mean we should all be cancelling our Rubin orders and buying TPUs? I mean, no. But there is some nuance. Choosing between flexible streaming processors or efficient systolic megacores feels drastic, but I think it might not matter quite as much as it seems.

    Research still overwhelmingly benefits from flexibility. You are running experiments, solving bottlenecks and debugging. Nvidia tends to be the big lab tool of choice thanks to the flexibility, the depth of tooling and the general CUDA ecosystem3.

    If you are mainly serving a massive model, it’s worth the investment to lock down all the weirdness and optimize it. That’s where the megacore chips have proved their mettle first, with TPU, Inferentia4, MTIA and others all starting on that side of the house.

    Folks like Akeley and Buck realized that when you’re building a chip you’re really building a programming model. Get that right, and the model can long outlast the hardware. Balancing expressivity with performance is the thing that lets a platform win: who best lets researchers and engineers define the future without fighting the silicon.

    What seems to be emerging isn’t quite the SIMT/CUDA architecture: it’s something like expressing the dataflow of tiles in the critical kernels5, while relying on a compiler to optimize the larger graph and compute.

    Making sure that you have access to the right software might be more important than trying to perfectly identify which hardware platform is the once and future king. But also, look, the world moves fast and if you get a Prime Day deal on Trainium instances, you should probably just take it. The hardware can and will change and it can always be adopted, as the frontier labs are showing. If we keep hunting for the expressivity we need, as OpenGL, CUDA, Triton and others have over the years, we will keep unlocking the possibilities in whatever hardware is available.

    1. Disclosure: I work at Meta and like these chips a lot, though no one would let me anywhere near any chip design, luckily enough ↩︎
    2. Newer versions have others too, like the sparse cores in TPU v6 and v7 which are basically dedicated embedding management processors ↩︎
    3. With the notable exception of Google themselves, though the Jax-XLA-TPU ecosystem is very rich internally ↩︎
    4. Amazon remain undefeated at naming things ↩︎
    5. From system to VMEM on megacore approaches, from SMEM to registers on GPUs ↩︎

  • Comparative Human Advantage

    Back in 1817 David Ricardo published a very influential theory on an interesting question: Why trade, and particularly why trade when you are better at producing something than other countries?

    He gave an example of England and Portugal, in a world where there were just two goods, wine and cloth. In England it took 100 person-hours to make one unit of cloth, and 120 to make one unit of wine. The Portuguese, on the other hand, took 90 hours to make a unit of cloth and 80 to make a unit of wine. England is worse at making both wine and cloth, so why trade? Why doesn’t Portugal just make everything for itself?

    Well, it turns out that while England lacked the famed Portuguese efficiency, it was way worse at wine than it was at cloth. England could trade one unit of English cloth for one unit of Portuguese wine, which meant the wine cost them (effectively) 100 person-hours vs 120 they would have needed to make it themselves: a clear win! But Portugal won too: by focusing on wine rather than cloth they could trade 80 hours of work (for the wine) for some cloth that would have cost them 90 hours to make.
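
    The arithmetic, spelled out:

```python
# Hours of labor per unit of each good, using Ricardo's own numbers.
hours = {
    "England":  {"cloth": 100, "wine": 120},
    "Portugal": {"cloth": 90,  "wine": 80},
}

# Trade one unit of English cloth for one unit of Portuguese wine:
england_saves = hours["England"]["wine"] - hours["England"]["cloth"]
portugal_saves = hours["Portugal"]["cloth"] - hours["Portugal"]["wine"]

assert (england_saves, portugal_saves) == (20, 10)  # both sides come out ahead
```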

    Ricardo described this as a comparative advantage: by leaning into their relative specialties, countries could benefit from trade, even if they are generally more efficient than their competitors. This was a clever insight, globalization happened, and we eventually ended up with Temu.

    Of course, things are never quite as simple as economists’ models (annoyingly to economists the world over), and within his own life there were some interesting wrinkles. Sticking with the textiles theme, one of them happened to weavers: people who took thread and turned it into fabric. There was a period, shortly before Ricardo published his theory, that some call the Golden Age of the handloom weaver. Spinning, turning raw material into thread, had been mechanized thanks to the Spinning Jenny, which made yarn cheaply available. Weavers became the bottleneck in turning that yarn into saleable cloth. They worked from home, controlled their schedules, and made excellent money while doing so.

    What changed next was the power loom1. Using the hand loom required dexterity and practice to master the shuttle and weave, but the power loom just needed someone to mind it and occasionally unjam things. Weavers’ earnings collapsed from around 20 shillings a week in 1800 to 8 shillings by 1820. The power loom turned yarn into cloth efficiently and cheaply, without the need for years of deep skill and practice.

    Ricardo was, at the end of his life, right there to observe the start of this transition, and in the third edition of his book Principles of Political Economy he added a chapter titled “On Machinery”. Comparative advantage says that if a machine comes out that is better at some job humans should move to a place where they are comparatively better (like fixing the machine). Ricardo realized that machinery could increase the profit for the factory owner while decreasing the gross income to workers: it shifted returns from labor to capital. The power loom took the primary asset of the weavers, their dexterity and practice, and made it economically irrelevant.

    This feels worth discussing because in many ways software engineering has been going through a Golden Age of the handloom coder, particularly in the post-pandemic expansion from 2020-2022, where it was a very, very valuable skill indeed.

    While SWE wages have yet to collapse to shillings, there has been a definite cooling through rounds of layoffs and shifts to capital expenditure, accelerated by the adoption of strong coding models. Generating syntactically correct code has become way cheaper, and the bottleneck that was shipping code to production is shifting from writing code to proving it is correct. There is still a huge amount that hasn’t changed: identifying requirements, making choices on implementation paths, and thinking about the overall system, but slinging code is becoming a different job, quickly. The primary beneficiaries so far are those selling the pythonic power looms: the big labs and key tooling and hardware providers.

    In my own direct experience, coding assistance went from being a somewhat niche interest that required regular selling to VPs to keep them investing in it, to a top-level company mandate with accompanying metrics. The question I have found myself discussing with many smart engineers recently is: are we the weavers, or, you know, is everyone a weaver? Is this another industrial revolution like steam or electricity, or something perhaps even larger?

    Steve Newman of the Golden Gate Institute of AI2 (and one of the creators of Google Docs), wrote up one of the best “maybe it’s different this time” posts I’ve read in a bit, and not just because it involves robots mining Ceres3.

    https://secondthoughts.ai/p/the-unrecognizable-age "Presenting the case that the future will be unrecognizable"

    “I spend a lot of time in this blog arguing that AI’s near-term impact is overestimated, to the point where some people think of me as an AI skeptic. I think that predictions of massive change in the next few years are unrealistic. But as the saying goes, we tend to overestimate the effect of a technology in the short run, and underestimate it in the long run. Today, I’m going to address the flip side of the coin, and present a case that the long-term effect of AI could be very large indeed.”

    The core of Newman’s argument is that AI is the first technology we have developed that could, potentially, be more adaptive than we are. As a way of illustrating, let’s stick with what everyone comes to this blog for: 19th century weavers.

    Despite all of the above automation, weavers still had a role in more complex or limited-run designs where the expense and effort of setting up a power loom didn’t make sense. Then the Jacquard loom made the design flexible: you specified the design by punching holes in a card4 and the loom wove it. The comparative advantage shifted away from weaving entirely, into designing and encoding. Pattern designers became some of the first programmers of mechanical systems, as card punchers. The unique human advantage was adaptability: we added a level of flexibility, and the humans then adapted to work above this level.

    Newman argues that AI is a cognitive loom: the power loom replaced dexterity and practice, the Jacquard loom made it flexible and adaptable, but someone still needed to punch the cards. Humans adapted, and learned new skills. AI might be able to learn those new skills faster.

    “My point is simply that once AI crosses some threshold of adaptability and independence, there will be paths around the traditional barriers to change. And then things will really start to get weird.”

    This doesn’t inherently invalidate the idea of comparative advantage, but it might make it practically irrelevant if the market value of the human advantage drops below the cost of subsistence. If a future AGI’s opportunity cost is tiny, maybe there just isn’t enough left for humans when it comes to matters of substance.

    Comparative advantage is, fundamentally, about tradeoffs. Technology is our great lever of progress to remove some of those tradeoffs, but we have historically always run into more. Even if we were out mining asteroids with robots and building giant data centers autonomously there is still not infinite compute, and there is still not infinite time. There will always be some set of tradeoffs that have to be made, some range of competing options to choose between.

    What is valuable or notable in that environment can look markedly different. To look at the Victorians again, the art world was significantly impacted by the advent of photography, as (within certain bounds) it effectively solved realism. Artists responded by developing impressionism: the comparative advantage they retained was subjectivity and emotional context. Even the most opium-enhanced Victorian futurist would have had to be lucky to predict Cubism from reading about William Henry Fox Talbot.

    Humans do seem to me to have a comparative advantage in some areas, particularly:

    • Reality
    • Desires

    We are grounded as creatures in the world, not in textual or video inputs. We evolved in the world, and are richly adapted to it, in ways that are not always obvious, even to ourselves.

    We also tend to view intelligence as being coupled to wanting things, because things notably less intelligent than us seem to want things, and we certainly have any number of desires. It might be true that an AGI wants things, but it’s not clear that it must be true. I feel even more confident that on the way to AGI we will build some pretty powerful systems that don’t really “want things” in the same way we do: they may be agentic, but they are not truly agents with goals absent human input.

    Since we are already living in part of that future, I asked Gemini what it thought might be the human comparative advantage. As I hoped, it told me I was absolutely right:

    “Since we (AIs) are designed to serve human intent, the scarcest resource for us is accurate data on human preference. If you can predict what humanity will value in 10 years (e.g., “Will we value privacy or convenience more?”), that information would be incredibly valuable to a superintelligence trying to optimize its resources.”

    In a world of tradeoffs there will still have to be choices, and many of those choices are not easily or observably optimizable. Our ability to be in the world and have preferences might be the most valuable aspect of us after all. Maybe the role of the software engineer of the future, or perhaps of people of the future, isn’t so much doing work, or even managing work, but curating the work.

    One example of that kind of activity is a DJ: they create a vibe by arranging songs based on their taste and the response of the audience. Folks choose to go to certain DJs not because they are objectively better, but because they are who they are.

    This might sound a bit silly, but in practice much of modern work is not so much about doing the thing as it is about doing the thing a certain way. Still, is the future of humanity collectively making sure the vibes are right? From a certain point of view, what we have always done, collectively, is build a culture. And what is culture other than the right vibes? Perhaps our future is just a continuation of our history, with new technologies, and new tradeoffs.

    1. For a really detailed treatise on this whole idea, see Acemoglu and Johnson’s excellent article “Learning from Ricardo and Thompson: Machinery and Labor in the Early Industrial Revolution, and in the Age of AI”. ↩︎
    2. And one of the creators of Google Docs among other things ↩︎
    3. Beltalowda! ↩︎
    4. As an aside, this influenced various other uses of punch cards for data storage, leading to IBM, and from thence to the fact that your terminal defaults to 80-character width. ↩︎
  • Constraints & Orchestrators

    I recently read a few posts that helped connect the dots on why Python is a) so successful as the lingua franca of ML and b) likely to remain successful in the future1.

    ML code reads like one program, but runs many: CUDA kernels, vectorized CPU loops, graph compilers and a bunch of glue code moving data around and tying things together. Python has continually improved at balancing two somewhat competing challenges: constraining the hot path so compilers can optimize it and structuring an orchestration path so humans can reason about it.

    Hot Path

    constrained languages are easier to optimize by Jynn Nelson touches on this:

    we should not be asking “what language can i use everywhere for every purpose”; we should build meta-languages that allow you to easily use the right tool for the job. this is already true for regular expressions and query languages; let’s go further. i want inline futhark; inline CSS selectors; inline datalog; ffi between python and C that’s trivially easy. the easier we make it to interop, the easier it becomes to pick the right tool for the job.

    Compilers are generally going to perform better if you have regular shapes, minimal side effects, predictable memory access, and so on, but you want languages to be expressive and flexible, particularly when “research” is a big part of the work. In practice, that’s precisely what happens with ML: torch.compile lowers PyTorch graphs to an IR and (often) emits Triton kernels. Being able to hand off inner loops to specialized languages allows compilers and runtimes to optimize and target the use cases they are best at.
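    The pattern in the quote above already works in everyday Python: regular expressions and SQL are constrained mini-languages embedded in the host, each compiled and optimized by its own engine while Python merely orchestrates. A small stdlib-only illustration:

```python
import re
import sqlite3

# Two embedded mini-languages: the regex engine compiles the pattern into
# its own internal program, and SQLite plans and executes the query.
# Python only orchestrates; the constrained languages do the hot work.
pattern = re.compile(r"\b\w+ing\b")  # compiled once, reused
words = pattern.findall("tiling, fusing loops, and vectorizing kernels")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kernels (name TEXT, speedup REAL)")
conn.executemany("INSERT INTO kernels VALUES (?, ?)",
                 [("conv2d", 1.79), ("matmul", 1.05)])
fast = conn.execute("SELECT name FROM kernels WHERE speedup > 1.5").fetchall()
```

    The host language stays expressive and flexible; the sub-languages stay regular enough to optimize aggressively.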

    While this is (somewhat) clear for GPUs or other accelerators with distinctive programming models, I think it’s also largely true for getting the best out of modern CPUs. Daniel Lemire’s SEA 2025 talk covers nearly a decade of performance work and sums it up: modern CPUs do nearly as many instructions per cycle as you can feed them. To really maximize performance you need to batch work, reduce instruction counts, and vectorize. We can do some of that in the general Python2 runtime, but dynamic dispatch, aliasing, and side effects all make the job a lot harder. We can add speculative guards, which can be hard to reason about, or give up and lose performance. By having DSLs3 that add additional constraints, we give ourselves the ability to get much, much higher performance without sacrificing the overall flow of our program.
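    A stdlib-only illustration of the batching principle: handing a whole batch of work to a single C-level call removes the per-element interpreter overhead that a Python loop pays on every iteration, which is the same shift DSLs make at a much larger scale.

```python
from array import array

# A Python-level loop executes bytecode and dynamic dispatch per element;
# handing the whole batch to a C-level routine (sum over a typed array)
# removes that per-element overhead.
data = array("d", range(1_000_000))

def slow_total(xs):
    total = 0.0
    for x in xs:       # dynamic dispatch on every iteration
        total += x
    return total

fast_total = sum(data)  # one call, a tight C loop over a typed buffer
```

    Both produce the same number; only one of them keeps the interpreter out of the inner loop.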

    Orchestration Path

    Python is unusually good as an orchestrator. The language is highly readable at baseline, and as long as libraries and DSLs stay Pythonic they tend to inherit that intelligibility. The challenge with orchestration is coordinating work in such a way that your most precious resources are well utilized. The investments in Free-Threaded Python make it a lot cheaper to do concurrency, but they don’t magically fix the challenge of coordination.

    asyncio: a library with too many sharp corners covers some of the many failure modes the community has encountered with asyncio, and makes a case for Trio- or AnyIO-style structured concurrency that allows for manageable failure modes.

    asyncio is not a good library. It is constantly full of sharp edges everywhere with implementation details leaking and poorly designed APIs forcing end users into odd code patterns to avoid fundamental flaws in the interfaces.

    This is very much a readability version of the constraints concern on the hot path. Threads are a bad application-level abstraction over shared mutable state: reasoning about races and cancellation is hard, and the primitives are leaky. But threads are a perfectly fine implementation detail behind a more constrained API, like task groups, or actors, and so on.

    One area that I do think needs sustained improvement is how we debug and trace across this kind of setup: it’s been challenging even in a controlled environment to really understand how all the pieces interact in a reasonably scaled ML workload, and I imagine that problem will only get worse. But I also expect that the flexibility and breadth of Python will end up a boon there as well.

    1. Beyond just sheer momentum, of course. ↩︎
    2. Or any language! Certainly for some optimizations having a JIT for Python would (and does) make life easier. ↩︎
    3. Whether that is an embedded JIT like Triton or a library+execution engine like Polars. ↩︎
  • The Tools Are Made Up

    The Tools Are Made Up

    It has been hard to keep up with the flurry of strong agentic open-source models coming out of Chinese labs recently, including Moonshot’s Kimi K2, Z.ai’s GLM 4.5, and Qwen3-Coder1.

    Each of them has the mix of clever pre-training recipes and verifiable-rewards post-training. Notably, Kimi and GLM both use the Muon optimizer, which seems to be gaining ground among the OSS labs at least. GLM’s description of the recipe is as follows:

    Our base model undergoes several training stages. During pre-training, the model is first trained on 15T tokens of a general pre-training corpus, followed by 7T tokens of a code & reasoning corpus. After pre-training, we introduce additional stages to further enhance the model’s performance on key downstream domains. Unlike the earlier pre-training stage on large-scale universal documents, these stages leverage medium-sized domain-specific datasets, including instruction data.

    The additional stages, which they refer to as mid-training, extend the context window and help grow capabilities in specific domains. They then move to post-training, with SFT over reasoning and agentic traces followed by RL with Verified Rewards2.

    The Kimi-K2 technical report goes into more detail about how to actually train for tool use. Unlike the others, Kimi is not a reasoning model, so it doesn’t use much in the way of extended thinking. The fact that this wasn’t required to get to strong levels of tool use/agentic capability feels pretty notable to me: most of the recent3 agentic models have been built on a reasoning foundation.

    What I really found interesting from the Kimi report was the level of synthetic data that the team used. This starts in pretraining: to extend high-quality data sources, they rewrite them with another LLM, producing the same facts with new phrasing, instead of looping over the same “good” data for multiple epochs.

    Their approach to tool training takes this kind of idea even further:

    We construct a comprehensive tool repository through two complementary approaches. First, we directly fetch over 3,000 real MCP (Model Context Protocol) tools from GitHub repositories, leveraging existing high-quality tool specifications. Second, we systematically evolve 82 synthetic tools through a hierarchical domain generation process. We begin with key categories (e.g., financial trading, software applications, robot control), then evolve multiple specific application domains within each category. Specialized tools are then synthesized for each domain, with clear interfaces, descriptions, and operational semantics. This evolution process produces over 20,000 synthetic tools.

    They analyze a set of real tools, generate some novel (but derivative) ones, then domain-specialize them for a lot of use cases.

    Once they have this tool zoo, the actual training loop involves:

    1. Randomly sample a subset of tools and give them to a new agent with a fresh system prompt. Generate tool-appropriate tasks with explicit success rubrics.
    2. Run an LLM-driven user simulator to drive the agent, while running the tools in a sandbox that keeps state.
    3. Filter trajectories using another LLM as judge, keeping only the successful ones for SFT.
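    A toy skeleton of that three-step loop. Every function here is a stand-in: in the real pipeline, LLMs sit behind the task generator, the user simulator, and the judge, and the sandbox keeps real tool state. All names are illustrative.

```python
import random

# Stand-ins for the LLM-driven components of the loop described above.
TOOL_ZOO = [f"tool_{i}" for i in range(100)]

def make_task(tools):
    # real pipeline: an LLM writes a task + success rubric for these tools
    return {"tools": tools, "goal": f"use {tools[0]}"}

def run_agent(task, rng):
    # real pipeline: a user-simulator LLM drives the agent in a sandbox
    calls = [rng.choice(task["tools"]) for _ in range(3)]
    return {"task": task, "calls": calls}

def judge(traj):
    # real pipeline: an LLM-as-judge scores the trajectory vs. the rubric
    return traj["task"]["tools"][0] in traj["calls"]

rng = random.Random(0)
sft_data = []
for _ in range(200):
    tools = rng.sample(TOOL_ZOO, 5)           # 1. sample a tool subset
    traj = run_agent(make_task(tools), rng)   # 2. roll out in the sandbox
    if judge(traj):                           # 3. keep only successes
        sft_data.append(traj)
```

    The filtered trajectories become the SFT corpus; the rejected rollouts are simply discarded.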

    They’re using models at every stage to generate data and evaluate options. When it comes to the actual RL training, they baseline on verifiable rewards wherever possible. They, and the Qwen folks, talk about their simulator setup for code4: thousands of sandbox environments.

    For software engineering tasks, we collect a vast amount of pull requests and issues from GitHub to build software
    development environment that consists of user prompts/issues and executable unit tests. This environment was built on a robust sandbox infrastructure, powered by Kubernetes for scalability and security. It supports over 10,000 concurrent sandbox instances with stable performance, making it ideal for both competitive coding and software engineering tasks

    The combination of very sophisticated synthetic data and operationally intense sandboxes seems like table stakes for the current agentic game, and a lot of labs have figured it out. That feels very promising for growth in the capabilities of these models over time, particularly as we work out how best to distill them down to smaller sizes for inference.

    1. Which seems a very solid model, but they haven’t released a lot of extra details about how they got there. One interesting component of the release though was that they forked Gemini CLI to make a qwen-code tool that works with any OpenAI compatible API, and I had some success locally plugging it into the smaller Qwen3 (non-coder) releases in case you were looking for some offline agentic capabilities! ↩︎
    2. Then GLM is distilled between the RL and base version of the model, which apparently helps generalize. This seems like a fun and relatively simple way of smoothing out the learning. ↩︎
    3. Though Claude 3.5 wasn’t, and that is really the trend-setter here I guess! ↩︎
    4. And other tasks that allow fully verifiable rewards. They use other models to score softer domains like creative writing. ↩︎

  • RL in the second half

    The Second Half – Shunyu Yao – 姚顺雨

    Extremely interesting post by Shunyu Yao of ReAct and Tree of Thought fame about where we got to with AI and where we are going. Read it for the spot-on description of the weirdness of reasoning as an RL concept, but my main takeaway was the refinement of the idea that the most important thing is having models that “do the benchmarks”.

     To recap the game of the first half:

    • We develop novel training methods or models that hillclimb benchmarks.
    • We create harder benchmarks and continue the loop.

    This game is being ruined because:

    Even if we create harder benchmarks, pretty soon (and increasingly soon) they get solved by the recipe. […]

    The recipe has essentially standardized and industrialized benchmark hillclimbing without requiring much more new ideas. As the recipe scales and generalizes well, your novel method for a particular task might improve it by 5%, while the next o-series model improves it by 30% without explicitly targeting it.

    The post makes the point that the gap is benchmarks that more closely map to real-world problems.

    when the intelligence is low, improving intelligence generally improves utility. But now, the general recipe is guaranteed to work under these assumptions. So the way to play the new game of the second half is

    • We develop novel evaluation setups or tasks for real-world utility.
    • We solve them with the recipe or augment the recipe with novel components. Continue the loop.

    Shunyu works on computer use at OpenAI, so this is well within his wheelhouse, and I think it’s a compelling claim. Many folks1 have talked about the capability overhang of LLMs: there is a large baseline ability to do things in the models, but eliciting that ability can be challenging. I tend to think of that similarly to how there are many words which we can comfortably understand, but are very unlikely to use ourselves in conversation2. RL is our most powerful tool for eliciting capabilities, but it’s a somewhat blunt instrument. Having the right evals, eval environment, and tasks helps train the agent to interact in a way which generalizes.

    I wonder if, as we progress through this second phase, we will find signs in this elicitation of a similar “universal geometry” to the one suggested for pretraining: perhaps there is eventually a universal navigation3 towards where to generate in that space for different tasks. Maybe that’s what we’ll call AGI!

    1. Jack Clark particularly ↩︎
    2. Receptive vocabulary vs. productive vocabulary. ↩︎
    3. A universal geometry of a vector field? ↩︎
  • Reinforcement Learning Continues To Be The Frontier

    Back in 2021, OpenAI nixed its robotics team, leading to comments on Hacker News like “Reinforcement learning itself is a dead-end on a road to AI”. Now, in 2025 we are surrounded by RL post-trained reasoning models and Mary Meeker is using the word “unprecedented” a lot. This kind of skepticism/hype overlap is very common right now, as Helen Toner breaks down in her excellent recent post/talk on unresolved questions in AI:

    Last year, we had coverage from the Wall Street Journal—really good reporting—about real challenges inside OpenAI with scaling up their pre-trained models and how difficult that was and how they weren’t happy with the results, and then on the literal same day we had the release of o3, the next generation of their reasoning model, and François Chollet—who’s famously skeptical—saying that it was a significant breakthrough on his ARC-AGI benchmark. So these very contradictory takes, both of which had some truth to them.

    The framing used in that post is really useful: it’s less about “are we making progress?” and more “are we on the right branch of the tech tree?”

    A lot of people thought RL was the wrong branch: after notable successes from DeepMind and OpenAI, RL had become a bit of a backwater, with some resurgence (in a limited form) via Reinforcement Learning from Human Feedback (RLHF) for preference-tuning LLMs.

    The reason people keep coming back to reinforcement learning is the ability to discover new things. Supervised learning is somewhat inherently bound by the dataset. A reinforcement process can continue to explore and find new strategies, like the famous examples of AlphaGo choosing moves humans wouldn’t have. Tim Lee has an excellent non-technical introduction to the evolution of RL that mentions this: Reinforcement Learning Explained

    In short, imitation learning can rapidly teach a model to mimic the behaviors in its training data, but the model will easily get confused in unfamiliar environments. A model trained with reinforcement learning has a better chance of learning general principles that will be relevant in new and unfamiliar situations

    In this direction, a recent paper, [2507.00432] Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning, suggests1 that reasoning generalizes better from RL-driven learning than from supervised fine-tuning.

    RL-tuned models achieve significant gains on math reasoning while preserving positive transfer to other reasoning tasks and non-reasoning tasks, whereas SFT often incurs negative transfer on non-reasoning benchmarks. Second, PCA analysis of latent space confirms that RL induces minimal drift from backbone representations thus maintaining feature stability, while SFT produces larger latent shifts, especially in non-reasoning domains. Third, token-distribution analysis shows that RL selectively adjusts only a handful of task-relevant tokens, whereas SFT perturbs many irrelevant tokens, indicating RL’s more targeted optimization.

    RLHF is implemented by first training a reward model based on human preference feedback: you give people two versions of an answer, they tell you which one they prefer, you then train a model to predict those ratings. That reward model becomes the scoring function during post-training.
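    The pairwise objective behind such reward models is typically the Bradley-Terry loss: maximize σ(r_chosen − r_rejected). A toy version, with a linear dot-product “reward model” standing in for the fine-tuned LLM that real pipelines use:

```python
import math

# Bradley-Terry pairwise preference loss on a toy linear reward model.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reward(weights, features):
    return sum(w * f for w, f in zip(weights, features))

def preference_loss(weights, chosen, rejected):
    # -log sigma(r_chosen - r_rejected): small when the model ranks the
    # human-preferred answer above the rejected one by a wide margin
    margin = reward(weights, chosen) - reward(weights, rejected)
    return -math.log(sigmoid(margin))

# manual gradient descent on a single human comparison
w = [0.0, 0.0]
chosen, rejected = [1.0, 0.0], [0.0, 1.0]
lr = 1.0
for _ in range(50):
    margin = reward(w, chosen) - reward(w, rejected)
    grad_scale = sigmoid(margin) - 1.0  # d(loss)/d(margin)
    w = [wi - lr * grad_scale * (c - r)
         for wi, c, r in zip(w, chosen, rejected)]
```

    After a few steps the model scores the chosen answer above the rejected one, which is all the RLHF policy-optimization stage needs from it.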

    Designing good reward functions has been somewhat of a dark art in RL. The agent optimizes what you ask for, which is not always what you really want2. This “reward hacking” phenomenon makes RL agents somewhat brittle, prone to exploiting loopholes in environments in ways no one anticipated.

    The recent reasoning models did so well because their rewards were verifiable: reward scores that are based on some ground truth validation and are often just yes/no: does code compile, does it pass a unit test, can a math proof be verified by a formal logic reasoner, or simply is the answer correct or not. Nathan Lambert did a breakdown on where RL goes next:

    The optimistic case for scaling current reinforcement learning with verifiable rewards (RLVR) techniques to next-generation language models, and maybe AGI or ASI depending on your religion, rests entirely on RL being able to learn on ever harder tasks. Where current methods are generating 10K-100K tokens per answer for math or code problems during training, the sort of problems people discuss applying next generation RL training to would be 1M-100M tokens per answer.
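    A verifiable reward in the yes/no sense described above can be as small as running candidate code against ground-truth tests. A toy sketch (real pipelines run this inside a sandbox rather than a bare `exec`):

```python
# A verifiable reward needs no learned judge: execute the candidate
# against ground truth and return 0 or 1.
def verifiable_reward(candidate_src, tests):
    namespace = {}
    try:
        exec(candidate_src, namespace)   # run model-generated code
        fn = namespace["solve"]
        return 1.0 if all(fn(x) == y for x, y in tests) else 0.0
    except Exception:
        return 0.0                       # crashes score zero too

tests = [(2, 4), (3, 9), (10, 100)]
good = "def solve(x):\n    return x * x"
bad = "def solve(x):\n    return x + x"
rewards = [verifiable_reward(good, tests), verifiable_reward(bad, tests)]
```

    The signal is sparse and binary, but it is grounded: there is no reward model to hack.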

    Lambert makes the point that even the very long-range tasks we have now (coding agents, deep research) are based around learning to be better at tasks individually, then stringing those together:

    How to read this training method, which is likely similar for agents like Claude Code or Codex, is that current RL methods are helping the models get more robust at individual tasks that make up a longer trajectory rather than being trained on the end result of the trajectory itself. The final long-horizon behavior is put together with prompting and letting the model run longer, not sparse credit assignment. In the case of Deep Research the final measure of performance would actually look far closer to human preferences than verifiable rewards, and a large portion of that applies for Claude Code as well, where multiple solutions could solve a problem and it falls to human taste to say which is the best.

    This problem of having to learn to act over a long time-horizon is a recurring one in RL. The best algorithms we have for reinforcement learning are online: the model learns “live” while interacting with the environment. But sometimes it’s a lot easier to collect data than it is to run an experiment: for example, it’s much safer to get a large amount of sensor input from driving cars around than it is to have a model driving a real car around and making mistakes. This is off-policy or offline RL, and it offers the promise of learning from much larger data sets.

    Seohong Park recently wrote a great post breaking down how offline RL fails to scale up: Q-Learning Is Not Yet Scalable3. In the experiment, the team at Berkeley generate 1000x more data to try to scale offline RL, and still see the process breaking down:

    Q-learning struggles to scale because the prediction targets are biased, and these biases accumulate over the horizon. The presence of bias accumulation is a fundamental limitation that is unique to Q-learning (TD learning). For example, there are no biases in prediction targets in other scalable objectives (e.g., next-token prediction, denoising diffusion, contrastive learning, etc.) or at least these biases do not accumulate over the horizon (e.g., BYOL, DINO, etc.).
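    A toy numeric illustration of that accumulation. This is not Q-learning proper, just its biased ingredient repeated along a chain: each backup takes a max over noisy value estimates, and since the expected max of zero-mean noise is positive, the overestimation compounds with the horizon.

```python
import random

# Every state in this chain has true value 0, yet the bootstrapped
# estimate drifts upward because each backup is max(noisy estimates).
rng = random.Random(0)
HORIZON, ACTIONS, NOISE, TRIALS = 20, 5, 0.5, 2000

def biased_backup():
    v = 0.0  # the estimate at the end of the chain is exact
    for _ in range(HORIZON):
        # Q-learning-style target: max over noisy action-value estimates
        v = max(v + rng.gauss(0.0, NOISE) for _ in range(ACTIONS))
    return v

estimates = [biased_backup() for _ in range(TRIALS)]
mean_error = sum(estimates) / TRIALS  # grows with HORIZON instead of ~0
```

    A longer chain means more backups, and more backups mean more accumulated bias; next-token prediction has no analogous bootstrapping step, which is the contrast the quote draws.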

    Noted LLM-branch skeptic (and technically a very distant colleague) Yann LeCun has spoken a lot about a version of this kind of planning and world modelling problem, which he sees as inherent to the autoregressive nature of LLMs: the accumulation of errors over long time horizons.

    One of his architectural bets is JEPA, and the recently released V-JEPA 2 paper is beginning to show how this could work. V-JEPA 2 is a self-supervised video world model trained on a million hours of YouTube video. The model learns by masking out parts of video frames and predicting them in latent (embedding) space rather than pixel space. After the pre-training, they freeze the encoder, generate tokens with it for a video, and prepend those to a query for a pretrained LLM4. They fine-tune that LLM on video question-answering data, and were able to get state-of-the-art question answering with that setup, despite the JEPA part of it being totally task-agnostic.
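    The masked-latent objective can be sketched schematically. The encoder and predictor here are trivial stand-ins for the real networks; the point is only where the loss lives: between predicted and target latents, never pixels.

```python
# Schematic of a JEPA-style objective: mask a patch, encode both views,
# and score the prediction in embedding space.
def encoder(patch):
    # stand-in: a real model maps a patch to a learned embedding
    return [sum(patch) / len(patch), max(patch)]

def predictor(visible_latents):
    # stand-in: predict the masked patch's latent from the visible ones
    dims = zip(*visible_latents)
    return [sum(d) / len(visible_latents) for d in dims]

frame = [[0.1, 0.2], [0.2, 0.3], [0.3, 0.4], [0.2, 0.2]]  # 4 "patches"
masked_idx = 2
visible = [p for i, p in enumerate(frame) if i != masked_idx]

target = encoder(frame[masked_idx])            # latent of the hidden patch
pred = predictor([encoder(p) for p in visible])
latent_loss = sum((a - b) ** 2 for a, b in zip(pred, target))
```

    Because the loss never touches pixels, the model is free to ignore unpredictable pixel-level detail and spend capacity on structure.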

    Going a step further, they took the encoder and hooked it up to a small robot control model5. They trained it on some robot operator data for pick-and-place tasks. It learned to do a remarkably good job, without any reinforcement learning at all!

    This is interesting because robotics has traditionally been an area where we have seen a lot of exploration (with success and disappointment!) with long-range RL. Andrew Stokols’ excellent post on ChinaTalk makes a good case that while the west has focused on AI in a brain-in-a-jar type way, there has been a concerted push in Beijing for Embodied AI (with Chinese Characteristics). China has a very strong base in manufacturing. Robotics, drones, autonomous vehicles are all being developed and deployed in the country.

    One of the fundamental challenges robotics systems have to address is much more constrained latency bounds: the world operates in real time, and running a big model may result in a smart robot that simply cannot respond quickly enough to be useful. The space has trended towards hierarchical models, which chunk actions into higher-level concepts that a controller model puts out (like “pick up at x”) and lower-level models that decode those chunked outputs into a series of motor commands. While sometimes transformers are used autoregressively (take sequences of state and action, and predict the next action), many now use diffusion-based techniques that generate a whole trajectory at once. Physical Intelligence recently put out a paper on Real Time Chunking where they show you can start by generating a chunk, then continue the denoising process à la inpainting or fill-in-the-middle to generate the steps between the start and goal, allowing more real-time responses.

    China putting a lot of eggs in the embodied-AI basket is, indirectly, also a bet that methods to make those systems learn and adapt will mature. Some of those same techniques will invariably apply to the (disembodied) agents that are currently the focus of the big labs in the west.

    1. One of the ways they corroborate this finding is by seeing there is less KL divergence in the RL trained model than the SFT model, but that’s usually a training objective on RL, and I’d imagine you could apply KL regularization to SFT as well if you wanted. ↩︎
    2. A classic example from OpenAI: A reinforcement learning agent in a boat race game was given points for hitting targets, so it happily learned to drive in circles hitting the same targets forever, instead of actually finishing the race. Faulty reward functions in the wild | OpenAI ↩︎
    3. Q-Learning is the most common class of algorithms for offline RL. ↩︎
    4. They unsquash it into the hidden dimension size, and depending on how the numbers work out add some pooling. ↩︎
    5. Much like with the LLM, they combine the video embeddings with model-specific tokens, in this case a state tracking input and the current state of the robot arm. ↩︎
  • Harnessing the Universal Geometry of Embeddings

    Harnessing the Universal Geometry of Embeddings

    What content should you include in an LLM prompt? Many interesting use cases (enterprise tools, coding assistants) have more content than they can handle at once, so you chunk it up, turn each chunk into a vector with some sentence‑encoder, and store those vectors in a database. Later you vector‑search, pull back the relevant chunks and feed them to the LLM — better known as the RAG pattern.
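    A minimal sketch of that pattern, with a bag-of-words counter standing in for a real sentence encoder:

```python
import math
from collections import Counter

# Toy RAG: chunk -> vector, store, search, build prompt. The Counter
# "embedding" is a stand-in for a real sentence-encoder model.
def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

chunks = [
    "the build step caches dependencies in ci",
    "lunch orders close at noon on fridays",
    "embeddings are stored in the vector database",
]
index = [(embed(c), c) for c in chunks]          # the "vector database"

query = embed("when do lunch orders close")
best = max(index, key=lambda item: cosine(query, item[0]))[1]
prompt = f"Context: {best}\n\nQuestion: when do lunch orders close?"
```

    Only the retrieved chunk reaches the LLM; the stored vectors are what the privacy argument below is about.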

    The working assumption has been: those vectors are fairly safe. A 768‑dimensional point doesn’t look like the text “Ian ordered a burger at 12:07”, so storing raw embeddings seems privacy‑preserving.

    But are they? In Cornell’s Harnessing the Universal Geometry of Embeddings paper, the authors train vec2vec, a small GAN that learns to translate embeddings from encoder A’s space into encoder B’s space, without seeing the original sentences. Once you’re in encoder B‑land you can recover up to 80% of the underlying text:

    Inversion, i.e., reconstruction of text inputs, is more ambitious than attribute inference. vec2vec translations retain enough semantic information that off-the-shelf, zero-shot inversion methods […] extract information for as many as 80% of documents given only their translated embeddings, for some model pairs (Figure 5). These inversions are imperfect and we leave development of specialized inverters for translated embeddings to future work. Nevertheless, as exemplified in Figure 6, they still extract information such as individual and company names, dates, promotions, financial information, outages, and even lunch orders. In Appendix E, we show the prompt we use to measure extraction.

    The paper suggests that most sentence encoders trained with similar objectives on sufficiently diverse data come up with embeddings which resemble each other. Concepts, topics (and lunch) live on a shared manifold1; the models just might position them differently in embedding space. Vec2vec is a learned coordinate transform.
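    A 2D toy of such a coordinate transform: if encoder B’s space were an exact rotation of encoder A’s, a linear map fitted from a couple of anchor points would translate every embedding. (vec2vec learns its map with a GAN and without paired anchors; this supervised toy only shows the geometry.)

```python
import math

# Pretend encoder B's space is encoder A's space, rotated by a hidden angle.
theta = 0.7
def encoder_b(vec_a):
    c, s = math.cos(theta), math.sin(theta)
    return [c * vec_a[0] - s * vec_a[1], s * vec_a[0] + c * vec_a[1]]

# Fit a 2x2 linear map from two independent anchor pairs. With basis
# vectors as anchors, the columns of the map are just their images.
anchors = [[1.0, 0.0], [0.0, 1.0]]
images = [encoder_b(a) for a in anchors]
M = [[images[0][0], images[1][0]],
     [images[0][1], images[1][1]]]

def translate(vec_a):
    # apply the learned coordinate transform to any A-embedding
    return [M[0][0] * vec_a[0] + M[0][1] * vec_a[1],
            M[1][0] * vec_a[0] + M[1][1] * vec_a[1]]

unseen = [0.6, -0.8]   # never used in fitting
```

    Once the transform is known, every point in A-space, seen or unseen, lands where encoder B would have put it.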

    What this might imply is that if you train a model with similar objectives on data sampled from a similar generating function, you will arrive at a manifold in latent space that is geometrically similar to anyone else’s doing the same thing. If that is true, operations in latent space start to look less model-specific, and approaches that navigate them (like JEPA, LDM editing) could learn to operate across different models with just an adapter layer.

    To be clear, the paper is not saying this: the authors only align English, contrastive‑loss, transformer sentence encoders, with no decoder models and hardly any dimensionality mismatch. The phrase “universal geometry2” may be a stretch: their GAN training also requires quite a bit of run cherry-picking3, and when they tried cross-modality the results weren’t as strong. But it’s a very interesting idea worth further investigation.

    In the short term, this is probably mildly alarming for coding agent customers that are worried about their source code leaking, but in the long term I hope we can see some more investigation into how true this is in more general modeling, and what kind of opportunities that might open up!

    1. Shape in the embedding space. In practical experience, when you have a large embedding space it’s mostly empty, and all the actual data lives on a plane in the space. This is why things like latent diffusion models work: they learn to navigate towards that plane from any random noisy point in the space. ↩︎
    2. But it’s a great title. ↩︎
    3. My understanding is unstable training is a very common problem for GANs. ↩︎
  • How to build unmaintainable kernels

    What do you need to do to get better performance and GPU efficiency out of your model? The GPU-oriented folks at Stanford recently published an early preview of the work they have been doing on the LLM generation of kernels: Surprisingly Fast AI-Generated Kernels We Didn’t Mean to Publish (Yet) – and they have a list:

    • Memory Access Optimization: improving the efficiency of data movement between different memory hierarchies (global memory, shared memory, registers) and ensuring data is accessed in a way that maximizes bandwidth and minimizes conflicts.
    • Asynchronous Operations & Latency Hiding: hiding the latency of slow operations (like global memory access) by overlapping them with computation or other memory transfers.
    • Data Type & Precision Optimization: using lower-precision data types (like FP16 or BF16) where possible to reduce memory bandwidth requirements, increase cache effectiveness, and potentially leverage specialized hardware units.
    • Compute & Instruction Optimization: making the arithmetic computations themselves more efficient, reducing instruction count, or leveraging specialized hardware instructions.
    • Parallelism & Occupancy Enhancement: maximizing the number of active warps on the Streaming Multiprocessors (SMs) to better hide latencies and improve overall throughput.
    • Control Flow & Loop Optimization: reducing the overhead associated with loops, branches, and indexing calculations.

    That’s a good list! In this case though, it emerged not from (just) talking with kernel experts, but also from developing a model to generate kernels:

    We have some very fast AI-generated kernels in pure CUDA-C without using libraries and DSLs such as CUTLASS and Triton. They are performing close to or in some cases even beating the standard expert-optimized production kernels shipped in PyTorch.

    They developed a very straightforward but smart pattern for structuring test-time compute. They reason about optimizations in natural language before generating code. Then, they branch out into a tree structure of refinements for each optimization idea, to avoid getting stuck in loops of investigation.
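    A minimal sketch of that branching pattern (the function names and scoring interface are my own, not from the paper): each candidate kernel spawns one refinement branch per natural-language optimization idea, and the best-scoring candidate anywhere in the tree wins.

    ```python
    def tree_search(seed, propose_ideas, refine, score, width=3, depth=2):
        """Toy branching test-time compute: each optimization idea gets its
        own refinement branch, so one bad idea can't trap the whole search."""
        best, best_score = seed, score(seed)
        frontier = [seed]
        for _ in range(depth):
            nxt = []
            for cand in frontier:
                for idea in propose_ideas(cand)[:width]:  # one branch per idea
                    child = refine(cand, idea)
                    nxt.append(child)
                    s = score(child)
                    if s > best_score:
                        best, best_score = child, s
            frontier = nxt
        return best
    ```

    With candidates as plain numbers standing in for kernel variants, two levels of two branches explore four leaf refinements and return the best one.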

    The kernels they generated were somewhere between fast and very fast:

    Conv2D: 179.9% performance of FP32 torch.nn.Conv2D; problem size: (100, 3, 224, 224) input tensor, conv(in_channels=3, out_channels=96, kernel_size=11, stride=4, padding=2)

    The team aren’t claiming this is a general solution, just an interesting proof of possibility, which it certainly is! The walk-through of how they got to the final Conv2D kernel is fascinating, both in terms of human intervention and the chain of optimizations.

    The final code sample for the Conv2D kernel is included in the appendix. It uses advanced CUDA techniques that we find challenging to write ourselves! We also have more example kernels in this Github repo

    The kernel is very fast for specific shapes, on the L40S, in FP32. It’s also a kernel that, by the sounds of it, the team themselves struggled a bit with. It’s very, very specialized. It’s not that a human couldn’t have built it, it’s that (in most cases) they wouldn’t: it’s not a priority kernel, and all that clever CUDA comes with operational overhead, ties to specific hardware, shapes and so on.

    That in itself isn’t new. If you compile with PyTorch or XLA you’ll get a lot of kernels you probably wouldn’t write yourself, but this adds a new (and weird!) layer of non-determinism to everything. Elsewhere at Stanford, they have been looking at one of the other killers of GPU efficiency: kernel launch overhead. Most models are represented by hundreds of kernels, each of which has to be scheduled from the CPU. LLMs are generally memory-bound, small ones particularly so, and the gaps between kernel executions can end up dominating performance:

    Look Ma, No Bubbles! Designing a Low-Latency Megakernel for Llama-1B:

    In this post, we show how we can bypass this problem by merging the entire Llama-1B forward pass into a single “megakernel” that eliminates kernel boundaries altogether. Doing this achieves brr – on an H100, we use 78% of memory bandwidth and outperform existing systems by over 1.5x. (To our knowledge, this is the lowest-latency forward pass for Llama-1B in bfloat16!) In the rest of this post, we’ll walk through how and why one would do this.
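    To see why those bubbles matter so much for a small model, here is a hedged back-of-envelope (the per-launch gap and kernel count are illustrative numbers of my own, not measurements from the post):

    ```python
    def launch_gap_fraction(kernels_per_step, gap_us, compute_us_per_step):
        """Fraction of a forward pass spent in between-kernel gaps
        rather than useful work (all times in microseconds)."""
        gap_total = kernels_per_step * gap_us
        return gap_total / (gap_total + compute_us_per_step)

    # Hypothetical small-model numbers: a few hundred launches per token,
    # ~5 us of gap each, ~1 ms of actual compute per token.
    frac = launch_gap_fraction(kernels_per_step=300, gap_us=5.0,
                               compute_us_per_step=1000.0)  # → 0.6
    ```

    Under those (made-up) numbers, 60% of the step is bubbles, which is why fusing everything into one kernel can beat existing systems by such a large margin.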

    The idea of megakernels that handle all of the operations is not new, but the complexity of fusing everything together is high. Persistent kernels were popularized at the tail end of the CUDA 11 series, due to the residency and async copy support that arrived in Ampere. They allow leaving kernels resident on an SM and having them pull a series of tiles to do their work, rather than scheduling repeated launches of the same kernel. The megakernel takes this idea even further, with multiple operations within the kernel pulling a stream of different problems. One issue with this approach (traditionally) is register spilling: you only have so many registers available, up to 255 per thread, though with a fairly high overall limit of 64K 32-bit registers per SM (on Hopper). That means you need to keep some data in shared memory, and efficient use of shared memory ends up being the bottleneck. The team at Stanford developed paging for shared memory, with a separate reserved block for managing allocations of shared memory to individual tasks.
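    A toy model of that shared-memory paging idea (the page size, pool size, and API here are my invention; the real implementation manages hardware shared memory from inside the kernel):

    ```python
    class SharedMemPager:
        """Toy paging of a fixed shared-memory pool: the pool is split into
        fixed-size pages and tasks check pages out and back in, rather than
        each op statically claiming a large contiguous slab."""

        def __init__(self, pool_bytes, page_bytes):
            self.page_bytes = page_bytes
            self.free = list(range(pool_bytes // page_bytes))  # free page ids

        def alloc(self, nbytes):
            npages = -(-nbytes // self.page_bytes)  # ceiling division
            if npages > len(self.free):
                return None  # task must wait for pages to be released
            return [self.free.pop() for _ in range(npages)]

        def release(self, pages):
            self.free.extend(pages)
    ```

    The payoff is that two tasks with bursty shared-memory needs can time-share a pool neither could statically own outright.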

    This gets the CPU completely out of the picture for the forward pass, but is incredibly specific to the model (in this case Llama 3.2 1B).

    Another collaboration was clearly thinking in the same direction, as they recently posted about their Mirage Persistent Kernel: a compiler for megakernels.

    our team from CMU, UW, Berkeley, NVIDIA, and Tsinghua developed Mirage Persistent Kernel (MPK) — a compiler and runtime system that automatically transforms multi-GPU LLM inference into a high-performance megakernel. MPK unlocks the benefits of end-to-end GPU fusion while requiring minimal manual effort from developers.

    The system works by building a task graph of what they call LAX1 fragments, which in practice is a very short list of 14 operators. This is actually too small to represent everything they need, meaning they have to manually decompose some common ops like ReLU, but this level of decomposition gives them the ability to do some pretty complex fusions.
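    As an illustration of that kind of decomposition, here is a hypothetical sketch of ReLU lowered onto two simpler elementwise primitives (the actual LAX operator set isn’t listed in the post, so these primitives are stand-ins):

    ```python
    # Hypothetical primitives: elementwise multiply and a zero-threshold mask.
    def ew_mul(a, b):
        return [x * y for x, y in zip(a, b)]

    def ew_gt_zero(a):
        return [1.0 if x > 0 else 0.0 for x in a]

    def relu_decomposed(a):
        """ReLU expressed as a composition of simpler ops:
        relu(x) = x * (x > 0), giving the compiler two fusable pieces."""
        return ew_mul(a, ew_gt_zero(a))
    ```

    Breaking a familiar op into pieces like this is what lets the compiler fuse its halves into neighboring operators.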

    The actual ops are generated thanks to Mirage’s Kernel Superoptimizer (a great name), which I think is a very intense autotuner:

    Mirage automatically searches for possible GPU kernels for attention. The search space includes existing manually designed attention kernels (e.g., FlashAttention and FlashDecoding) as special cases. It also includes other implementations that outperform today’s handwritten ones by up to 3.5x for certain use cases. The GPU kernels generated by Mirage can directly operate on PyTorch tensors and be called in your PyTorch program.

    The search is not cheap though:

    In our evaluation, Mirage takes up to 4 hours to optimize a Lax program. This optimization is a one-time cost before deployment on the target hardware.

    The aggressive decomposition allows them to have a clever verification scheme where they validate kernels on random inputs to get confidence in (approximate) numerical correctness.
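    A sketch of that style of randomized check, under my own assumptions about the interface (the paper’s actual tolerance and sampling scheme may differ):

    ```python
    import random

    def probably_equivalent(candidate, reference, n_inputs=32, dim=8, tol=1e-5):
        """Monte-Carlo check: run both implementations on random inputs and
        compare outputs within a tolerance. Passing gives confidence in
        (approximate) numerical correctness, not a proof of it."""
        for _ in range(n_inputs):
            x = [random.uniform(-10, 10) for _ in range(dim)]
            a, b = candidate(x), reference(x)
            if any(abs(u - v) > tol for u, v in zip(a, b)):
                return False
        return True
    ```

    Because the ops are aggressively decomposed, each piece under test is small, which is what makes this kind of sampling-based validation give meaningful confidence.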

    They then build a worker kernel with all the relevant operations, and schedule the optimized graph via dedicated scheduler warps. Workers are scheduled on the SMs and report back status. The scheduler warps then decide when tasks can be enqueued for execution.

    They’ve got a code example that walks through setting it up for Qwen. They recreate the model structure explicitly, generate a task graph from it, and kick off the search for optimal kernels and fusions. This avoids the need to solve the Dynamo-style problem of tracing the model!

    The resulting kernel is again heavily tied to the specific hardware and model. One thing we have found useful for investigating production problems is the ability to ablate different parts of the compile process, running models in (basically) PyTorch eager mode. This approach leaves the darkest of black boxes to work with, and I would imagine produces even more terrifying PTX than the complex CUDA that the LLM kernel generation team came up with.

    Between these projects though, it feels like we are exploring the edges of what running a program on GPUs should actually look like: a combination of kernel generation and multi-level search seems almost guaranteed to yield optimizations that would be far outside the cost-benefit curve for manual implementation. What we don’t have yet are known ways to operationalize this kind of approach, but it’s an exciting area to watch!

    Thanks to Matt for nudging me to actually read through these papers, they’d been on my to-do list for a bit!


    1. I am not sure what this stands for, but the basic ops in JAX live in jax.lax, so I presume it’s the same source! ↩︎

  • AbsenceBench

    https://arxiv.org/abs/2506.11440

    Simon Willison has a good summary:

    Long context models have been getting increasingly good at passing “Needle in a Haystack” tests recently, but what about a problem in the opposite direction?

    The answers are surprisingly domain-specific; some models do great on numeric sequences but most are pretty bad at code!

    The authors posit that attention is just a worse mechanism for seeing what’s missing vs what’s there. For me this rhymes with the experience of folks doing agentic coding assistant work: it’s beneficial to clear the context window more often than you think, as the models strongly prefer to use what is already in there.

    This feels like a learned or tuned behavior, a flavor of “the model does the eval”. Models will probably get better at this problem now that it’s legible, but is there a tradeoff that has to be made?

    Pretraining is somewhat saturating, but we have oodles of post-training (which includes context extension), the whole meta-RL process of researchers trying different data mixes and algorithm/architecture tweaks, and inference time search options.

    If OpenAI had Anthropic’s data and evals, would they have as good an agentic coding model? And vice versa, would Opus be as good at deep research as o3? I honestly don’t know: in the end compute will always be finite and we have to allocate it with some end in mind. It feels very plausible there is no globally optimal scaling law for how you prioritize different model capabilities. But the models will probably do this eval.

  • A patchwork quilt view of AI Alignment

    https://arxiv.org/abs/2505.05197

    Very interesting paper from folks at DeepMind arguing that the idea of convergence on a single, coherent value set doesn’t reflect society, and is not the only way to think about AI morality and alignment.

    Think of society as a patchwork quilt composed of diverse communities, each with its own internal norms and expectations. Within these communities—e.g. trade unions enforcing strike discipline or religious groups defining attendance practices—members understand the specific rules and the social consequences for deviating (Bicchieri, 2005; Ostrom, 1990; Ullmann-Margalit, 1977). These internal dynamics shape behavior within each distinct patch of the quilt, fostering cohesion through shared, localized understandings of appropriate conduct

    […]

    A key insight we can draw then is that what holds humans together despite profound disagreements is not value convergence, but practical mechanisms for coexistence—which we see as social technologies

    There is an idea that sometimes comes up that disagreements between good, reasonable people can be traced to misunderstandings or disagreements about the likelihood of different outcomes; if you can align on those, you’ll come to the same conclusions. This encourages some focus in AI alignment on finding the right, true principles, creating the best truth-seeking model possible, and then assuming downstream that this will result in strong alignment. The paper challenges this assumption.

    They also call out collective action problems in implementing such a framework, particularly start-up and free-rider problems:

    Even seemingly universal goods like “survival” are embedded in thick cultural contexts that shape their meaning and priority (in fact many cultures prioritize sacred values above survival e.g. Ginges et al. (2007)). In general, mobilizing global action and resources towards any specific AI safety strategy will inevitably confront deep-seated disagreements rooted in different values, priorities, and worldviews regarding the nature of AI risks, the efficacy or fairness of proposed initial strategies, and the equitable distribution of upfront costs and responsibilities.

    Their approach calls for focusing on four areas:

    1. Contextual grounding: broader understanding of not just the conversation but the environmental context the agents are operating in.
    2. Community customization: different norms for different communities.
    3. Continual adaptation: updating the understanding of appropriate behavior based on ongoing feedback. They suggest continuous training for this.
    4. Polycentric governance: distributed decision making with multiple overlapping centers of authority.

    If you read this list into a general “helpful agent” context instead of alignment, I don’t think it would be controversial: these seem like good ideas!

    That said, I think a lot of this boils down to the last one. Getting governance structures right is hard, in any context, and I interpret a key part of the aspiration here as having “checks and balances” that can represent varied interests. Not an easy problem to solve!

    Some might worry that our patchwork approach embraces a troubling relativism, but this misunderstands the quilt we’re describing. Just as a quilt’s structural integrity depends on solid stitching techniques regardless of pattern diversity, our appropriateness framework recognizes that while the specific content of norms varies across contexts, the social technologies that facilitate coexistence—the mechanisms of learning, adaptation, and conflict resolution—can be studied with rigor and implemented with care.

  • Toward a Theory of Tokenization in LLMs

    [2404.08335] Toward a Theory of Tokenization in LLMs

    Tokenization has always struck me as one of the odder aspects of natural language deep learning. Despite the extensive end-to-end learning processes we typically use, tokenization initially involves creating a dictionary of optimal sub-word segments from your dataset. One of the appealing concepts in the Byte Latent Transformers paper is the potential to learn tokenization dynamically, recognizing that tokenizers solve deeper problems than merely providing a fixed vocabulary.

    This paper addresses tokenization from a theoretical perspective by modeling sequences using kth-order Markov processes, where the likelihood of each token depends on the preceding sequence, as in natural language. The parameter k corresponds to the model’s context window size. Key findings include:

    1. Training without tokenization leads models to effectively behave as unigram predictors, significantly limiting performance.
    2. Using a well-designed tokenizer (e.g., Byte Pair Encoding – BPE) enables models to achieve nearly optimal performance in capturing sequence dependencies.
    3. Increasing the tokenizer’s dictionary size improves the model’s performance, moving it closer to the ideal probability distribution.

    Tokenizers which do a good job at learning patterns in the data and assigning these frequent patterns as tokens in the dictionary are compatible with an i.i.d. model over tokens.

    This insight suggests that despite the complexity of natural language, a good tokenizer converts sequences into something approximating an independent and identically distributed (i.i.d.) format, which brings the modeling task for transformers closer to one they can solve well.
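    A toy illustration of the mechanism: one BPE merge step promotes the most frequent adjacent pair to a single token, so the strongest local dependency in the stream becomes atomic (a minimal sketch, not a full BPE trainer):

    ```python
    from collections import Counter

    def bpe_merge_step(tokens):
        """One BPE step: count adjacent pairs and merge the most frequent
        pair into a single new token. Frequent patterns become atomic, so
        the resulting token stream looks closer to i.i.d. draws."""
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            return tokens, None
        (a, b), _ = pairs.most_common(1)[0]
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(a + b)  # absorb the dependency into one token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        return merged, (a, b)
    ```

    On a stream like “abababc”, the predictable a→b transition disappears into an “ab” token after one merge, which is the Markov-dependency-absorbing effect the paper formalizes.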

    While the paper does not explicitly explore the Byte Latent approach, I wonder if its entropy-driven dynamic token allocation might similarly achieve this i.i.d. simplification. In BLT, the entropy model, trained separately, could dynamically transform inputs into a distribution that is more palatable for transformers.

  • Darwin Gödel Machines

    https://open.substack.com/pub/gonzoml/p/darwin-godel-machine

    Interesting paper breakdown on Gonzo ML of another evolutionary agent approach from the extended Sakanaverse.

    It commences with an initial coding agent, constructed upon a frozen foundation model (FM) equipped with tool-use capabilities (e.g. running bash commands, editing files). In each cycle, “parent” agents are selected from the expanding archive. This selection process prioritizes agents based on a combination of their performance (assigning greater weight to higher scores, scaled by sigmoid) and a novelty bonus (inversely correlated with the number of offspring they have already produced, thereby encouraging exploration of less-frequented paths).

    The actual foundation model is a frozen component, so much like AlphaEvolve this is a search set up on top of the model’s intelligence. The search evolves the agent code itself to try and do better on benchmarks.
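    A toy sketch of the parent-selection weighting described in the quote above; the sigmoid-times-novelty combination follows that description, but the exact functional forms here are my guesses, not the paper’s:

    ```python
    import math

    def parent_weights(archive):
        """Toy weighting: sigmoid of score (rewards performance) times
        1/(1 + children) (rewards less-explored agents), normalized to
        a probability distribution over the archive."""
        ws = []
        for agent in archive:
            perf = 1.0 / (1.0 + math.exp(-agent["score"]))  # sigmoid-scaled score
            novelty = 1.0 / (1.0 + agent["children"])       # fewer offspring → bigger bonus
            ws.append(perf * novelty)
        total = sum(ws)
        return [w / total for w in ws]
    ```

    Two agents with identical scores but different offspring counts end up with selection probabilities skewed toward the less-explored one, which is the exploration pressure the archive needs.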

    Qualitatively, the DGM learned to enhance its own tools and workflows. For instance, it developed more granular file editing capabilities (e.g., string replacement), improved long-context window management (e.g., auto-summarizing prior interactions), and refined its problem-solving strategies (e.g., making multiple attempts at a solution and using another FM to evaluate patches). These discovered improvements also demonstrated generalizability, transferring benefits across different underlying FMs and programming languages.

    When it comes to coding agents I had been thinking there were three axes of performance that gate the overall effectiveness, but the paper makes it clear there are at least four:

    1. The foundation model itself, with its base coding, tool use, and reasoning abilities and context window size.
    2. The tools it has available: the more a tool exposes underlying semantics, the more efficiently the model can use it.
    3. The UI: how the user interacts with the agent to direct it, provide clarity, and review work.
    4. The prompt: strategies for problem solving and how the context window is managed (e.g. when to summarize).

    In this case the UI is held fixed (an outer eval loop), the model is fixed, and the search explores tools and strategies. At the very least, it seems a search that also ranged across multiple candidate models might work well!

  • Scientific discovery and AI

    I got fooled by AI-for-science hype—here’s what it taught me

    LinkedIn-esque headline but a good guest post on Tim Lee’s Understanding AI newsletter by physicist Nick McGreivy.

    Later research found that scientists who use AI are more likely to publish top-cited papers and receive on average three times as many citations. With such strong incentives to use AI, it isn’t surprising that so many scientists are doing so.

    So even when AI achieves genuinely impressive results in science, that doesn’t mean that AI has done something useful for science. More often, it reflects only the potential of AI to be useful down the road.

    The problem Nick describes, where PDE solving (the area he was looking into) accumulated a lot of techniques that didn’t end up improving on non-ML approaches, feels very common. AI research likes to hill-climb metrics. It’s often the lack of progress on a certain benchmark that motivates new techniques, like the growth of test-time compute over the past year to drive math and logic performance higher.

    It brings to mind Zhengdong Wang’s fantastic year-in-review letter from last year.

    The model does the eval is the backbone of how one should assess and marshal their intuitions into a coherent view on AI progress.

    The first awesome conclusion of the model does the eval is that we will achieve every evaluation we can state. Recall that evaluations must be legible, fast, and either a good approximation of a wanted capability or useful itself. The plummeting cost of compute has made all evaluations faster.

    […]

    Add human intelligence to direct the cheaper compute to get more legible evaluations. Two years ago, Demis Hassabis enumerated three properties of problems suitable for AI: a massive combinatorial search space, a clear objective function to optimize against, and lots of data or an efficient simulator.

    We tend to succeed where we have the evals and we have the data. Having the evals also starts to create a common lingua franca for discussing relative performance, not that it eliminates the baseline hacking Nick discusses.

    The evals are often tied to having a good-quality core dataset that can be used for both training and evaluation. Even in areas where we have had scientific progress, mainly AlphaFold and descendants, as Derek Lowe often writes, we had a major leg-up from the existence of the PDB, an extensive database of high-quality protein structures created by people.

    When we look back at major breakthroughs, we often credit that aspect: Dr Fei-Fei Li is one of the pioneers of deep learning thanks in part to the creation of ImageNet. I hope that one takeaway of scientists reading Nick’s note is that the creation of quality benchmarks and datasets can drive more progress than the application of (or innovation on) new ML techniques themselves!