Recently I had a conversation with an infrastructure team supporting an ML modeling group. The two orgs used to collaborate to ship experiments: the modeling team would come up with ideas, the infra team would augment their frameworks and build out tooling to make those ideas scalable. Together, they would ship an experiment every couple of weeks. Now the modeling team is largely making the framework changes and performance improvements themselves, thanks to coding agents, and are shipping a few experiments every single week. The infra team are still busy, but they are firefighting and debugging when the agents get stuck. The modeling team are much more productive, undeniably, and all the humans are busy, but the work for the infra team has ended up somewhat worse.
If you are a tech CEO who has recently returned to coding, you could look at the team doing the lower-scale firefighting and think “do I need these people?” If you keep taking that question to its conclusion you eventually ask… do I need anyone to do anything at all?
This question, helpfully, predates the term AI 1: Back in the ’30s, Coase wrote his theory of the firm on why companies do some things in-house, and buy others from the market. For a brief period in the early 00s it looked like software jobs would go to the market, thanks to outsourcing. This largely didn’t happen, because, as Coase predicted, specifying a project is tough. Creating software is an iterative process; you don’t know exactly what you’re building until you start, so you need people with technical taste to be making decisions in a consistent way.
There are a lot of Steve Jobs stories with this flavor. For one, Jobs wasn’t happy with the jiggling when holding down icons on the iPhone to remove them. The team built a UI with sliders so he could adjust the jiggle rate until satisfied. Once perfected, copying it was easy, but assembling a group that cares about those kinds of details is hard.
One way to find those people is to train them. Gary Becker wrote about human capital back in the 60s, and in his framing some training imparts skills which are marketable; some which are firm-specific. Companies will pay for firm-specific training but are less keen on paying for marketable skills because rivals can free-ride on it by poaching employees once they are trained.
“If Company A invests time and money to turn a raw college graduate into an expert, Company B can hire that person after five years of experience for a higher salary, collecting the benefits of skills Company A paid to build. In the past, firms tolerated this risk because juniors were producing valuable work along the way. Without that value, the economic foundation of apprenticeship collapses entirely.”
This shows up in the L3-L5 progression in big tech companies. I’ve seen many hiring managers hesitate to hire an “industry four”, as they don’t yet have the rounded, marketable skills the manager wants. But within the companies they have (effectively) apprenticed at, L4s contribute a huge amount of value. Is AI blowing up that trade-off?
The author of the AI Becker note, Luis Garicano, recently put out another paper on AI disruption asking when AI actually displaces jobs. In Garicano’s framing jobs are bundles of tasks and responsibilities; AI’s impact depends on how tightly these tasks are tied together.
“In a strong bundle, breaking the job destroys enough value that the job survives as a whole: AI assists but the human still sells the full service and retains a large share of revenue. In a weak bundle, the cost of splitting is small: AI replaces some tasks, the human role narrows, and the labor share falls.”
Software engineering involves writing code, operating services, decomposing problems, and aligning with others (both project-wise and culturally). Current AI coding agents attack part of this bundle, but humans comparatively excel at social dynamics and maintaining the larger world view necessary to know which problems to focus on.
At the senior levels the ties seem strong: you can take the coding and task breakdown out of it, but that wasn’t the main thrust of your L7-9 engineers anyway. At less vaunted levels, companies will need many fewer software engineers to churn out code than they have doing it now. But as the ML infra example earlier showed, that doesn’t necessarily mean you don’t need some of the other things they can do.
This opens a risk for the business: if you need senior folks but don’t have enough valuable work to justify training them yourself, you are stuck paying market rate for increasingly rare talent. Right now if you happen to have, say, scaled LLM post-training experience you can command a very significant salary. Or just start your own company.
Hiring is hard even for deep-pocketed executives when key skills are firm-specific rather than marketable. Apple can’t go out and hire the kind of people with the taste it develops internally (generally). But how much are firms willing to roll the dice on developing the next Jeff Dean, and how much are they willing to risk someone else hiring them away?
For a similar dynamic, look at investing. Over the past decades, much of the junior analyst work that undergirded investment firms has been replaced by automation. The structure that emerged was the pod shop, or more formally a multi-strategy hedge fund. They operate more like a platform that hosts “pods”, each led by a portfolio manager who is supported by analysts, data scientists and traders. Each pod has its own domain of speciality, and its own profit and loss. The firm centrally manages risk and allocates capital to pods. Successful portfolio managers earn a healthy percentage of the profits they generate, while unsuccessful pods are taken out behind the woodshed and shot. This both gives a talent development pipeline and a rigorous performance standard, albeit not a very collaborative one.
This works, in part, because there is a very clear score card, measured in dollars. We might be able to copy the structure in engineering teams, but actually evaluating how well things are going is hard!
Firms that have the highest dependence on people you can’t easily hire are exactly the ones who are at risk of struggling in this transition: they have the most need to grow their own people, and the least economic reason to do so. Apple can’t buy another Apple, and neither can anyone else.
Back when people still used the term Cybernetics. AI researcher drama is literally as old as AI. ↩︎
When you have enough AI, what do programmers… do? When it was smart autocomplete (e.g. Copilot), that was pretty clear: everything! The AI handles some typing. When it was interactive IDEs (e.g. Cursor) it was still a lot: pair programming, designing, writing the hardest parts. Now that it’s an independent agent (e.g. Claude Code), it’s guiding, reviewing code, setting guardrails.
But, you know, we want to move faster than that! That means either we have the agent running in a loop without needing us, or we have lots of agents doing things at the same time1. Or both.
Throwing agents at a problem doesn’t automatically solve it2. Which leads back to the question: “what do we do?” The answer seems to be not so much being in the loop but designing the loop itself.
The most viral agent loop right now is Karpathy’s Autoresearch, which finds verifiable training improvements to his nanochat project. Running Autoresearch is straightforward: a human writes program.md with workflow guidelines, and the agent runs in a loop trying ideas. Karpathy’s workflow allocates a fixed compute budget and constrains edits to a single training file to ensure the experiments are valid3. The agent generates ideas, verifies them, then refines: keeping the new baseline and discarding failed ideas.
While Karpathy’s agent-in-a-loop is responsible for both generating and implementing ideas, PyTorch’s KernelAgent4 goes multi-agent, giving each agent specialized roles and toolsets for improving GPU kernel performance. A profiling worker identifies opportunities, an analyzer agent suggests potential fixes, and so on. The actual execution is best-of-N sampling as an agent loop: it spawns N workers, lets them race, then plans a strategy for the next round.
“Optimization agents reflect on what succeeded and failed in each round, summarizing insights into a shared memory that guides subsequent iterations and prevents repeated dead ends.”
The pattern that seems to work is to set up agents in a generate-verify-refine loop, following a pre-defined work approach, with guardrails. If you need more parallelism, add multiple agents, but keep state central to avoid communication overhead.
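That pattern can be sketched in a few lines. This is a toy loop with stand-in generate/verify functions of my own, not Autoresearch itself: propose a candidate, verify it against a score, keep it as the new baseline only if it improves, and discard failures.

```python
import random

def generate(baseline):
    # propose an "idea": a small perturbation of the current baseline
    return baseline + random.uniform(-1.0, 1.0)

def verify(candidate):
    # stand-in verification signal: lower loss is better
    return (candidate - 3.0) ** 2

def refine(budget=200, seed=0):
    """Generate-verify-refine with a fixed compute budget: keep a
    candidate only if the verifier confirms it beats the baseline."""
    random.seed(seed)
    baseline = 0.0
    best_loss = verify(baseline)
    for _ in range(budget):
        candidate = generate(baseline)
        loss = verify(candidate)
        if loss < best_loss:   # verified improvement becomes the new baseline
            baseline, best_loss = candidate, loss
        # failed ideas are simply discarded
    return baseline
```

With the toy verifier above, refine() walks the baseline toward the optimum at 3.0; the point is the shape of the loop, not the optimizer.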
An example of the latter is OpenAI’s Symphony. This moves state into a task tracker then spawns5 individual codex agents with a fixed budget of iterations. Individual agents write back to the tracker to save state. This type of agent usage is also known as a “Ralph loop”: agents that start fresh for each iteration of the loop, with necessary context injected each time rather than accumulating organically in the context window.
Much like with Karpathy’s program.md you “program” the WORKFLOW.md with how you want the loop to run, then it executes autonomously.
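A minimal sketch of the Ralph-loop shape (run_agent here is a made-up stand-in for spawning a codex agent, not Symphony’s actual API): each iteration starts a fresh agent, and all state lives in a central tracker.

```python
def run_agent(task, notes):
    # made-up stand-in for a fresh agent: it sees only the task plus
    # the injected notes, never an accumulated context window
    return f"done:{task}", f"note about {task}"

def ralph_loop(tasks, budget=10):
    """Each iteration starts a fresh agent; all state lives in the tracker."""
    tracker = {"todo": list(tasks), "done": [], "notes": []}
    for _ in range(budget):             # fixed iteration budget
        if not tracker["todo"]:
            break
        task = tracker["todo"].pop(0)
        result, note = run_agent(task, tracker["notes"])
        tracker["done"].append(result)  # agents write state back
        tracker["notes"].append(note)   # injected into the next iteration
    return tracker
```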
Designing the workflow feels like a genuinely different skill. It’s not writing the code, the agent does that. It’s not specifying the solution either; in many cases the agent does that too! It’s about designing an approach: how can the agent make progress with each turn of the crank? How can the environment give clean validation signal to the agent about its approach? Not easy, and not quite what we used to do either.
AKA agent teams, swarms, or whichever Mad Max movie Yegge is on today. Google’s new “Towards a Science of Scaling Agent Systems” paper is not keen on multi-agent systems though: “on tasks requiring strict sequential reasoning […] every multi-agent variant we tested degraded performance by 39-70%”. ↩︎
Condolences if your executives are currently pushing that as company strategy A. ↩︎
Mostly: it did engage in a bit of seed hacking, so has achieved postgrad status successfully. ↩︎
The FlashAttention 4 paper is out and is fascinating, you should read it! One of the things that Tri called out on Twitter was that the experience of using a Python-based language (CuteDSL) significantly improved the dev loop, not just for him, but for Claude:
Claude / Codex also have an easier time writing some components of FA4 thanks to the fast compile time. I got Claude to debug a deadlock when we first implemented 2CTA fwd. It ran autonomously overnight for 6 hours, figured out part of the fix, but then went down a rabbit hole…
CuTe’s layout algebra plus the quick iteration cycle of a Python DSL are a nice combination. Hence, it’s not too surprising that late last month, AMD dropped FlyDSL, which is, largely, CuteDSL for AMD. This is not a knock on FlyDSL! The project is very open about acknowledging CuTe and its provenance.
To help navigate, here is a handy translation guide:
FlyDSL also calls out Colfax’s paper from earlier this year: Categorical Foundations for CuTe Layouts. This paper, along with the Integer Set Relations one from Nvidia last year, really started to establish a mathematical formalization of what had been going on in CuTe layouts. This kind of foundation enables verifying the approaches taken in fresh implementations, like FlyDSL’s.
We can actually go see that, as the whole compiler is open source. This lets you compare the composition_impl in FlyDSL to the diagrammatic version in section 4.1.3 of the Colfax paper to understand why it works!1
Given the blistering pace of layout algebra, we shouldn’t be surprised that just a few days after, Cris Cecka of Nvidia dropped a beastly preprint: CuTe Layout Representation and Algebra:
Colfax Research [19] analyzes CUTE layouts and some operations on them in the context of category theory. In this paper, we intend to provide a more definitive and formal treatment of CUTE concepts and their applications.
Sometimes with this kind of thing it doesn’t matter who had the idea; it’s often a specific implementation that ends up defining the standard. I read this paper as Cecka planting his flag and saying “y’all, this is the real CuTe”. And he cuts no corners.2
Cecka reframes layout algebra as a theory of loop transformations, showing that the objects you are transforming (shapes and strides) and the things you are transforming them with (shapes and strides) are the same.3
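As a toy mental model of the basic case (my illustration, not CuTe’s actual implementation): a layout is a (shape, stride) pair that maps a logical coordinate to a linear index by a dot product with the strides.

```python
def layout_index(coord, shape, stride):
    """Map a logical coordinate to a linear index: sum(c_i * d_i)."""
    assert all(0 <= c < s for c, s in zip(coord, shape))
    return sum(c * d for c, d in zip(coord, stride))

# a 4x8 row-major tile has stride (8, 1); column-major would be (1, 4)
row_major = layout_index((2, 3), (4, 8), (8, 1))   # 2*8 + 3*1 = 19
col_major = layout_index((2, 3), (4, 8), (1, 4))   # 2*1 + 3*4 = 14
```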
One of the cleverest results of this is in Section 2.3.1. Cecka demonstrates that strides don’t have to be just… regular strides. If your stride is in fact a coordinate then each “step” in the stride moves in the coordinate dimension.
This is, for example, what you need for TMA on Hopper or Blackwell: you tell it the logical position in the tensor and it figures out the physical address internally, handling tiling, swizzling and bank conflict avoidance in hardware. If you stride over coordinates, you can use exactly the same layout algebra as for your computations.
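To make the coordinate-stride idea concrete, here is a hypothetical toy where each stride is itself a 2-D coordinate, so a step in a mode moves the output position by that coordinate rather than by a scalar offset. This is a sketch of the flavor of Section 2.3.1, not Cecka’s construction.

```python
def coord_layout(coord, stride_coords):
    """Each stride is itself a 2-D coordinate: a step in mode i
    moves the output position by stride_coords[i]."""
    x, y = 0, 0
    for c, (dx, dy) in zip(coord, stride_coords):
        x, y = x + c * dx, y + c * dy
    return (x, y)

# identity-ish layout: mode 0 steps rows, mode 1 steps columns
coord_layout((2, 3), [(1, 0), (0, 1)])   # -> (2, 3)
# swap the roles and scale: mode 0 steps columns by 2, mode 1 steps rows
coord_layout((2, 3), [(0, 2), (1, 0)])   # -> (3, 4)
```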
Another example: if a stride is a bitmask, you get something very like Triton’s LinearLayouts!3 That lets you compose layouts with swizzling, using the same composition operators again.
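A toy sketch of the bitmask flavor, loosely modeled on how linear layouts over F₂ behave (hedged: linear_layout_index and its basis representation are my illustration, not Triton’s API): each set bit of the input coordinate selects a basis vector, and the output index is the XOR of the selected vectors.

```python
def linear_layout_index(coord, bases):
    """Each set bit of the input coordinate selects a basis vector;
    the output index is the XOR of the selected vectors (F2 algebra)."""
    out = 0
    for i, basis in enumerate(bases):
        if (coord >> i) & 1:
            out ^= basis
    return out

# identity bases reproduce the input
linear_layout_index(5, [1, 2, 4])   # bits 0 and 2 set: 1 ^ 4 = 5
# perturb a basis vector to get a swizzled layout
linear_layout_index(5, [1, 2, 6])   # 1 ^ 6 = 7
```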
The paper is full of these interesting, but also practical, results. Cecka gives guidance on optimizations like “avoid ranged slicing” (e.g. a[2:4, 1:3]), as it mixes up tile size (an optimization knob) and thread ID (a runtime index)4, or using layouts to algebraically work out how to auto-vectorize loads and stores rather than hard-coding them5.
There is something satisfying about a paper on composition that itself pulls together ideas floating around CUTLASS internals, preprints, and alternative implementations, then shows they are all views of the same object. This will help projects like FlyDSL, Triton, and any number of other authoring libraries ground their handling of one of the most painful aspects of kernel dev, in a way that should make life easier for everyone.
I think! My understanding of category theory is similar to my understanding of Skibidi Toilet: I get the idea, but I have so many questions. ↩︎
As an example, Cecka provides a wider generalization than the Colfax paper, demonstrating that CuTe layouts are not strictly closed under group composition: you can’t always compose layouts however you want. But! The failures correspond to real errors, which is the kind of restriction you actually want. ↩︎
Actually, you do a bit better: being strictly in F₂ means Linear Layouts are limited to powers of 2, which it turns out is a bit limiting. ↩︎
This makes it harder for compilers to separate static and dynamic elements. CuTe, and Fly, do this in two stages: zipped_divide to tile, then slice by a dynamic bid, allowing the compiler to optimize (e.g. constant-fold) the static tile parameter. ↩︎
By composing the layout with the right-inverse of the other, apparently! Or calling max_common_vector(src_layout, dst_layout)↩︎
The current vibes in software engineering are a mix of crushing despair at years of accumulated personal skills being displaced by the CEO prompting some stuff, and crushing despair at years of corporate investment in an existing codebase that isn’t vibe-y enough. People worry whether the models will be effective in their programming language of choice, not on some general benchmarks.
One angle to approach that is to ask how well the language is covered by the distribution of the training data1. An interesting paper the other day gave a pretty clear idea of how to check: 1-shot some prompts against the base model and see if they ever get it right. Getting access to base models is not always possible, but you can certainly call the post-trained models with roughly the same idea: no tools, no iterations, just generate this program.
To try this, I2 wrote up 20 project-euler like3 puzzles of varying difficulties and had a few different models YOLO solutions in several languages. These ranged from common ones like Python to fairly rare ones like Zig and Hack.
After validating all the solutions, we can calculate some stats using pass@k: in k trials, how often did the model solve the problem? Here are the stats for pass@1: what percentage of the time you can expect the model to one-shot the solution:
| Lang | GPT-4.1 Mini | Gemini 3 Flash | OLMo 3.1 | Kimi K2.5 | GLM-5 |
| --- | --- | --- | --- | --- | --- |
| Python | .93 | .99 | .72 | .97 | .98 |
| TypeScript | .94 | 1.00 | .43 | .95 | .95 |
| Go | .95 | .91 | .46 | .86 | .86 |
| Rust | .89 | .94 | .43 | .95 | .95 |
| Kotlin | .90 | .99 | .29 | .91 | .93 |
| OCaml | .76 | .86 | .08 | .94 | .90 |
| Zig | .14 | .55 | .00 | .79 | .88 |
| Hack | .43 | .76 | .05 | .47 | .68 |
And here is the same thing for pass@128: what is the chance it is right at least once in 128 samples:
| Lang | GPT-4.1 Mini | Gemini 3 Flash | OLMo 3.1 | Kimi K2.5 | GLM-5 |
| --- | --- | --- | --- | --- | --- |
| Python | 1.00 | 1.00 | .95 | 1.00 | 1.00 |
| TypeScript | 1.00 | 1.00 | .90 | 1.00 | 1.00 |
| Go | 1.00 | 1.00 | .85 | 1.00 | 1.00 |
| Rust | .95 | 1.00 | .88 | 1.00 | 1.00 |
| Kotlin | 1.00 | 1.00 | .59 | 1.00 | 1.00 |
| OCaml | .98 | 1.00 | .38 | 1.00 | 1.00 |
| Zig | .49 | 1.00 | .05 | 1.00 | 1.00 |
| Hack | .99 | 1.00 | .46 | 1.00 | 1.00 |
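These numbers come from the standard unbiased pass@k estimator (from the HumanEval/Codex line of work): given n samples per problem, of which c passed, estimate the chance that at least one of k draws is correct.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: the probability that at least one of
    k samples drawn (without replacement) from n is correct, given that
    c of the n samples were correct."""
    if n - c < k:
        return 1.0          # too few failures to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

pass_at_k(128, 64, 1)   # 0.5: half the single-shot samples passed
```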
To make that a bit more visual, here is a per-language chart for GPT-4.1-mini:
Given enough chances, GPT-4.1 Mini solves all the problems in almost all the languages. Of course, we don’t actually know what GPT-4.1 was trained on, but we do know what OLMo 3.1 was trained on, thanks to the wonderful folks at AI2. That means we can see how much code-specific data there was for each language4:
| Language | Code Corpus (GB) | Est. Tokens (B) | Category |
| --- | --- | --- | --- |
| Python | 60.40 | 17.3 | High-resource |
| TypeScript | 26.52 | 7.6 | High-resource |
| Go | 23.78 | 6.8 | High-resource |
| Rust | 9.11 | 2.6 | Medium-resource |
| Kotlin | 5.68 | 1.6 | Medium-resource |
| OCaml | 1.03 | 0.29 | Low-resource |
| Zig | 0.18 | 0.05 | Low-resource |
| Hack | 0.00 | 0.00 | Very-low-resource |
There is a pretty decent correlation between the presence of training data and the pass@k rates. But, importantly, it’s not 1: despite Hack having no StarCoder data and Zig a negligible amount, the model clearly does know at least something about them. Given enough chances it has a decent chance of coming up with the correct answer for Hack, and a non-zero one for Zig:
We have seen for human language that models learn a language substrate, enabling them to perform strongly even on tasks they haven’t seen such as translating between unseen language pairs. I suspect something similar happens with code: despite the language differences there is a logical programming substrate, and the model doesn’t need much exposure to the language in order to generalize to it.
Once you start giving the model multiple attempts, it gets into the right region quickly for the high-resource languages: with GPT-4.1 Mini, Python, TypeScript, Go and Kotlin saturate at k=10. The less-common languages continue to rise: the model can write valid OCaml or Zig or Hack, but needs more attempts to stumble into the right region.
Thinking models flatten the curve substantially. Kimi K2.5 and GLM 5 both use high effort by default5, and that appears to give them multiple bites at the apple from internally exploring and self-correcting. By k=10 the models saturate all problems on all languages, though at the cost of a remarkable number of tokens6!
It’s also instructive to see the ways in which the models get it wrong. Four patterns showed up:
Ecosystem: One problem involved a sum of very large digits. GPT-4.1 Mini regularly used num::BigUint. This is a crate, not a standard language feature, and in an agentic loop would probably be a very valid choice but doesn’t strictly work. In contrast, GLM-5, a thinking model, implements digit-by-digit multiplication from scratch with Vec<u32>.
API confusion: The model knows roughly what the code should look like, but chooses the wrong API. For example, OLMo generated while ... do ... in, mixing OCaml’s while...do...done loop with Haskell’s do notation and OCaml’s let...in binding.
Surface-form invention: The model has a sense of how things stylistically look in the language, but doesn’t know the real API. GLM occasionally writes Zig with invented functions: std.mem.Allocator.alloc(usize, limit) (Allocator is a type, not a callable) or @intCast(usize, limit), which was actually valid syntax in earlier Zig versions.
Systematic convention gaps: Models would regularly put in <?hh for the Hack samples, which broke in modern Hack.
My takeaway from this is that models learn to code, not just to reproduce syntax. That means you can almost certainly post-train or prompt your way out of most programming language problems with any frontier model: while some models were still pretty poor at Zig even with a lot of tries, Gemini most certainly was not. I doubt the folks at GDM spent a whole lot of time on Zig evals7.
A well pre-trained model has broad capabilities in programming, and it’s mostly a case of eliciting them rather than having to teach them.
I’m going to take as a given that models are good at generalizing within the distribution of their training data, and poor at generalizing outside it. This is not settled! Reasonable people can disagree! But, it’s a decent starting point. ↩︎
Not actually project Euler. I confirmed that the models never respond with an actual Euler puzzle answer in the incorrect ones, so I’m fairly (this is not good science) sure it wasn’t memorization. ↩︎
OLMo’s full training corpus (Dolma v1.7) includes a massive web crawl in addition to code-specific data from StarCoder, so the 0.00 GB for Hack means “absent from code-specific training data”, not “absent from all training data”. Hack documentation and other content are almost certainly present in the web-crawl portion. ↩︎
Gemini also reasons, but the 2.5 Flash model was doing minimal reasoning when answering. ↩︎
Somehow averaging over 3k per sample for GLM, I say while ruefully staring at my OpenRouter bill. ↩︎
By posting this on the internet I am guaranteed to be corrected, at length, by a Googler ↩︎
There are a lot of things folks do on GPUs (including, sometimes, graphics), so I have an approximately-correct taxonomy of operations to group them into:
1. Dense compute: A matmul or a convolution.
2. Map: Elementwise/pointwise work on each value of a tensor.
3. Reduce: Process a tensor into fewer dimensions, like a sum.
4. Transforms: Change data structure. Easy ones like transpose, annoying ones like scatter/gather.
5. Synchronize / Communicate: Move data, or wait for it (copies, collectives, fences/barriers).
At the moment people are pouring billions of dollars into hardware that primarily does 1. And, at the same time, many of the greatest minds of our generation are attempting to ensure that the hardware spends as much time doing 1 as possible.
The biggest barrier to doing a lot of dense compute is 5: moving things in and out of memory. Launching kernels, transferring data between host and device (or between devices), moving data between global memory and registers and so on. It’s like there’s a box defined by Data × Footprint1 × Time, and everyone is trying to keep it as full as possible.
This involves tradeoffs. You want to mul as many mats as you can, but you only have so much room to store accumulators. Fetching new data from memory also takes a while. You can keep many in-flight fetches around, but each one expands the kernel Footprint, lowering occupancy.
There are 3 tricks that we can use to help fill up the box by stitching different operations together:
Epilogue fusion: take an elementwise op and fuse it onto the end of a dense op, so that when the MMA produces output, the elementwise op can be run while the output data is still in registers. A classic example: fuse the activation after the dense compute in a feed-forward net.
Vertical fusion: take two subsequent operations and chain them together to avoid running a loop for one, writing it back, then running a loop for the other2. A classic example is Fused LayerNorm: normally you’d need to add elements, then collect stats for the normalization. You can fuse the two to collect the stats as you add the residual.
Horizontal fusion: doing different things over the same data, in parallel. The Q, K, and V projections in a transformer all need the exact same input, so are good candidates to fuse horizontally.
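As a toy illustration of vertical fusion (plain Python standing in for kernel loops, not real GPU code): the unfused version traverses the data twice, while the fused version collects the normalization stats during the residual add, a single traversal.

```python
def unfused(x, residual):
    y = [a + b for a, b in zip(x, residual)]        # loop 1: residual add
    mean = sum(y) / len(y)                          # loop 2: stats
    var = sum((v - mean) ** 2 for v in y) / len(y)
    return y, mean, var

def fused(x, residual):
    # one traversal: accumulate sum and sum-of-squares while adding
    y, s, sq = [], 0.0, 0.0
    for a, b in zip(x, residual):
        v = a + b
        y.append(v)
        s += v
        sq += v * v
    mean = s / len(y)
    var = sq / len(y) - mean * mean                 # E[v^2] - E[v]^2
    return y, mean, var
```

In a real kernel the win is not the loop count but avoiding the extra trip through global memory between the two passes.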
You rely on the design of the hardware to enable some of this. For example, an epilogue fusion is beneficial because it’s one kernel launch instead of two, and because the work doesn’t need to be written back to global memory, but also because the epilogue can overlap with other work.
It’s not always obvious how to put these fusions together. Flash Attention was such a breakthrough because it made dense-op fusion possible. The naive attention block has a softmax in the middle: Softmax(QK^T / √d) · V. That softmax is a reduction op, which means it needs all of QK^T, a pretty large matrix, to be computed first. Tri Dao and colleagues realized that if you used online softmax you could calculate the softmax for subsets of the QK^T matrix, and avoid materializing the whole thing. That enabled fusing the QK^T matmul, the softmax, and the multiplication by V into one kernel, at the tile level.
Tiles are the subsection of a matrix you’re working on at any given time. In a matmul, tiles from both input matrices are loaded and multiplied, to produce an output tile. There’s a useful image of this in the Nvidia blog post on cuTile, Nvidia’s most recent entrant into the kernel-development landscape. To side-step concerns of plagiarism, I had nanobanana plagiarize it for me:
cuTile is built on a well-specified intermediate representation called TileIR. There’s an experimental backend for Triton that lowers to TileIR too. While Triton is block-oriented rather than tile-oriented, in practice what you mostly work on in a thread-block is… a tile. TileIR elevates the tile to a first-class concept.
You can see this by generating the same kernel against the regular backend and the TileIR backend. Triton’s intermediate representation (TTIR) uses pointer arithmetic: generating offsets, computing masks, loading from explicit addresses. Here’s a bit of an inner loop of a matmul. It groups up which data it wants, loads the tiles a and b by pointer, and computes the dot product:
TileIR on the other hand preserves the tile as a semantic object. This snippet is doing exactly the same thing, but this representation elides the pointer math and masking:
This is a nice IR (compact!), but from my perspective the most interesting part is that load function: tile_load_token_ordered.
The “time” dimension of the Data × Footprint × Time box is the hardest one to manage. Time questions separate performant kernels from slow ones: when to prefetch, how to overlap loads, and so on. Since the advent of warp specialization, the Triton compiler has been exploring pipelining options through heuristics and autotuning, and kernel engineers have been going straight to the hardware with explicit barrier API extensions like TLX3 and Gluon4.
TileIR goes a somewhat different route. It assumes an unordered memory model: the order your code is written in does not determine when data is actually available. Instead, each memory operation returns a token and you attach semantics to it: read-only, write-only, read-write and so on.
By being explicit about memory dependencies you give the compiler freedom to manage the Time dimension. Where accesses don’t overlap the compiler can freely reorder them. Where they do, the token chain tells the compiler exactly what depends on what. The kernel expresses intent; the compiler maps that to the hardware.
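A toy model of the token idea (my illustration, not TileIR’s actual semantics): treat each memory op’s token as an edge in a dependency graph, and any topological order of the ops is a legal schedule for the compiler to pick from.

```python
from itertools import permutations

def legal_orders(ops, deps):
    """All instruction orders consistent with the token dependencies."""
    orders = []
    for perm in permutations(ops):
        pos = {op: i for i, op in enumerate(perm)}
        if all(pos[a] < pos[b] for a, b in deps):
            orders.append(perm)
    return orders

# two independent loads and a store that consumes load_a's token:
ops = ["load_a", "load_b", "store_c"]
deps = [("load_a", "store_c")]   # store_c waits only on load_a's token
# load_b can be scheduled anywhere; 3 of the 6 orders are legal
```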
TileIR is (mostly) targeting Blackwell right now, and the experimental backend is still early. The open question is whether we can express this smoothly enough in the syntax of kernels to actually enable taking the same kernel across hardware, or whether we are just adding syntactic sugar on top of hardware-specific tuning.
That said, the idea feels pretty right? The tile is the unit with which we can express what we actually mean about memory, ordering, and fusion. The CUDA programming model was always about bounded-linearity within a massively parallel framework, and this loosens the bounds that little bit more.
Use of available registers and shared memory, for example ↩︎
This is loop fusion, in compiler terms. There are other things you can do, but this is the big one. ↩︎
Triton Language Extensions, from Meta. As a disclaimer, these are the folks I work with. ↩︎
Every big software engineering team right now is racing to out-do themselves on their adoption of agentic coding practices, and ship faster. There is something more insidious going on with many of the software engineers I talk to1 though. A lot of pressure to build “more! faster!” comes from themselves.
This shows up all over: the “you only have 2 years to escape the permanent underclass” meme2, or the various breathless LinkedIn or Twitter posts of 996’ing startups, labs, or particularly obsessive interns.
Things that used to require teams can now be done by a sufficiently keen solo engineer with a gang of Claudes, or Codexes, or a K2 agentic swarm. That is thrilling, and it opens the door to projects that you wouldn’t normally have bothered building. But it also opens the door to thinking you need to build those things, and that’s not quite the same.
A common observation from people who take an extended leave from a large corporation is that much of the work they were doing wasn’t all that important. Either no one did it while they were out, or how they left it was… fine. Yet, much of that work somehow regains urgency as they come back to the role.
It’s very hard to tease apart how much of your output actually matters. Coordinating a large group of people inevitably takes overhead, and many annoying aspects of work are genuinely important. But, much like Wanamaker’s famous quote about advertising, half of the work you do doesn’t matter; the trouble is you don’t know which half.
Adding a helpful and harmless model to the mix can certainly accelerate the rate of output, but it doesn’t do much about determining which bucket the work goes into. In fact, I’d say that the problems you take on when given a Max subscription are mildly more likely to be things that haven’t been done because they are not worth doing. The combination of increased capacity and a pervasive sense of urgency is not a great recipe for quality decision making, or for a healthy relationship with your work.
It can be helpful to take the outsider perspective, at work or with personal projects. Would you ask someone else to do whatever you are considering, even with the expectation they would leverage agents to help them?
It’s often easier to see the value in something, or lack thereof, if you have to convince someone else of it. That can save you from some rabbit-holes filled with a sense of obligation to “extract value” from the time you already sunk into a misguided project.
This doesn’t mean you should ignore all of the ideas you have: you really can just do things, and you sometimes should! Just be clear about whether you want to spend your time3 that way, regardless of what the agent is doing.
There are two really good ways to learn the deep fundamentals of a field. One we could call the Carmack/Ilya method: get an expert to give you a list of the seminal papers, systematically work through them, and in the process develop a deep, grounded intuition. This seems to work. The second is: funny tweets.
A case in point:
ok so:
engram is moe over ngramed memory
mHC is moe over the residual stream
NSA is moe over attention
MoE is moe over FFNs
…
im sensing a theme ….
Other than the fact you have to be in a very particular niche in order to understand all the acronyms in that tweet, the idea that everything is an MoE feels right? Pretty much every notable model release, and probably all the secret frontier models, are MoE.
Like every other idea in deep learning, this goes back to something Hinton did in the 90s, specifically the paper Adaptive Mixtures of Local Experts by Jacobs, Jordan, Nowlan and Hinton:
If backpropagation is used to train a single, multilayer network to perform different subtasks on different occasions, there will generally be strong interference effects that lead to slow learning and poor generalization. If we know in advance that a set of training cases may be naturally divided into subsets that correspond to distinct subtasks, interference can be reduced by using a system composed of several different “expert” networks plus a gating network that decides which of the experts should be used for each training case. […] The idea behind such a system is that the gating network allocates a new case to one or a few experts, and, if the output is incorrect, the weight changes are localized to these experts (and the gating network).
The idea is that if your data naturally clusters, then having separate networks avoids smearing understanding across the weights. A dataset with both German and English training data might produce a model that mixes up both languages. If we train two different experts and learn a gating network, we can get a clean “German-speaking” model, and a clean “English-speaking” model, in one.
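In code, the core of the original idea is tiny. Here’s a toy sketch (mine, with made-up sizes): a couple of linear “experts” and a learned gate that decides how to mix them per input:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8            # model width (illustrative)
n_experts = 2    # e.g. a "German" and an "English" expert

# Each expert is just a small linear map here; the gate is another linear map.
experts = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_experts)]
W_gate = rng.standard_normal((d, n_experts)) / np.sqrt(d)

def moe_forward(x):
    """Route a token to a softmax-weighted mixture of experts."""
    logits = x @ W_gate
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Mix expert outputs by gate probability; a hard router would instead
    # take the argmax and send the token to a single expert.
    return sum(p * (x @ W) for p, W in zip(probs, experts))

x = rng.standard_normal(d)
y = moe_forward(x)
print(y.shape)  # (8,)
```

The hard-argmax variant is exactly where the cliff in the next paragraph comes from: the unchosen expert sees no gradient at all.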
Also, like every other idea in deep learning, this was very clever, but painful to train. In particular, this was because the decision about which expert to choose was a bit of a cliff. If you choose the German expert when you needed the English expert then the German expert would get some loss, but the English expert would get none. This could lead to the awkward situation where the German expert performed better for both English and German: you ended up with a smaller, smeared model, and a dead expert.
Noam Shazeer and co came to the rescue in 2017 with the excellently titled “Outrageously Large Neural Networks”. They introduced concepts that didn’t fundamentally change the approach, but did make it practical.
The key trick was adding an auxiliary loss that penalized the model for favoring one expert over the others. By adding some noise to the gating decision they kept it differentiable and ensured errors could flow back effectively. This gave the training process a much better chance of avoiding this kind of “winner-takes-all” collapse.
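A toy version of that recipe (my sketch, not Shazeer’s exact formulation): noise the routing logits, keep only the top-k experts, and penalize skewed expert “importance” with a squared coefficient of variation:

```python
import numpy as np

rng = np.random.default_rng(0)
tokens, d, n_experts, k = 32, 8, 4, 2   # illustrative sizes

W_gate = rng.standard_normal((d, n_experts)) / np.sqrt(d)
X = rng.standard_normal((tokens, d))

def noisy_top_k_gate(X, noise_std=1.0):
    """Noisy gating: jitter the routing logits, keep the top-k experts."""
    logits = X @ W_gate + noise_std * rng.standard_normal((X.shape[0], n_experts))
    gates = np.zeros_like(logits)
    for i, row in enumerate(logits):
        top = np.argsort(row)[-k:]              # indices of the k winners
        e = np.exp(row[top] - row[top].max())
        gates[i, top] = e / e.sum()             # softmax over the winners only
    return gates

gates = noisy_top_k_gate(X)

# Load-balancing auxiliary loss: penalize skewed per-expert "importance"
# (squared coefficient of variation over the batch).
importance = gates.sum(axis=0)
aux_loss = (importance.std() / importance.mean()) ** 2
print(round(float(aux_loss), 3))
```

The noise means an expert that’s merely second-best still gets picked sometimes, and the auxiliary loss pushes against any one expert hoarding all the traffic.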
Over time these methods were refined. In a contemporary MoE like DeepSeek v3, sigmoid-based routing removed the noise from the gating, and the auxiliary loss is dropped in favor of what they call bias updates: they just put their thumb on the scale during training if some experts aren’t getting enough samples, which seems to work great.
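The bias-update trick is simple enough to sketch in a few lines. This is my loose reading of the idea, not DeepSeek’s actual code: a per-expert bias tilts which experts get selected (it never touches the output weights), and gets nudged each batch toward balance:

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, k, gamma = 4, 1, 0.01
bias = np.zeros(n_experts)

def route(scores):
    # The bias affects selection only, not the mixing weights.
    return np.argsort(scores + bias)[-k:]

for _ in range(200):   # simulate batches where expert 0 is "too popular"
    scores = rng.standard_normal((64, n_experts)) + np.array([2.0, 0, 0, 0])
    counts = np.bincount(
        np.concatenate([route(s) for s in scores]), minlength=n_experts)
    target = counts.mean()
    bias += gamma * np.sign(target - counts)   # thumb on the scale

print(bias.round(2))   # the over-used expert ends up with a negative bias
```

No extra loss term, no gradient interference: just a counter-pressure applied directly to the routing decision.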
All of that is about how we got MoEs to scale, but doesn’t really say… why? Intuitively, if you can train a model with X parameters, it seems like it would be better to have all of them doing something (a dense model), rather than some subset1?
The main reason this has taken over the field is it is a way of decoupling capacity (how much can the network “know”) from compute (how much work does it do for each input).
In a dense model when you add a new token to train you send it to all parts of the model: every bit of capacity touches it, each of which uses some compute to process. MoEs are a form of sparsity: a way of ignoring some of the parameters. They let you add capacity without adding compute2.
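Using the illustrative numbers from the footnote (100bn total parameters, 20bn active per token), the back-of-envelope looks like:

```python
# Capacity vs compute: a 100bn-parameter MoE with 20bn active per token
# does roughly the FLOPs of a 20bn dense model while storing 5x the "knowledge".
total_params = 100e9
active_params = 20e9

flops_per_token_dense = 2 * total_params   # ~2 FLOPs per parameter per token
flops_per_token_moe = 2 * active_params

print(flops_per_token_dense / flops_per_token_moe)  # 5.0x less compute
print(total_params / active_params)                 # at the same capacity
```

The memory bill for the inactive parameters still comes due, which is why MoE serving is all about where those weights live.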
There are other ways of achieving the same result, but the MoE approach is very hardware friendly. You’re still mostly doing dense matmuls, just split between experts. In parallelism terms, Expert Parallelism is efficient because you’re moving tokens between devices rather than weights: it needs an all-to-all, but the data volumes are manageable.
The tweet calls out NSA, engram and mHC, all recent papers from Deepseek. But underneath it calls out the design pattern: make a few alternative compute or memory paths, then use a learned gate to pick (or mix) a subset of them, per token. You get sparsity at the routing level, decoupling formerly coupled aspects, while each path can remain fairly dense and hardware-friendly.
Engrams makes the argument that language models have to do two things: reasoning and looking stuff up. The reasoning works great with stacks of Transformers, but the looking-stuff-up part is approximated through computation rather than just… looking stuff up.
This process essentially amounts to an expensive runtime reconstruction of a static lookup table, wasting valuable sequential depth on trivial operations that could otherwise be allocated to higher-level reasoning.
Classically, Natural Language Processing used a lot of N-grams: representations of more than one token at a time, but language models pretty much dropped that in favor of a fixed vocabulary. Deepseek is bringing it back. These extra embeddings are retrieved for subsets3 of the tokens in the context window, the resulting vectors are summed4, then the model gates how much to incorporate the information based on the current state.
It’s the same move of decoupling compute and capacity. Here they are adding a bunch of extra storage parameters but letting the model learn whether or not to use them. Because the retrieval is based on tokens the table doesn’t have to live in VRAM but can be loaded with the input5.
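A toy sketch of the shape of this (mine, not the paper’s recipe): hash bigrams of token ids into a big embedding table, sum the hits, and let the hidden state gate how much memory to mix in:

```python
import numpy as np

rng = np.random.default_rng(0)
d, table_size = 8, 1024
table = rng.standard_normal((table_size, d)) * 0.02   # could live off-VRAM
W_gate = rng.standard_normal(d) / np.sqrt(d)

def ngram_memory(token_ids, hidden):
    """Look up bigram embeddings and gate them into the hidden state."""
    bigrams = zip(token_ids, token_ids[1:])
    slots = [hash(bg) % table_size for bg in bigrams]   # static lookup, no matmul
    mem = table[slots].sum(axis=0)
    gate = 1.0 / (1.0 + np.exp(-(hidden @ W_gate)))     # state decides how much
    return hidden + gate * mem

h = rng.standard_normal(d)
out = ngram_memory([5, 17, 17, 3], h)
print(out.shape)   # (8,)
```

The lookup itself is trivially cheap; all the learning happens in the table contents and the gate.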
The second paper, Manifold-Constrained Hyper-Connections (mHC), is the most math-heavy of the recent releases, and it builds on one of the most cited papers in ML: ResNet.
In the bad old days, the “Deep” in Deep Neural Nets didn’t really exist: you could theorize, but if you tried to train one you’d get into a place where the early layers received basically no useful loss signal. ResNets fixed this in the simplest way possible: as well as sending through the “output” of a layer, you sent through the input too. This gave an efficient highway for loss gradients to flow back and enabled training much, much deeper models successfully.
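The fix really is that simple. A sketch with made-up sizes: stack fifty deliberately badly-scaled toy layers with and without the identity highway, and watch the plain path’s signal vanish:

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 8, 50

def layer(x, W):
    return np.maximum(0, x @ W)      # a deliberately badly-scaled layer

Ws = [rng.standard_normal((d, d)) * 0.05 for _ in range(depth)]

x_plain = x_res = rng.standard_normal(d)
for W in Ws:
    x_plain = layer(x_plain, W)      # plain stack: signal shrinks every layer
    x_res = x_res + layer(x_res, W)  # residual: x + f(x), identity highway

print(float(np.abs(x_plain).max()), float(np.abs(x_res).max()))
```

The plain stack collapses to numerically nothing, while the residual path keeps the signal alive; the gradient story on the backward pass is the mirror image.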
mHC builds on an observation that ResNets hard-code another compute/capacity tradeoff: the size of the residual channel. If you think of a layer of a transformer: it has an input of C tokens, and an output the same size. The residual connection works by summing the input tokens and the output tokens. That’s assigning as much information capacity to the residual channel as you do to the processing channel. E.g.
Layer 0 gets raw tokens, and outputs a sum of raw+contextualized tokens
Layer 1 gets layer 0 tokens and outputs a sum of layer0+contextualized tokens
Etc.
At the end you get a cake recipe
But maybe that cake recipe would be better if Layer 2 had access not just to the layer0 tokens, but also to the raw tokens? We don’t really have a way to express that outside of adding extra skip connections. Hyper Connections widen the ResNet channel into multiple lanes, and mHC lets the model decide what to put in each: so you could have layer 1 putting layer0 context in one lane, and raw tokens in another lane6. If MoE lets you take a bunch of parameters and selectively route tokens to a subset, then mHC lets you take a bunch of residual bandwidth and selectively mix the information flow from your module to a subset of it.
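Here’s a rough sketch of the multi-lane idea (names and numbers are mine): learned read weights mix the lanes into the layer’s input, learned write weights decide which lanes the output lands in, and a lane with a zero write weight keeps carrying the raw input untouched:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_lanes = 8, 2

x0 = rng.standard_normal(d)
lanes = np.tile(x0, (n_lanes, 1))         # start: every lane carries the input

def hyper_layer(lanes, W, read_w, write_w):
    x = read_w @ lanes                     # mix the lanes into the layer input
    out = np.maximum(0, x @ W)             # the layer's contextualized output
    return lanes + np.outer(write_w, out)  # each lane takes its own share

W = rng.standard_normal((d, d)) / np.sqrt(d)
read_w = np.array([0.7, 0.3])              # learned in the real model
write_w = np.array([1.0, 0.0])             # lane 1 opts out: it keeps raw tokens

lanes = hyper_layer(lanes, W, read_w, write_w)
print(lanes.shape)   # (2, 8)
```

After this layer, lane 0 carries input-plus-context while lane 1 still holds the raw tokens exactly, which is the “extra skip connection” expressed as a learned routing decision.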
Finally, Native Sparse Attention follows the classic Deepseek move of throwing a bunch of engineering wins together. Instead of assuming the amount of attention compute for each token is the same, they scale it dynamically based on the content itself. They mix the outputs of a pooled version of the context window to get a compressed representation, a MoE-style gated selection from the full context window7, and a classic sliding window attention.
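Stubbing out the three attention branches, the mixing step is just MoE-style gating again (I’m using a softmax gate here for simplicity; the paper’s exact gating differs):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
# Stand-ins for the compressed, selected, and sliding-window branch outputs.
branch_outs = rng.standard_normal((3, d))
# In the real model these come from a small net on the current token state.
gate_logits = rng.standard_normal(3)

g = np.exp(gate_logits - gate_logits.max())
g /= g.sum()
mixed = g @ branch_outs   # gated blend of the three attention paths
print(mixed.shape)  # (8,)
```

Same pattern as everywhere else in this post: several compute paths, one learned gate deciding the blend per token.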
This is the pattern MoE exemplified:
look at what is constrained
add more of it, but make it conditional to avoid scaling other things at the same time
It’s a thread that runs through an awful lot of the industry right now. Understanding that is useful when anticipating where things are going to go next.
Or, you could have saved yourself a lot of time and just liked the tweet.
MoEs do have some inference advantages: if you have a 100bn-parameter model where just 20bn are active for a given token you simply have to do less work than a 100bn-parameter dense model. That’s a win for latency! But, you still have to store all those 100bn parameters, meaning you need quite a lot of memory kicking around. ↩︎
More specifically, they make the ratio of capacity to compute very flexible: modern MoEs often have many experts and activate several at a time. ↩︎
In practice they inject the ngram embeddings at a couple of different points later in the model, where empirically there seemed to be enough context for the model to make useful mixing decisions ↩︎
The specific clever thing the Deepseek folks added was a constraint to stop this from exploding, using the wonderfully named Sinkhorn-Knopp algorithm (apparently) ↩︎
Based on those pooled tokens. Effectively it’s taking the “summarized” context window, and using runtime gating to decide which bits of the context window to add in full. ↩︎
I think the most important AI question is, at some level, how to deploy it so that it is a genuinely positive force across a wide spectrum of people.
I like to tell a story to describe Why Are Things This Way, for some wide hand wave of the world right now and it goes like this: The post-Cold War era marked a renaissance in global trade, what some people call the Pax Americana. This period of globalization rested on two American pillars and one Chinese: the U.S. dollar’s status as the world’s reserve currency, the U.S. Navy’s command of maritime shipping lanes and the rapid development of highly scaled manufacturing.
Underpinning this system was a constellation of technologies: containerization1, ERP systems, advanced telecommunications, the financialization of assets, and cheap energy. China’s accession to the WTO in 2001 was the culmination of its Reform and Opening policy, with leaders like Hu Jintao embodying a sense of forward momentum. We were at the apex of Francis Fukuyama’s “end of history”: the belief that liberal democratic capitalism represented the final stage of human political evolution.
This felt like a rising tide that might, for once, actually lift all boats. Growth did materialize. We witnessed a substantial economic expansion that lifted millions out of poverty, most dramatically in China but also across dozens of countries where GDP and living standards surged.
It was easy to look at this trend line and extrapolate upwards. The most common objection to that extrapolation was that it relied on non-renewable, extractive energy and materials2. But I think this argument was a mistake on both sides: globalization represented a step change, a one-time shift enabled by a unique convergence of technologies that amplified the principles of specialization and trade to an unprecedented scale.
These technologies were big leaps, but their diffusion unfolded gradually, over decades. This extended rollout created an illusion of continuous growth. As Jeffrey Ding argues in his excellent 2024 book Technology and the Rise of Great Powers, the critical factor is not which nation invents technology first, but which spreads it through their economy faster. The same principle applies globally: diffusion creates the feeling of growth, but it’s just the future being unevenly distributed. We thought the end of the Cold War ended history, but in reality it just gave us a really good logistics stack.
The Great Financial Crisis of 2008 was the first major crack, exposing a disconnect between elite consensus and public experience. Contagion from the U.S. subprime mortgage market rippled worldwide, shattering faith in both institutions and experts. Austerity measures inflicted deep pain on the median voter, while ZIRP boosted GDP figures and asset valuations, widening the gap between elite enrichment and broad-based prosperity. COVID-19 extinguished any lingering illusion of elite competence. Supply chains collapsed across critical sectors, from masks to electronics to, oddly, toilet paper.
Today, in the post-pandemic, post-austerity landscape, we’ve seen a decisive shift toward realpolitik and narrow, short-term domestic political calculation: people are less trusting of “the system” and more receptive to those actively disrupting it.
If you ask a random person in Hayes Valley they’ll say that AI is a similar step change, maybe even larger: it could lead to flourishing prosperity or possibly doom everyone to being consumed by a rogue instance of Claude obsessed with the Golden Gate bridge. Unlike the internet or mobile phones AI is emerging in a volatile, multilateral world with a broken trust environment.
AI needs vast resources: data, compute, electricity, technical skills, integration and political support. The US and China have adopted somewhat divergent approaches to how to manage that.
The U.S. model emphasizes corporate AI accountability and regulation that favors large incumbents, restriction over compute resources through export controls, and voluntary safety frameworks developed largely by industry. In essence, they are asking the public to trust corporate institutions to manage AI safely, and to deliver the long-term societal benefits to consumers. It’s downstream of the way the tech giants like Microsoft, Google and Apple have navigated government before: hands off during rapid growth, then clear regulations to offer a stable business environment.
China’s Internet companies are under no illusions about who is in charge, particularly after the crackdowns on gaming and social media a few years back. The Chinese Communist Party is caught in a bind: AI aligns very well with the kind of hard science, foundational technology they want to prioritize, but is dependent on foreign technology and needs the kinds of data and skills that exist within the big social conglomerates they just tried to rein in. China is running a playbook of rapid diffusion and explosive competition, with the government putting heavy hands on the scale: who can buy which GPUs, what kind of content controls must exist, and a national level AI plan. It says to the public: trust in the party, and we will ensure AI delivers social benefit, for our definition of “social benefit”.
“One major question, going into 2026, is which party will speak for the Americans who abhor the incursions of A.I. into their lives and want to see its reach restricted. Another is whether widespread public hostility to this technology even matters, given all the money behind it. We’ll soon start to find out not just how much A.I. is going to remake our democracy but also to what degree we still have one.”
Goldberg is asking who will promise to assert state control for AI: who will bring the Chinese model to America. The fundamental problem for me is I’m not sure that the public trust the government much more than they do corporations, or the media.
If AI is a global-scale step change it requires global coordination, which is expensive: it’s something folks can engage with when everything is going well. When times are tougher, economic interdependence morphs into leverage for coercion and control. Blackwells and rare earths and SWIFT messages become chips on the bargaining table.
Billions of people use large language models, which gives the creators of those models influence in how people act. But thus far the influence on the models from their users is very indirect: aggregate usage patterns or occasional thumbs up/thumbs down feedback. Open Weight and Open Source models offered folks more control, but despite the slow death of scaling the ability to train and operate a true frontier model remains a very large hurdle.
What would it take for people to believe that this power is being used in a way that includes them, instead of being done to them? In a high-trust world you can do that with credentials and commitments. In this world you can’t. People don’t need safety and impact reports from labs or promises of benevolence from the state. They need leverage.
Trust can’t scale, but verification maybe can. That requires independent auditing, liability, and transparency around capabilities and how those capabilities are being deployed. Open weights help, competition helps, and national strategies help, but none solve the whole problem. We’re building machines that can reason. We also need to build systems where the people who own the machines can’t silently rewrite the terms of everyone else’s lives.
What we do in machine learning owes a lot to the history of computer graphics. Folks like Kurt Akeley, one of the founders of SGI, identified that 3D graphics have a naturally pipelined structure. You have a high volume of similar operations, such as applying pixel-y soldier textures to a mesh of triangles, and by pipelining them you can find an opportunity for a high degree of parallelism.
Akeley was one of the drivers of OpenGL, which provided a standard interface to that pipeline, and later worked with Nvidia on CG, a realtime shader language and compiler. Shader languages, as used in Pixar’s RenderMan and other non-realtime 3D use cases, introduced an approach where you could manage lighting programmatically by describing the transforms to each individual element. The shader would be run in parallel across all the geometry or pixels it was addressing.
With CUDA, Ian Buck and others at Nvidia helped formalize what had been true in the hardware for a while: GPUs were massively parallel processing machines, not just polygon factories. CUDA was part of a move from the supercomputer approach of Single Instruction Multiple Data (SIMD) to Single Instruction Multiple Thread (SIMT). On a Cray or other vector oriented processor you had to pack the work into a vector. CUDA let programmers familiar with CPU threads think in those terms instead. Under the hood, the threads in a warp were executed in lockstep, but they could be masked off to allow for divergence. It was flexible, fast, and attracted the attention of the machine learning community. Because so much of ML is large matmuls, Nvidia bolted on Tensor Cores as specialized co-processors that handled blocks of matrix math efficiently. This combination of performant hardware and flexible software helped make Nvidia the most valuable company in the world, and drive up house prices across the Bay Area.
But, it transpires, not everyone loved shoveling their margin to Jensen, and they looked for more cost-efficient ways to run ML workloads. The flexibility for threads to branch, pause or switch requires infrastructure and silicon. You need big register files per core, multiple levels of memory to cache, and logic to manage swapping in and out warps.
If you look at the “do the math” parts of a chip, a CPU probably only spends about 10% of silicon on that, with the rest managing the chaos of running an operating system: branch prediction, caching, data movement. A GPU, in contrast, is a wildly efficient machine, with maybe 30-40% of the silicon dedicated to mathing effectively.
When Google looked at the problem of running inference at their scale back in the dark ages of 2016 they wanted to spend as much of their budget as possible doing the math, to keep the costs as low as they could. The chip they created, the Tensor Processing Unit (TPU), recently hit its 7th iteration and SemiAnalysis published an extensive breakdown on it: TPU v7 Ironwood, quickly followed up with a deep dive on Amazon’s Trainium v3.
Trainium3 takes a similar approach to Trainium2 and Google’s TPU and builds the chip out of a small number of large NeuronCores. This contrasts with GPU architectures like Nvidia’s and AMD’s, which instead use a large number of smaller tensor cores. Large cores are typically better for GenAI workloads since they have less control overhead.
Dylan and his team are touting these as the first chips to genuinely threaten Nvidia’s moat. The big frontier labs seem interested, with deals and investigation from Anthropic, OpenAI, Meta and others. As the piece repeatedly points out, if you want to understand the dominance of Nvidia you have to focus on the system, and not the microarchitecture. So, of course, I want to talk exclusively about the microarchitecture here.
TPU, Trainium, as well as other custom approaches like Meta’s MTIA1 lean on an approach called Systolic Arrays. As a recap, Nvidia’s Streaming Multiprocessors (SMs), AMD’s compute units, and so on are cooperative multiprocessors. They access registers, talk to caches and handle the flow of data. Threads can request data if it’s not ready and the hardware warp schedulers will swap in another piece of work to keep the chip humming.
Systolic arrays are different. The name comes from systole, the phase where your heart pumps blood. In a systolic array, you load your data once and fire it through a grid of Processing Elements (PEs). Each element maths its math then passes the result to its neighbor on the next clock tick.
This was very much in line with the needs of the original TPU: load a set of model weights up, then pump user requests through as efficiently as possible. TPUv1 only supported int8: it was a low-bit, high-efficiency matmul machine. The data flow needed to be pre-determined: you set it up and make it go, which made it incredibly silicon efficient. You don’t need lots of caches or schedulers, and in fact the original TPU didn’t have any at all!
The con of course was that you have to get it right! If the data isn’t there to pump in, the whole thing just waits. There is no backup plan of another warp, no other threads. Not only that, but because the systolic arrays are generally a lot bigger (say 256×256 vs the Tensor Cores’ 16×16), you have fewer of them. While an Nvidia GPU might have more than 100 SMs, a Trainium v3 has 8 cores, and a TPU has just 2. Each core is a lot larger, and wasting it gets a lot more expensive.
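To make the “pump it through a grid” idea concrete, here’s a toy output-stationary systolic array in plain Python: skewed inputs stream in from the edges, every PE multiply-accumulates and shifts its values to a neighbor on each tick, and there is no scheduler to save you if an input is late:

```python
import numpy as np

def systolic_matmul(A, B):
    """C = A @ B on an n x n grid of PEs, one multiply-accumulate per tick."""
    n = A.shape[0]
    C = np.zeros((n, n))
    a_reg = np.zeros((n, n))   # value each PE holds from its left neighbor
    b_reg = np.zeros((n, n))   # value each PE holds from its top neighbor
    for t in range(3 * n - 2):                 # ticks until the wavefront exits
        for i in reversed(range(n)):
            for j in reversed(range(n)):
                # Shift: take from the left/top neighbor, or feed skewed
                # rows of A / columns of B in at the edges.
                a_in = a_reg[i, j - 1] if j > 0 else (
                    A[i, t - i] if 0 <= t - i < n else 0.0)
                b_in = b_reg[i - 1, j] if i > 0 else (
                    B[t - j, j] if 0 <= t - j < n else 0.0)
                a_reg[i, j], b_reg[i, j] = a_in, b_in
                C[i, j] += a_in * b_in         # the PE "maths its math"
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))
print(np.allclose(systolic_matmul(A, B), A @ B))  # True
```

Notice there is no control flow responding to the data at all: the entire schedule is fixed by the geometry, which is exactly why the silicon can be so lean.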
Presumably Jeff Dean just programmed these right the first time, but for the rest of Google (and later the world) they spent years building XLA (Accelerated Linear Algebra), a full-graph compiler. In GPU kernel programming the challenge is hiding memory latency and managing register pressure. On a TPU-type approach, there is one massive VMEM that fulfills a similar role as the registers and no memory hierarchy, but you can’t rely on the hardware to swap between jobs. XLA needs to know exactly how the graph works so that it can schedule the right data at the right time.
TPUs used a VLIW architecture: Very Long Instruction Words. Rather than a traditional instruction set with diverse instructions, VLIW lets you bundle Very Long packages of instructions into single units (kind of a silicon equivalent of German) which execute operations on each of the different units of the core at the same time. This was introduced in TPU v2, and it’s where the pressure on the compiler really multiplied.
To draw a GPU analogy, if you think about something like Relu(A×B+C) you have a graph of operations: A×B -> Result, Result + C -> Result2, Relu(Result2). To optimize that you could use a CUDA graph to compile it into a single dispatch, cutting CPU/GPU communication. One step further would be kernel fusion: keep all the intermediate results in registers and write one kernel that avoids the back and forth to higher tier memory. That lets you bundle up even more, but you have to have even higher confidence in the sizes involved to avoid running out of registers.
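A toy illustration of what fusion buys, with pure Python standing in for kernels: the unfused version materializes two full intermediate buffers, while the fused one keeps the accumulator in a local variable (the stand-in for a register) and does the add and relu in the same pass:

```python
def relu_matmul_unfused(A, B, C):
    """Three 'kernels': matmul, add, relu, each writing a full buffer."""
    tmp1 = [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]                                   # buffer 1
    tmp2 = [[x + c for x, c in zip(r1, rc)]
            for r1, rc in zip(tmp1, C)]                     # buffer 2
    return [[max(0.0, x) for x in r] for r in tmp2]

def relu_matmul_fused(A, B, C):
    """One 'kernel': the intermediate never leaves the local accumulator."""
    out = []
    for i, row in enumerate(A):
        out.append([])
        for j, col in enumerate(zip(*B)):
            acc = sum(a * b for a, b in zip(row, col))   # stays in a "register"
            out[-1].append(max(0.0, acc + C[i][j]))      # add + relu in place
    return out

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[0.0, 1.0], [1.0, 0.0]]
C = [[-5.0, 0.0], [0.0, 0.0]]
print(relu_matmul_fused(A, B, C))  # [[0.0, 1.0], [4.0, 3.0]]
```

Same answer, two fewer round trips through “memory”; on real hardware those round trips are the expensive part.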
VLIW is like parallel kernel fusion: a TPU v2 had 2 matrix units, 2 vector units, 2 scalar units and 2 memory load/store units2. To keep them busy every step the compiler needs to plan ahead enough to give each of them something useful to do. VLIW instructions bundle those ops along with any constants needed into a single instruction. Fusion goes from being an optimization to being a necessity. Once you get it though, you can spend more like 50-60% of your silicon on the part you care most about, and that translates into an excellent total cost of ownership.
Does this mean we should all be cancelling our Rubin orders and buying TPUs? I mean, no. But there is some nuance. Choosing between flexible streaming processors and efficient systolic megacores feels drastic, but I think it might not matter quite as much as it seems.
Research still overwhelmingly benefits from flexibility. You are running experiments, solving bottlenecks and debugging. Nvidia tends to be the big lab tool of choice thanks to the flexibility, the depth of tooling and the general CUDA ecosystem3.
If you are mainly serving a massive model, it’s worth the investment to lock down all the weirdness and optimize it. That’s where the megacore chips have proved their mettle first, with TPU, Inferentia4, MTIA and others all starting on that side of the house.
Folks like Akeley and Buck realized that when you’re building a chip you’re really building a programming model. Get that right, and the model can long outlast the hardware. Balancing expressivity with performance is the thing that lets a platform win: who best lets researchers and engineers define the future without fighting the silicon.
What seems to be emerging isn’t quite the SIMT/CUDA architecture: it’s something around expressing the dataflow of tiles in the critical kernels5, while relying on a compiler to optimize the larger graph and compute.
Making sure that you have access to the right software might be more important than trying to perfectly identify which hardware platform is the once and future king. But also, look, the world moves fast and if you get a Prime Day deal on Trainium instances, you should probably just take it. The hardware can and will change and it can always be adopted, as the frontier labs are showing. If we keep hunting for the expressivity we need, as OpenGL, CUDA, Triton and others have over the years, we will keep unlocking the possibilities in whatever hardware is available.
Disclosure: I work at Meta and like these chips a lot, though no one would let me anywhere near any chip design, luckily enough ↩︎
Newer versions have others too, like the sparse cores in TPU v6 and v7 which are basically dedicated embedding management processors ↩︎
With the notable exception of Google themselves, though the Jax-XLA-TPU ecosystem is very rich internally ↩︎
Back in 1817 David Ricardo published a very influential theory on an interesting question: Why trade, and particularly why trade when you are better at producing something than other countries?
He gave an example of England and Portugal, in a world where there were just two goods, wine and cloth. In England it took 100 person-hours to make one unit of cloth, and 120 to make one unit of wine. The Portuguese, on the other hand, took 90 hours to make a unit of cloth and 80 to make a unit of wine. England is worse at making both wine and cloth, so why trade? Why doesn’t Portugal just make everything for itself?
Well, it turns out that while England lacked the famed Portuguese efficiency, it was way worse at wine than it was at cloth. England could trade one unit of English cloth for one unit of Portuguese wine, which meant the wine cost them (effectively) 100 person-hours vs 120 they would have needed to make it themselves: a clear win! But Portugal won too: by focusing on wine rather than cloth they could trade 80 hours of work (for the wine) for some cloth that would have cost them 90 hours to make.
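The arithmetic is worth doing once yourself, with Ricardo’s own numbers:

```python
# Person-hours per unit, from Ricardo's England/Portugal example.
hours = {"England": {"cloth": 100, "wine": 120},
         "Portugal": {"cloth": 90, "wine": 80}}

# England trades 1 cloth (100 hours of work) for 1 wine,
# instead of spending 120 hours making the wine itself.
england_saving = hours["England"]["wine"] - hours["England"]["cloth"]     # 20

# Portugal trades 1 wine (80 hours of work) for 1 cloth,
# instead of spending 90 hours making the cloth itself.
portugal_saving = hours["Portugal"]["cloth"] - hours["Portugal"]["wine"]  # 10

print(england_saving, portugal_saving)  # both positive: trade helps both sides
```

Both savings are positive even though Portugal is absolutely better at both goods, which is the whole surprise of the theory.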
Ricardo described this as a comparative advantage: by leaning into their relative specialties, countries could benefit from trade, even if they are generally more efficient than their competitors. This was a clever insight, globalization happened, and we eventually ended up with Temu.
Of course, things are never quite as simple as economists’ models (annoyingly to economists the world over), and within his own life there were some interesting wrinkles. Sticking with the textiles theme one of them happened to weavers: people who took thread and turned it into fabric. There was a period, shortly before Ricardo published his theory, that some call the Golden Age of the handloom weaver. Spinning, turning material into threads, had been mechanized thanks to the Spinning Jenny, which made yarn cheaply available. Weavers became the bottleneck to turn that yarn into saleable cloth. Weavers worked from home, controlled their schedule, and made excellent money while doing so.
What changed next was the power loom1. Using the hand loom required dexterity and practice to master the shuttle and weave, but the power loom just needed someone to mind it and occasionally unjam things. Weavers’ earnings collapsed from around 20 shillings a week in 1800 to 8 shillings by 1820. The power loom enabled turning yarn into cloth efficiently and cheaply, without the need for years of deep skill and practice.
Ricardo was, at the end of his life, right there to observe the start of this transition, and in the third edition of his book Principles of Political Economy he added a chapter titled “On Machinery”. Comparative advantage says that if a machine comes out that is better at some job humans should move to a place where they are comparatively better (like fixing the machine). Ricardo realized that machinery could increase the profit for the factory owner while decreasing the gross income to workers: it shifted returns from labor to capital. The power loom took the primary asset of the weavers, their dexterity and practice, and made it economically irrelevant.
This feels worth discussing because in many ways software engineering has been going through a Golden Age of the handloom coder, particularly in the post-pandemic expansion from 2020-2022, where it was a very, very valuable skill indeed.
While SWE wages have yet to collapse to shillings, there has been a definite cooling through rounds of layoffs and shifts to capital expenditure, accelerated by the adoption of strong coding models. Generating syntactically correct code has become way cheaper, and the bottleneck that was shipping code to production is shifting from writing code to proving it is correct. There is still a huge amount that hasn’t changed: identifying requirements, making choices on implementation paths, and thinking about the overall system, but slinging code is becoming a different job, quickly. The primary beneficiaries so far are those selling the pythonic power looms: the big labs and key tooling and hardware providers.
In my own direct experience coding assistance went from being a somewhat niche interest, one that required regular selling to VPs to keep them investing in it, to a top level company mandate with accompanying metrics. The question I have found myself discussing with many smart engineers recently is: are we the weavers, or, you know, is everyone a weaver? Is this another industrial revolution like steam or electricity, or something perhaps even larger?
Steve Newman of the Golden Gate Institute of AI2 (and one of the creators of Google Docs), wrote up one of the best “maybe it’s different this time” posts I’ve read in a bit, and not just because it involves robots mining Ceres3.
“I spend a lot of time in this blog arguing that AI’s near-term impact is overestimated, to the point where some people think of me as an AI skeptic. I think that predictions of massive change in the next few years are unrealistic. But as the saying goes, we tend to overestimate the effect of a technology in the short run, and underestimate it in the long run. Today, I’m going to address the flip side of the coin, and present a case that the long-term effect of AI could be very large indeed.”
The core of Newman’s argument is that AI is the first technology we have developed that could, potentially, be more adaptive than we are. As a way of illustrating, let’s stick with what everyone comes to this blog for: 19th century weavers.
Despite all of the above automation, weavers still had a role in more complex or limited run designs where the expense and effort of setting up a power loom didn’t make sense. Then, the Jacquard loom made the design flexible: you specified the design by punching holes in a card4 and the loom wove the design. The comparative advantage shifted away from weaving entirely, into designing and encoding. Pattern designers became some of the first programmers of mechanical systems as card punchers. The unique human advantage was adaptability: we added a level of flexibility, and the humans then adapted to work above this level.
Newman argues that AI is a cognitive loom: the power loom replaced dexterity and practice, the Jacquard loom made it flexible and adaptable, but someone still needed to punch the cards. Humans adapted and learned new skills; AI, he argues, might be able to learn those new skills faster.
“My point is simply that once AI crosses some threshold of adaptability and independence, there will be paths around the traditional barriers to change. And then things will really start to get weird.”
This doesn’t inherently invalidate the idea of comparative advantage, but it might make it practically irrelevant if the market value of the human advantage drops below the cost of subsistence. If a future AGI’s opportunity cost is tiny, maybe there just isn’t enough left for humans when it comes to matters of substance.
Comparative advantage is, fundamentally, about tradeoffs. Technology is our great lever of progress to remove some of those tradeoffs, but we have historically always run into more. Even if we were out mining asteroids with robots and building giant data centers autonomously there is still not infinite compute, and there is still not infinite time. There will always be some set of tradeoffs that have to be made, some range of competing options to choose between.
What is valuable or notable in that environment can look markedly different. To look at the Victorians again, the art world was significantly impacted by the advent of photography, as (within certain bounds) it effectively solved realism. Artists responded by developing impressionism: the comparative advantage they retained was subjectivity and emotional context. Even the most opium-enhanced Victorian futurist would have to be lucky to predict Cubism from reading about William Henry Fox Talbot.
Humans do seem to me to have a comparative advantage in some areas, particularly:
Reality
Desires
We are grounded as creatures in the world, not in textual or video inputs. We evolved in the world, and are richly adapted to it, in ways that are not always obvious, even to ourselves.
We also tend to view intelligence as being coupled to wanting things, because creatures notably less intelligent than us seem to want things, and we certainly have any number of desires. It might be true that an AGI wants things, but it’s not clear that it must be true. I feel even more confident that on the way to AGI we will build some pretty powerful systems that don’t really “want things” in the same way we do: they may be agentic, but they are not truly agents with goals absent human input.
Since we are already living in part of that future, I asked Gemini what it thought might be the human comparative advantage. As I hoped, it told me I was absolutely right:
“Since we (AIs) are designed to serve human intent, the scarcest resource for us is accurate data on human preference. If you can predict what humanity will value in 10 years (e.g., “Will we value privacy or convenience more?”), that information would be incredibly valuable to a superintelligence trying to optimize its resources.”
In a world of tradeoffs there will still have to be choices, and many of those choices are not easily, observably optimizable. Our ability to be in the world and have preferences might be the most valuable aspect of us after all. Maybe the role of the software engineer of the future, or perhaps of people of the future, isn’t so much doing work or even managing work, it’s instead curating the work.
One example of that kind of activity is a DJ: they create a vibe by arranging songs based on their taste and the response of the audience. Folks choose to go to certain DJs not because they are objectively better, but because they are who they are.
This might sound a bit silly, but in practice much of modern work is not so much about doing the thing as it is about doing the thing a certain way. Still, is the future of humanity collectively making sure the vibes are right? From a certain point of view, what we have always done, collectively, is build a culture. And what is culture other than the right vibes? Perhaps our future is just a continuation of our history, with new technologies, and new tradeoffs.
As an aside this influenced various other uses of punch cards for data storage, leading to IBM and from thence to the fact your terminal defaults to 80 character widths ↩︎
Language modelling is one of the great ideas in ML: if you train a model to accurately predict the next word in a sequence of text1, you are forcing it to learn a deep structure for human language. Because language is how we map reality, hopefully then you can do many useful things. This turned out to be right!
The challenge with actually, you know, doing this is that text is messy. It’s sequential, variable length, and has structure, but the structure is kind of weird: the phrase “the cat, a mellow long-haired persian, sat on the mat” very clearly associates “sat” with “cat”, but the actual words are quite far away2.
Dealing with sequential, variable-length data with a fixed network is a bit of an inherent mismatch. In training you often know the sizes you’re dealing with, but at inference time it’s variable. One elegant solution to that was the Recurrent Neural Network (RNN): start at the beginning, read one word at a time, and keep a “hidden state” as a scratch pad to provide memory of what has come before.
Training RNNs was painful, because now you have to backpropagate over multiple steps, and it was a minefield of vanishing and exploding gradients. The hidden state was also doing two different jobs: serving as the long-term memory of the whole sequence and as the key for predicting the next word.
Getting to Attention
The architecture that really addressed this was the LSTM: instead of a single memory they split short and long-term memory, and added activation functions to keep the gradient updates sane. They also made updating the memory a function of the input rather than of the weights alone, by adding learnable gates that let the model decide which parts of the input to remember, and what information from the memory to forget. This unlocked real sequence-to-sequence models, which proved immediately useful in areas like machine translation: one model reads a sequence and compresses it to a hidden state (the encoder), another generates new output based on it (the decoder).
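As a sketch of those gates (NumPy, with my own variable names and random matrices standing in for learned weights), one LSTM step looks roughly like:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W):
    """One LSTM step: learnable gates decide what to forget from the
    long-term memory c, what to write into it, and what to expose."""
    z = np.concatenate([x, h])              # input + short-term memory
    f = sigmoid(W["f"] @ z)                 # forget gate
    i = sigmoid(W["i"] @ z)                 # input gate
    o = sigmoid(W["o"] @ z)                 # output gate
    c = f * c + i * np.tanh(W["g"] @ z)     # update long-term memory
    h = o * np.tanh(c)                      # new short-term memory / output
    return h, c

rng = np.random.default_rng(0)
dx, dh = 8, 4
W = {k: rng.standard_normal((dh, dx + dh)) * 0.1 for k in "fiog"}
h, c = np.zeros(dh), np.zeros(dh)
for _ in range(10):                         # run over a short sequence
    h, c = lstm_step(rng.standard_normal(dx), h, c, W)
assert np.all(np.abs(h) <= 1.0)             # sigmoid * tanh keeps h bounded
```

The squashing functions are what keep the updates sane; the gates are what make the memory update a function of the input.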
This solved the training stability bottleneck, and introduced a new one: compression. The entire sequence got compressed to a single hidden state, which limited how much complexity could be captured.
Bahdanau et al. addressed that with the idea of attention in 2014. The hidden state gets updated in the encoder with each new word, so why not keep all the hidden states around? Then, have a small network score which hidden states are relevant to the current decoder state, and make a new contextualized input to the decoder that is a weighted sum of the encoder states. This was called “attention” as it allowed the model to put different amounts of focus on different parts of the input sequence.
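A minimal NumPy sketch of that scoring-and-weighting idea (the variable names and the tiny random scoring network are mine, loosely following the Bahdanau-style additive form):

```python
import numpy as np

def additive_attention(decoder_state, encoder_states, W_dec, W_enc, v):
    """Score every encoder hidden state against the current decoder
    state, softmax the scores, and return the weighted sum: the
    contextualized input to the decoder."""
    scores = np.array([v @ np.tanh(W_dec @ decoder_state + W_enc @ h)
                       for h in encoder_states])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # softmax over the sequence
    context = weights @ encoder_states      # weighted sum of hidden states
    return context, weights

rng = np.random.default_rng(0)
seq_len, d = 5, 8
encoder_states = rng.standard_normal((seq_len, d))  # one state per input word
decoder_state = rng.standard_normal(d)
W_dec = rng.standard_normal((d, d)) * 0.1
W_enc = rng.standard_normal((d, d)) * 0.1
v = rng.standard_normal(d) * 0.1
context, weights = additive_attention(decoder_state, encoder_states, W_dec, W_enc, v)
assert np.isclose(weights.sum(), 1.0)       # one focus distribution per step
```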
The new bottleneck though was throughput: to generate hidden state n, you first needed hidden state n-1. That made it hard to parallelize, which made it hard to take advantage of emerging accelerators. Luong et al first showed that you could simplify the state scoring to make it more hardware friendly, then Attention Is All You Need in 2017 stripped away the recurrent part entirely. In the Transformer architecture they got rid of the RNN and hidden state, replacing it with another version of the attention mechanism: self-attention.
Rather than a stack of hidden states that progressively encode the state of the sequence, each incoming word is transformed at once into a contextualized representation that carries information about it and its surroundings. This was really parallelizable; you don’t need to care about previous time steps to make decisions, so you can scale the computation on GPUs and other accelerators.
In regular attention you can think of the current decoder3 state as a query, and the various encoder hidden states as keys: the scoring function would generate a value for each pair of key and query. In self-attention, all the tokens were projected through key and query networks, and the query for each token was compared to the key of all the others. The transformer also added a value projection: in the older attention the “key” from the hidden state was both “what makes a good match” and “what information the token provides”, but in the transformer the two were decoupled.
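A single-head sketch of that in NumPy (random projections, no masking or multi-head machinery):

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention: every token is projected into a
    query, a key, and a value; each query is scored against all keys
    and the values are mixed by the resulting weights."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (n, n) pairwise comparisons
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                         # contextualized tokens

rng = np.random.default_rng(0)
n, d = 6, 16
X = rng.standard_normal((n, d))                # n tokens, all processed at once
W_q, W_k, W_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
assert out.shape == (n, d)                     # one new vector per token
```

Nothing in the function depends on processing tokens in order, which is exactly what makes it parallelizable.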
The new bottleneck that emerged was performance, particularly during inference. Comparing everything to everything else is an O(n²) operation. During training you can ameliorate some of that through batching, but you’re directly exposed in inference. And, unlike an RNN, increasing the sequence length (aka context length) gives you a quadratic increase in time, not linear.
There were various attempts at addressing this one too. In “Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention” back in 2020, Katharopoulos et al showed that the quadratic aspect of self-attention comes from having to materialize a big matrix to calculate the softmax for scoring. If you replace the softmax with a map-type function you can chunk the computation and get linear time performance. This was mathematically elegant, but didn’t actually work very well, so more engineering-oriented approaches like KV caching and FlashAttention were the mainstay for tackling the bottleneck.
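The core of the trick is just matrix-multiplication associativity; a NumPy sketch (using the ELU+1-style positive feature map from the paper, and ignoring normalization and causal masking):

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1: a positive feature map standing in for the softmax
    return np.where(x > 0, x + 1.0, np.exp(x))

rng = np.random.default_rng(0)
n, d = 128, 16
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
phi_q, phi_k = feature_map(Q), feature_map(K)

# Quadratic route: materialize the full (n, n) similarity matrix.
quadratic = (phi_q @ phi_k.T) @ V

# Linear route: associativity lets you build a (d, d) summary first,
# so cost grows linearly in sequence length n instead of quadratically.
linear = phi_q @ (phi_k.T @ V)

assert np.allclose(quadratic, linear)   # same result, different cost
```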
So why talk about this now? Because of Moonshot AI, and their excellent Kimi models. Moonshot are perhaps the frontier-est of the Chinese tiger labs, and their recent model releases have included Kimi Linear: An Expressive, Efficient Attention Architecture.
The architecture mixes regular, self-attention layers with Kimi Delta Attention. And Kimi Delta Attention is just the latest in a thread of evolution which goes back (sorta!) to RNNs.
State space models
For a long time, folks modelled control systems using state-space models. These return both an output and a state, and have a linear update function. RNNs such as LSTMs weren’t strictly state-space models in part because of their use of non-linearities: when updating the memory LSTMs used a tanh activation, for example. If you hand-wave a bit and ignore that, you’re looking at a very similar process.
But there is a gap between hand-waving and science, and luckily someone crossed it. The benefit of that activation function was that it squashed the state into a known range and avoided the vanishing gradient issue that plagued RNNs. The key realization was that you can drop the non-linearity entirely4 as long as the weight matrix that multiplies the hidden state is well behaved (specifically, has eigenvalues close to, but less than, one).
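A toy NumPy illustration of that stability condition (this is just a dense linear recurrence with the spectral radius pinned below one, not the structured matrices S4 actually uses):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
# A purely linear recurrence h_t = A @ h_{t-1} + B @ x_t: no tanh in the loop.
A = rng.standard_normal((d, d))
# Rescale A so its largest eigenvalue magnitude sits just below one: the
# state neither explodes nor vanishes instantly over long sequences.
A *= 0.99 / np.max(np.abs(np.linalg.eigvals(A)))
B = rng.standard_normal((d, 1))

h = np.zeros(d)
for _ in range(1000):
    x_t = rng.standard_normal(1)
    h = A @ h + B @ x_t

assert np.all(np.isfinite(h))   # stable despite 1000 purely linear steps
```

With eigenvalues above one the same loop blows up; well below one, the state forgets everything almost immediately.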
Much of this is in the HiPPO and S4 papers, from Albert Gu, Chris Ré and Tri Dao. This was another neat idea, which included a clever bit of linear algebra with a technique called Diagonal+Low Rank to make the state updates relatively efficient, but it didn’t perform as well as regular transformer models. Gu and Dao identified the challenge as those well-behaved weights that update the hidden state. Much like with RNNs prior to LSTMs, they were adding a fixed amount of information from the input to the state. In Mamba they reused the same kind of trick: adding a small network to gate the updates so the model can learn to remember more, or less, depending on the specific input5.
Then, in the Mamba 2 paper from 2024, Gu and Dao brought everything together. They showed that the 2020-style linear attention, with a decay mask, was the same as a structured state-space model like Mamba 1. That meant they could apply the same chunking tricks as in linear attention and get much better scaling and training, but with the ability to handle long sequences that Mamba had.
The slow recreation of LSTM features in more scalable forms continued with Gated DeltaNet. The Mamba approach ‘faded’ old memories via a decay, but it couldn’t explicitly subtract information like the LSTM forget gate. Gated DeltaNet also calculated the difference (the delta) between the expected and actual state, allowing it to effectively edit the memory rather than just overwriting it6.
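The delta rule itself is small enough to sketch (NumPy, my own variable names, following the DeltaNet-style formulation as I understand it):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
S = rng.standard_normal((d, d)) * 0.1        # matrix-valued memory
k = rng.standard_normal(d)
k /= np.linalg.norm(k)                        # unit-norm key for the sketch
v = rng.standard_normal(d)                    # value the memory should return
beta = 1.0                                    # write strength (learned in the real model)

prediction = S @ k                            # what the memory currently returns for k
S = S + beta * np.outer(v - prediction, k)    # edit only the error, not the whole slot

# With beta = 1 and a unit key, the memory now retrieves v for k exactly.
assert np.allclose(S @ k, v)
```

Writing `v - prediction` rather than `v` is the difference between editing a memory and overwriting it.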
Kimi Linear sped this up, and improved the fading mechanism to be per-dimension rather than a single rate across the memory:
“Crucially, KDA parameterizes its transition dynamics with a specialized variant of the Diagonal-Plus-Low-Rank (DPLR) matrices [30, 71], enabling a bespoke chunkwise-parallel algorithm that substantially reduces computation relative to general DPLR formulations while remaining consistent with the classical delta rule. Kimi Linear interleaves KDA with periodic full attention layers in a uniform 3:1 ratio.”
They manage to kill two birds with one stone of linear algebra. The DPLR trick from S4 lets you take a diagonal vector for the update rate and apply it across the matrix product of a low-rank approximation for the state transition. Moonshot realized that you could replace the approximation with the K and V matrices directly, which is much more efficient, and that the diagonal could come from a vector of the same dimension, so you get per-channel forgetting.
Compression & Recall
It seems likely we will see more sophisticated mixing of different types of attention in models as labs continue improving architectures. We started with recurrent models as a natural expression of the problem, moved to transformers for scale, and have been slowly integrating the two expressions together. We are still just trying to predict the next word, but it turns out the best way to do it is to remember some things, forget most things, and accept that the map is not the territory.
Reading through the papers on this journey really highlighted how the field moves between compression and breadth of recall. Sometimes researchers get a bad rap from their engineering brethren for being disconnected from reality, but this chain of evolutions is a pragmatic one.
You want to get as much intelligence into the model as possible. That’s done by compressing the training data into efficient, useful and general representations, but finding those representations is hard! If you hit a limit in finding them, then one approach is to simply add more knowledge: add more parameters, consider more training data, and build more of the imperfect representations to give you more options to choose from.
MoEs, synthetic data, and various other aspects of modern model training are playing with this same trade-off: represent better or represent more. After his recent Hot Chips talk, Noam Shazeer was asked how we can find more efficient ways of encoding knowledge into parameters, closer to how the brain does it. He responded first by asking the questioner: “why are you limited on parameters?”
The idea dates back to Jeff Elman, I think, who showed that training a network on this objective caused the network to learn grammar categories and other features of English. ↩︎
This kind of thing is even hard for humans at sufficient lengths of text: there is a version of War & Peace in English that is largely the original (translated, natch), but normalizes all the character names as they were such a common point of confusion ↩︎
In the original paper they kept the same encoder/decoder setup as with earlier models, as it’s eminently sensible for translation tasks. The GPT models and others demonstrated you could go decoder-only effectively. What we tend to call “prefill” these days is effectively a (causal) encoder step within the decoder model that contextualizes the input, then the “decoder” is the autoregressive generation process after. ↩︎
There actually still is non-linearity, as you need it for neural networks in general, but rather than sitting in the in-loop memory update it happens in projection MLPs after the layer. Then in Mamba it moved into the gating, so it’s only dependent on the input, not the h_{t-1} state! ↩︎
And it was Orvieto and the DeepMind folks that showed that you can get the same results in an RNN without the non-linearities if you can set up the matrix right. ↩︎
Part of the reason was recall, which Jamba addressed. Because the RNN approach is inherently compression-based, it was harder to just cut and paste sections of the context when they were relevant. Jamba mixed regular attention layers with Mamba layers, giving back the global context while still providing better scaling. The specific recall problem is really emphasized by the fact that one of the standard long context evals is the “needle in a haystack” task, where a relevant fact is hidden in a long doc and needs to be pulled out. ↩︎
I don’t think even the most perceptive forecaster would have identified a 90s LucasArts video format as a flashpoint for a discussion of the state of open source security. We live in an age of generative AI agents rampaging through OSS though, and that seems to be what has happened.
Open source is one of the great triumphs in loose, global coordination. In most meaningful ways, proprietary software… lost. The scale and effectiveness of open source projects consistently outstripped closed source components, across the stack, leaving proprietary software mainly existing at the application level.
This also had the effect of shifting open source from being in contrast to corporate, top-down development of proprietary software to being deeply intertwined with it. The expectations and requirements intermingled volunteer-ish communities and profit-seeking businesses, leading to tension in several areas, including security.
Luckily, the megacorps, in their loving grace, funded things like Google’s Project Zero to provide the type of security investment that needs corporate-scale backing.
The flow for things like Project Zero looks like:
Investigate popular projects and find real security risks before the bad guys do
Share a report with the project, and give them time to fix it before disclosing it
If the project doesn’t fix it in a certain time, disclose it so that folks can work around the issue rather than being vulnerable to it.
That’s their mission: “make the discovery and exploitation of security vulnerabilities more difficult, and to significantly improve the safety and security of the Internet for everyone.”
Inherently, that’s a pretty good idea as the incentive for various bad actors is:
Investigate popular projects and find a real security risk
Tell no one
Use it (or sell it to the national intelligence agency of choice)
“Not long ago, maintaining an open source project meant uploading a tarball from your local machine to a website. Today, expectations are very different”
Today’s expectations include complex distribution infra, signed packages, deterministic builds, CI coverage across many types of hardware, and resilience against security concerns. These expectations aren’t unfounded: the PyPitfall paper ([2507.18075] PyPitfall: Dependency Chaos and Software Supply Chain Vulnerabilities in Python), released earlier this year, took an extensive look into one particular community:
“By analyzing the dependency metadata of 378,573 PyPI packages, we quantified the extent to which packages rely on versions with known vulnerabilities. Our study reveals that 4,655 packages have guaranteed dependencies on known vulnerabilities, and 141,044 packages allow for the use of vulnerable versions. Our findings underscore the need for enhanced security awareness in the Python software supply chain.”
As the world centralized around open source, some aspects of the infrastructure have scaled up, but the support and investment model really didn’t.
It’s very easy for the corporations building on OSS to treat it like an infinitely available good, especially when they don’t have to deal with the impact of their usage. Again, from the letter.
“Automated CI systems, large-scale dependency scanners, and ephemeral container builds, which are often operated by companies, place enormous strain on infrastructure. These commercial-scale workloads often run without caching, throttling, or even awareness of the strain they impose. The rise of Generative and Agentic AI is driving a further explosion of machine-driven, often wasteful automated usage, compounding the existing challenges.”
Because this code ends up in production for some very large products, maintainers end up as unpaid on-call. Folks with good intentions want to keep a library in healthy shape, and feel the pressure of knowing that perhaps millions of people are (indirectly) depending on it. Then we mixed in AI.
The Big Sleep
The FFmpeg project is at the center of a storm right now about the demands from security research teams:
One of the big challenges is the security "research" community treats volunteer projects like vendors and gives them deadlines for release and sends no patches. https://t.co/PT0DZn8Nbz
Google have spent billions of dollars training Gemini, and a hefty chunk more on a project called Big Sleep: an agent to do the security research work at scale. That tool is exactly what the FFmpeg developers are reacting to, with issues like this use-after-free write in SANM process_ftch [440183164]
The vulnerability is in a codec for the LucasArts SMUSH format, which was used in games like Grim Fandango: a security risk targeting a very narrow group of people in their 40s. In a world of human researchers, I suspect that neither attacker nor researcher would have spent much time on that codec.
For an AI agent, it’s feasible to scale up the search if you have the compute and model resources, which Google do. So now that (very real!) vulnerability is documented1. That also scales up the demands on maintainers, who don’t have the equivalent billions to do research into generative AI security patch systems.
Security has always been asymmetric, in that it’s easier to break than to build. Scaling up discovery tips the scales even further. The bulls are in the bazaar, finding vulnerabilities in code for rendering 1995 Rebel Assault 2 cutscenes, and the maintainers just want someone to help clean up after them. Global-scale coordination on global-scale problems remains hard.
Serious scientists use FP64 – 64 bit floating point numbers – for high precision simulations, but in the world of machine learning we got by for the longest time with FP32. The perennial quest for increased FLOPS, particularly when memory bound, made even that seem too expensive though.
FP16 offered a reduced numeric range, but at half the size. Training with it in practice meant embracing autoscaling1, which ensured the values stayed within the range FP16 could represent. Then, Google developed BF16: it moved some of the bits from the mantissa to the exponent, so it offered the same numeric range as FP32, but with reduced precision.
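You can see the trade the two 16-bit formats make with a little NumPy; bfloat16 is simulated here by truncating a float32 to its top 16 bits (real conversions typically round-to-nearest, but the precision gap is the same):

```python
import numpy as np

def to_bfloat16(x):
    # Keep sign, the 8 exponent bits, and the top 7 mantissa bits.
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

x = np.float32(1.0009765625)        # 1 + 2**-10: the last FP16 mantissa bit

fp16 = np.float32(np.float16(x))    # FP16: 10 mantissa bits, kept exactly
bf16 = to_bfloat16(x)               # BF16: 7 mantissa bits, rounded away
assert float(fp16) == float(x)
assert float(bf16) == 1.0

# The flip side: BF16 shares FP32's exponent range, while FP16 overflows.
big = np.float32(1e30)
assert np.isinf(np.float16(big))
assert np.isfinite(float(to_bfloat16(big)))
```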
Since TPUv3 back in 2018 and Ampere in 2020 it’s been finding its way into hardware and has become the go-to format for training many models. Life was good, and training in FP16 was mainly discussed as a memory of hard winters past. Then a recent paper argued that, at least for reinforcement learning, that comfort was misplaced:
“In this work, we take a step back from the complex algorithmic fixes and investigate the root cause of the numerical mismatch: floating-point precision. We identify that the modern standard for mixed-precision training, BFloat16 (BF16), is the primary culprit. While BF16 has a wide dynamic range which is excellent for stable pre-training, its low precision makes it highly susceptible to rounding errors that accumulate and eventually cause the training and inference policies to diverge.”
The process for RL generally looks like:
Get a problem in a prompt
Do inference on the model to generate complete responses (a rollout)
Get a reward score for the response(s)
Run a training loop on the model to update weights based on the reward
If you want to be on-policy (which generally trains better) you need the “model” in steps 2 and 4 to be identical, but the actual code running around the model in the two steps is different: for example, you don’t use a KV cache in training and you don’t store gradients in inference. But you do want to keep the weights and numerics of the model the same, else your on-policy training becomes a little bit off-policy.
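A toy sketch of that “a little bit off-policy” point (the injected noise below is purely a stand-in for kernel and precision differences between the two engines):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
logits = rng.standard_normal(32)

# "Training engine" token probabilities.
p_train = softmax(logits)
# "Inference engine": same logits plus a tiny numeric perturbation,
# standing in for the implementation differences between the two stacks.
p_infer = softmax(logits + rng.normal(0.0, 1e-3, size=32))

token = int(np.argmax(p_infer))          # the token the rollout sampled
ratio = p_train[token] / p_infer[token]  # importance weight: 1.0 only if truly on-policy
assert np.isfinite(ratio) and ratio > 0
```

When the two engines agree bit-for-bit that ratio is exactly one everywhere; any drift silently reweights the update.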
The last year of LLM research has been about scaling this up, which requires managing the training and inference flows efficiently. The ongoing pressure to optimize the two paths independently creates a risk of divergence. The paper finds that this absolutely happens, and that the divergence collapses the effectiveness of the learning. Unless, that is, you use FP16:
“This is precisely why switching to FP16 provides a fundamental solution. With its 10 mantissa bits, FP16 offers 8 times more precision (2¹⁰ values vs. 2⁷ values) than BF16. This higher fidelity means that the outputs of the training and inference engines are much more likely to be numerically identical. The increased precision creates a buffer that absorbs the minor implementation differences between the two engines, preventing rounding errors from accumulating and causing a policy divergence”
The paper does an excellent job of breaking down the many reasons why this happens, but it’s pretty clear that FP16 is a patch: if you can’t get your numerics perfectly matched, then having more precision gives you more wiggle room.
They identify a range of concerns, including straight up bugs:
“According to this GitHub issue, we set disable_cascade_attn=True when initializing the vLLM engine and found that it significantly helps reduce the training-inference mismatch in experiments conducted on A100 GPUs.”
@danielhanchen, glad you liked the post! You're spot on to suspect lower-level implementation issues. That's exactly what we found in the original blog. The disable_cascade_attn finding (Sec 4.2.4) was the symptom, but the root cause was that silent FlashAttention-2 kernel bug… https://t.co/AcvRbkcy5M pic.twitter.com/6Pq8fTRGNQ
Many of the experiments in the FP16 vs BF16 paper were run on A100s2, so some backlash emerged suggesting that perhaps this whole thing is just a kernel error. But as ByteDance showed, there really is a lot going on that can make things worse.
“As mentioned above, one common explanation for why kernels add numbers in different orders is the “concurrency + floating point” hypothesis. The hypothesis states that if the order in which concurrent threads finish is nondeterministic and the accumulation order depends on the order in which concurrent threads finish (such as with an atomic add), our accumulation order will be nondeterministic as well.”
Horace calls out variance in batching as the primary cause of non-determinism, and hence another quite plausible cause of inference/training mismatch:
“In other words, the primary reason nearly all LLM inference endpoints are nondeterministic is that the load (and thus batch-size) nondeterministically varies! This nondeterminism is not unique to GPUs — LLM inference endpoints served from CPUs or TPUs will also have this source of nondeterminism.”
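Both halves of that are easy to demonstrate in NumPy; the chunked reduction below is a stand-in for batch-size-dependent kernel tiling:

```python
import numpy as np

# Floating point addition is not associative: the same three numbers,
# summed in a different order, give different answers.
a, b, c = np.float32(1e8), np.float32(-1e8), np.float32(1.0)
assert (a + b) + c == np.float32(1.0)   # exact cancellation happens first
assert a + (b + c) == np.float32(0.0)   # the 1.0 vanishes next to 1e8

# A reduction split into chunks accumulates in a different order than one
# big sum, so a batch-size change alone can shift the result slightly.
rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float32)
whole = x.sum(dtype=np.float32)
chunked = np.float32(0.0)
for chunk in x.reshape(8, 512):         # a different tiling of the same data
    chunked += chunk.sum(dtype=np.float32)
# `whole` and `chunked` are both valid float32 sums of the same numbers,
# yet need not be bit-identical.
```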
The meta-point is that despite being a field fundamentally based in mathematical precision we have been sloppy with numerics, pretty much everywhere.
Ed Yang’s session in the PyTorch Conference keynote3 a couple of weeks back called this problem out from the perspective of scaling up ML infrastructure. He presented a number of solutions to try and address it, which often come down to giving folks control over precisely how the numerics work in different parts of their model.
While the focus here was on RL and FP16, the reality is we deal with this for training->inference in much simpler cases, as well as when moving models between different hardware. Even within generations this can be hard: one of the fun infra problems when the H100 came out was everyone discovering that the FP8 tensor cores in the Hopper used a 22-bit accumulator for intermediate calculations, which wasn’t really documented!
The balance between speed and accuracy is often effectively made empirically: if something is faster, and works, then at some level it’s right! Reinforcement Learning mixes together different evolutionary chains of optimizations, so maybe those serious scientists with their FP64 were onto something. Not because they absolutely needed the precision, but because they needed to know they had the precision.
We’re probably not going to switch industry wide back to FP164, but getting a better numerical grounding into the tools we use is going to make everyone’s lives easier, eventually!
Lots of announcements around the Triton and PyTorch Conferences this week, including the 1.0 of Helion, a high-level kernel authoring DSL:
It establishes a new layer of abstraction that bridges the user-friendly simplicity of PyTorch with the performance of a lower level language. By automating tedious and error-prone tasks like tensor indexing, memory management, and hardware-specific tuning, Helion empowers developers to focus on algorithmic logic rather than hardware-specific implementation details. Helion achieves this balance by pairing a familiar, PyTorch-centric syntax with a powerful autotuning engine that automates the complex search for optimal kernel configurations. This results in a system that delivers performance portability across hardware architectures while drastically reducing development effort.
There has been a bit of an explosion in kernel-authoring options recently with CuTe-DSL and CuTile from Nvidia, TileLang (as featured in recent DeepSeek releases), Gluon and TLX1 as well as evolutions to core Triton, Thunderkittens, Pallas, and others.
There are a couple of different axes of progress occurring in GPU authoring. The first is between iterative, researcher-friendly declarative code and tightly written hardware-friendly imperative code.
It’s a classic developer-experience trade-off: you let people tell you what they want to do (matmul these things then apply a softmax) or you let people tell you precisely how to do it (run this dot product on these SMs then aggregate the result).
In general you want to stay as high-level as possible, particularly if you are experimenting with lots of different variants in a research type setting, but you may have a bound on the performance hit you can accept. A common example is you want to iterate on some attention variant, but don’t want to completely give up on the performance wins of Flash Attention.2
Triton and others provided an interesting middle ground: it was easy enough to iterate with thanks to being embedded in Python, and performant enough as it leveraged a compiler to automatically apply some optimizations. You are still much more imperative than in a PyTorch program, but you work at a higher level of abstraction: rather than writing programs which own a thread of data, as in CUDA, you think about a tile of data. The ThunderKittens docs put this well:
A GPU is not really a 1000×1000 matrix multiply machine (even if it is often used as such); it’s a manycore processor where each core can efficiently run ~16×16 matrix multiplies. Consequently, ThunderKittens is built around manipulating tiles of data no smaller than 16×16 values.
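A NumPy sketch of that mental model, with each small inner matmul standing in for one tile-level mma (purely illustrative; no real kernel is written this way):

```python
import numpy as np

def tiled_matmul(A, B, tile=16):
    """Compute A @ B one (tile x tile) block at a time: the unit of work
    a tensor-core-era kernel actually thinks in."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2 and m % tile == 0 and n % tile == 0 and k % tile == 0
    C = np.zeros((m, n), dtype=A.dtype)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            acc = np.zeros((tile, tile), dtype=A.dtype)
            for p in range(0, k, tile):
                # One tile-level "mma": a small dense matmul per step.
                acc += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
            C[i:i+tile, j:j+tile] = acc
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((32, 48))
B = rng.standard_normal((48, 64))
assert np.allclose(tiled_matmul(A, B), A @ B)
```

On a real GPU each `acc` tile lives in registers and the loads into the loop are the part worth engineering; the tile, not the scalar, is the unit of thought.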
The next abstraction that frameworks developed was how to represent data across the memory hierarchy. To take advantage of the tensor cores you have to have data laid out in a specific way in registers. But you are better off loading data in a different order in global or shared memory. CuTe offered a big benefit by giving you types to represent layouts that could be composed, making it easier to keep track of the data movement required. Triton and others leaned on the compiler to infer the right layouts and offered higher-level APIs to copy data between stages.
This started to get challenging on Hopper, thanks to TMA3 and the limitations of memory bandwidth, which gets to the second evolution happening in GPU kernels: how do you orchestrate the movement of data between memory layers while keeping the tensor cores saturated? This involved techniques like warp specialization, where individual warps do different operations towards a shared goal. That means carefully allocating ownership over registers to avoid warps stepping on each other. Blackwell4 made this even trickier with the addition of TMEM, 2-CTA mode and other features that offered more performance but required even more careful orchestration.
In compiler terms this is a scheduling problem, and in general the industry is quite good at it! CPUs give compilers a lot of leeway to schedule operations efficiently because they have a great deal of support for out-of-order execution, well-documented ops, and substantial caches. GPUs process groups of threads5 in lockstep and demand strict timing about when to insert barriers, issue async operations, and so on.
A GPU scheduler has to tag operations to specific warp-slots in advance, assign register counts to them to avoid conflicts, and sync them with barriers. It’s a lot more brittle: if we guess wrong, we can idle the tensor cores and tank efficiency. The actual execution model is a bit of a black box too: the compilers’ target (PTX) is itself further compiled to SASS by ptxas.
Across the industry we’ve been exploring ways to be more explicit without giving away all of the operational and developer efficiency gains of higher-level languages. CuTe-DSL offers a very close-to-hardware model in a Pythonic package6; Gluon (OpenAI) and TLX (Meta) add extensions that allow modelling pipelines in code without abandoning the Triton compiler; TileLang builds on TVM with explicit pipeline declarations.
One of the reasons for this variety is we don’t quite know how to express a warp-group pipelined execution model. For example, TileLang has a pipelined construct:
for k in T.Pipelined(loop_range, num_stages=num_stages):
    MMA0(K, Q_shared, K_shared, acc_s, k, bx, by, bz)  # Q @ K^T
    Softmax(acc_s, acc_s_cast, scores_max, scores_max_prev, scores_scale, scores_sum, logsum)
    Rescale(acc_o, scores_scale)  # Apply correction
    MMA1(V, V_shared, acc_s_cast, acc_o, k, by, bz)  # P @ V
Gluon has a descriptor that allocates resources like registers explicitly to warps:
And TLX tags sections of code with contexts to indicate groupings, and also allocates resources:
with tlx.async_task(num_warps=NUM_MMA_WARPS // NUM_MMA_GROUPS,
                    registers=232,
                    replicate=NUM_MMA_GROUPS):
    ...
They can all work, and finding the best trade-off is a good goal, but in all cases they force a lot of decisions. As an example, that allocation of how many registers to use is not only operation-dependent, it’s hardware-dependent, and that makes portability between hardware (even different generations from the same vendor) expensive. Manual controls are necessary: it takes time to develop the compiler passes and heuristics to optimally divide work, so handing explicit control over7 is beneficial, particularly when serving at scale. The cost is complexity and portability. This is where Helion takes a different tack.
Anyway, so what about Helion?
Helion instead takes a point on the line above Triton, but below the ML frameworks. It focuses on expressing just what you want to happen, from the tile perspective.
for tile_m, tile_n in hl.tile([m, n]):
    acc = hl.zeros([tile_m, tile_n], dtype=torch.float32)
    for tile_k in hl.tile(k):
        acc = torch.addmm(acc, x[tile_m, tile_k], y[tile_k, tile_n])
    out[tile_m, tile_n] = acc
Under the hood, this compiles down to Triton. You might think this would be a bit of a no-op on performance, but in practical terms it’s often better. The reason is search: Helion can autotune across a wide number of parameters, then let you bake them into your kernel once you’ve identified good ones for your specific setup. The example in the blog post shows how many dimensions of search need to occur:
This makes moving to different hardware as simple as redoing the search process, and offers a much more comprehensive exploration than most folks would do when hand-rolling a lower-level kernel. It’s a very interesting idea, and I’m glad to see more people kicking the tires!
Low-level optimizations aren’t going away any time soon, but I’m glad to have more exploration in the kernel development space. Finding the right abstractions and right compiler approaches to keep scaling kernel development will help make it accessible to more and more people and ensure that we can evolve our kernels with the hardware.
This is the logic behind FlexAttention, which was one of the lights that guided the way towards Helion. ↩︎
Fully async copies – a separate execution engine to move data ↩︎
Well, datacenter Blackwell. Consumer Blackwell lacks TMEM and 2-CTA, so has a rather more Hopper-like programming model. I’m not sure yet what the DGX Sparks have! ↩︎
Warps – 32 threads on Nvidia – or waves – 64 threads on AMD. The important distinction is that all these threads are doing the same thing: you can mask some of them out, but they make a fairly simple march through the instructions. ↩︎
GPT-4o’s image generation was a remarkable event, beyond the brief Ghiblification of all social media. GPT-4o offered significantly more steerability than earlier image generation models, while offering image quality in the ballpark of the best diffusion models. Qwen-Image gives a similar level of fidelity and accuracy, and is an open-weights model with a pretty decent technical report: QwenLM/Qwen-Image.
While I was fairly familiar with diffusion models, I wasn’t really familiar with the backbone of this model, the multimodal diffusion transformer (MMDiT). Rather than just look at it, I vibed up a repo with Claude Code that went step by step through the architectures, training on good old MNIST. ianbarber/diffusion-edu — which spat out this:
This ended up being a helpful way to go step by step through the evolution of diffusion models.
Loss/Target
Modern image generation really kicked off with GANs. GANs were a clever idea that exploited the fact that we are better at building classifiers than generators, by using one to bootstrap the other. A generator would generate an image against a reference, the discriminator would be given the generated image and the reference and have to predict which was the real one, and both networks were scored on how well they did at their tasks. This was effective, but challenging to train. The generator also had to start from somewhere, and what it effectively started from was noise: the generator would produce fairly random output and the discriminator would learn to identify noise vs the real image.
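That two-player setup can be sketched in a few lines (a toy sketch with hypothetical helper names; the real systems were convolutional and much larger):

```python
import torch
import torch.nn.functional as F

def gan_losses(G, D, real, z):
    """One round of the GAN game: discriminator D pushes real images
    toward 1 and generated ones toward 0; generator G is scored on
    how well its output fools D."""
    fake = G(z)
    ones = torch.ones(real.shape[0], 1)
    zeros = torch.zeros(real.shape[0], 1)
    # D's loss: classify real vs fake (detach so this doesn't update G)
    d_loss = (F.binary_cross_entropy_with_logits(D(real), ones)
              + F.binary_cross_entropy_with_logits(D(fake.detach()), zeros))
    # G's loss: make D believe the fake is real
    g_loss = F.binary_cross_entropy_with_logits(D(fake), ones)
    return d_loss, g_loss
```

In practice the two losses are minimized in alternation, which is exactly where the training instability crept in.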
The clever idea Jonathan Ho and co had with DDPM was to focus on that noise: what if, instead of learning to generate images, we learned to remove noise from images? In the snippet below we:
Pick a timestep between 0 and 1000
Generate some noise
Add an amount of noise to the training image proportional to the timestep
Get the model to predict the noise, given the time step
Calculate the loss as the mean squared error between the known noise and the predicted noise
# Sample a random timestep
t = torch.randint(0, 1000, (B,), device=device)
# Add noise to the image
eps = torch.randn_like(x0)
alpha_t = self.alpha_schedule(t)
xt = torch.sqrt(alpha_t) * x0 + torch.sqrt(1 - alpha_t) * eps
# Predict the noise we just added
eps_pred = self.model(xt, t, cond)
return F.mse_loss(eps_pred, eps)
This pretty much worked! You needed to use quite a few timesteps (around 1000), but the model would learn to separate noise from data. Then you can reverse the process to generate: start from a noisy point, predict the noise at the current timestep, remove some of it, decrement the timestep, then repeat, each time injecting a little fresh noise and removing the prediction.
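That reverse loop can be sketched as follows (a simplified version, assuming an eps-predicting model and a cumulative alpha schedule as in the training snippet, with the common variance choice sigma_t² = beta_t; `ddpm_sample` is an illustrative name, not the repo’s code):

```python
import torch

def ddpm_sample(model, alpha_bar, shape, cond=None):
    """Ancestral sampling sketch: walk t from T-1 down to 0, each step
    removing a little predicted noise and (except at t=0) re-injecting
    fresh noise, as the DDPM posterior prescribes."""
    T = len(alpha_bar)
    x = torch.randn(shape)                       # start from pure noise
    for i in reversed(range(T)):
        t = torch.full((shape[0],), i, dtype=torch.long)
        a_bar = alpha_bar[i]
        a_bar_prev = alpha_bar[i - 1] if i > 0 else torch.tensor(1.0)
        alpha = a_bar / a_bar_prev               # per-step signal retention
        beta = 1 - alpha
        eps = model(x, t, cond)                  # predicted noise
        # Posterior mean: remove one step's worth of the predicted noise
        mean = (x - beta / torch.sqrt(1 - a_bar) * eps) / torch.sqrt(alpha)
        x = mean if i == 0 else mean + torch.sqrt(beta) * torch.randn_like(x)
    return x
```

Note `alpha_bar` must decrease with the timestep index (clean at t=0, pure noise at t=T-1).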
Song et al. followed this up with DDIM, identifying that one of the reasons you need so many steps is that you are injecting new noise at each one. If you fix the noise up front when sampling you have a much more deterministic process, and can generate in more like 50 steps than 1000:
x = torch.randn(*x_shape)  # Start with pure noise
for i in reversed(range(steps)):
    t = torch.full((B,), i / steps)
    if target == TargetMode.EPS:
        eps = model(x, t, cond)
        alpha_t = self.alpha_schedule(t)
        alpha_prev = self.alpha_schedule(t - 1 / steps)
        # Deterministic DDIM step: estimate x0, then re-noise to the previous level
        x0_pred = (x - torch.sqrt(1 - alpha_t) * eps) / torch.sqrt(alpha_t)
        x = torch.sqrt(alpha_prev) * x0_pred + torch.sqrt(1 - alpha_prev) * eps
The next step, in 2021, was Classifier-Free Guidance from Ho and Salimans. The clever idea was to pass a conditioning variable through to the model: in our MNIST example it could be the digit label we are learning from. However, during training we sometimes zero it out. This means the model learns to generate both conditionally (for the specific digit) and unconditionally (just in whichever direction looks best).
if cond is not None and self.cfg_dropout_prob > 0:
    mask = torch.rand(B, 1, 1) < self.cfg_dropout_prob
    cond = cond * ~mask  # Zero out conditioning randomly
return F.mse_loss(self.model(xt, t, cond), target)
This gets useful at generation time. When sampling, we can sample both conditionally and unconditionally, and diff out the unconditioned part:
# Run model twice: with and without conditioning
cond_pred = model(x, t, cond)
uncond_pred = model(x, t, None)
# Amplify the difference
return uncond_pred + cfg_scale * (cond_pred - uncond_pred)
If you imagine the sampling process as denoising, this is saying there is the “best” direction given the condition, and the “best” direction overall. By reducing the influence of the overall best direction, we get clearer steerability, and effectively the model serves as its own iterative classifier.
Also in 2021, Song et al. published Score-Based Generative Modeling through Stochastic Differential Equations. They framed the diffusion problem as a Stochastic Differential Equation (SDE): effectively a regular differential equation dx = f(x, t)dt with an additional noise term, dx = f(x, t)dt + g(t)dw1. The g(t) term controls how much random noise is injected.
The contribution of the paper is that they worked out how to reframe this without the dw noise term – i.e. they turned it into an “Ordinary” Differential Equation (ODE) with no random component. The model can then be viewed as a velocity field that ends up having a similar shape to the one modelled by the random-noise version, but is deterministic.
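Schematically, the difference is one integration step. This is a sketch with toy callables (not the paper’s solver); the probability-flow drift f − ½g²·score is the paper’s result, with `score` standing in for the learned ∇ log p:

```python
import torch

def sde_step(x, t, dt, f, g):
    """Euler–Maruyama step of dx = f(x,t)dt + g(t)dw: drift plus fresh noise."""
    return x + f(x, t) * dt + g(t) * (dt ** 0.5) * torch.randn_like(x)

def ode_step(x, t, dt, f, g, score):
    """Probability-flow ODE step: dx = (f(x,t) - 0.5 * g(t)**2 * score(x,t)) dt.
    Same marginal distributions as the SDE, but fully deterministic."""
    return x + (f(x, t) - 0.5 * g(t) ** 2 * score(x, t)) * dt
```

With the noise term gone, the whole sampling trajectory becomes a deterministic path you can integrate with any ODE solver.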
Salimans &amp; Ho were not done, and proposed another improvement to the loss with v-parameterization, introduced in their progressive distillation work and used in Imagen. One of the challenges with predicting the noise (eps above) is that when you are close to a finished image there isn’t much noise left, so the prediction isn’t particularly informative. Similarly, when you start from pure noise the model is predicting almost the whole input, so it also doesn’t give much signal. Predicting the noise implicitly involves estimating both the clean sample and the noise. Some reordering lets you predict a single value, the velocity, which combines the clean sample, the noise and the timestep at the current (noised) sample. Having the model predict that balances between predicting the image and predicting the noise, giving better results at the extremes.
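In the notation of the training snippet above (where xt = sqrt(alpha_t)·x0 + sqrt(1−alpha_t)·eps), the v-target looks like this (a sketch; `v_target` is an illustrative name, not the repo’s exact code):

```python
import torch

def v_target(x0, eps, alpha_t):
    """v-parameterization target: v = sqrt(alpha_t)*eps - sqrt(1-alpha_t)*x0.
    Near a clean image (alpha_t ~ 1) this is ~eps; near pure noise
    (alpha_t ~ 0) it is ~-x0, so it stays informative at both extremes."""
    return torch.sqrt(alpha_t) * eps - torch.sqrt(1 - alpha_t) * x0
```

The loss is then the same MSE as before, just against v instead of eps.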
Finally (on the loss) we get to flow matching, from folks at Meta FAIR (Flow Matching) and UT Austin (Rectified Flow). Rather than making the target a blend of start and noise, why not just predict the straight path to the data? Compare the v_target below to the velocity described above:
t = torch.rand(B, 1, 1, 1)
z = torch.randn_like(x0)
# Straight line: xt = (1-t)*x0 + t*z
xt = (1 - t) * x0 + t * z
# Learn the velocity field pointing from noise to data
v_target = x0 - z # The straight path direction
v_pred = self.model(xt, t.squeeze(), cond)
return F.mse_loss(v_pred, v_target)
Flow matching models often converge faster during training and can generate good samples with fewer steps. They also tend to have more consistent quality across different sampling step counts.
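Sampling from a flow-matching model is then just Euler integration of the learned velocity field from noise (t = 1) back to data (t = 0). A sketch assuming the v_target convention above (`flow_sample` is an illustrative name, not the repo’s code):

```python
import torch

def flow_sample(model, shape, steps=50, cond=None):
    """Start at pure noise (t = 1) and take Euler steps along the
    predicted straight path x0 - z down to t = 0."""
    x = torch.randn(shape)
    dt = 1.0 / steps
    for i in range(steps, 0, -1):
        t = torch.full((shape[0],), i / steps)
        v = model(x, t, cond)   # predicts x0 - z
        x = x + dt * v          # step t -> t - dt
    return x
```

Because the target paths are straight, even a handful of coarse Euler steps lands close to the data, which is where the few-step generation quality comes from.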
Architecture
All of that evolution was about the loss function and sampling; we haven’t really discussed the model architecture itself. The original diffusion models used an approach called U-Nets: a series of convolutions that compressed the (latent) visual information into fewer dimensions, then expanded it back up (giving a sort of U shape). But post-ChatGPT the Transformer was ascendant, so in 2023 Peebles and Xie proposed swapping out the U-Net for a stack of transformer blocks in the Diffusion Transformers (DiT) paper.
class DiTTiny(nn.Module):
    def __init__(self, embed_dim=256, depth=6):
        super().__init__()
        # Patchify the image (like ViT)
        self.patch_embed = PatchEmbed(patch_size=2)
        # Stack of transformer blocks
        self.blocks = nn.ModuleList([
            TransformerBlock(embed_dim) for _ in range(depth)
        ])

    def forward(self, x, t, cond=None):
        # Convert image to patches
        x = self.patch_embed(x)  # (B, num_patches, embed_dim)
        # Add positional encoding
        x = x + self.pos_embed
        # Embed the timestep, then transform through attention layers
        t_emb = self.time_embed(t)
        for block in self.blocks:
            x = block(x, t_emb)
        # Reshape back to image
        return self.unpatchify(x)
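The t_emb fed to the blocks is typically a sinusoidal timestep embedding, the same trick as positional encodings. Here is a sketch (a hypothetical helper, not the repo’s exact code):

```python
import math
import torch

def timestep_embedding(t, dim):
    """Map scalar timesteps (B,) to (B, dim) sinusoidal features so the
    network can condition smoothly on how noisy its input is."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)
```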
This looks like a regular transformer, but with patches (segments of the image) rather than text tokens, as in ViT understanding models. The transformer block will also look pretty familiar:
class TransformerBlock(nn.Module):
    def __init__(self, dim, heads=8, mlp_ratio=4.0):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)), nn.GELU(), nn.Linear(int(dim * mlp_ratio), dim)
        )

    def forward(self, x):
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.ln2(x))
        return x
They got good results, and more importantly it was easier to scale up to more compute and larger inputs. For what it’s worth, I found DiTs a bit tricky to train on small datasets (like the MNIST example), but didn’t spend much time on it, since:
MMDiTs emerged in 2024, and were used for Stable Diffusion 3 and Flux, largely setting the standard in terms of image quality. The idea is to process images and text in parallel with the ability to attend across each other, reminiscent of cross-encoder models.
class MMDiTTiny(nn.Module):
    def __init__(self, img_dim=256, txt_dim=256, depth=6):
        super().__init__()
        # Separate encoders for each modality
        self.img_encoder = nn.Linear(patch_dim, img_dim)
        self.txt_encoder = nn.Linear(txt_dim, txt_dim)
        # Joint transformer blocks
        self.blocks = nn.ModuleList([
            CrossTransformerBlock(img_dim, txt_dim) for _ in range(depth)
        ])

    def forward(self, img, t, txt=None):
        # Process both modalities
        img_tokens = self.img_encoder(patchify(img))
        txt_tokens = self.txt_encoder(txt) if txt is not None else None
        # Bidirectional attention between modalities
        for block in self.blocks:
            img_tokens, txt_tokens = block(img_tokens, txt_tokens, t)
        return unpatchify(img_tokens)
MMDiT models demonstrate great prompt adherence and can handle complex requests. The bidirectional flow means text understanding improves alongside image generation.
class CrossTransformerBlock(nn.Module):
    """Cross-attention: query = image tokens, key/value = text tokens."""
    def __init__(self, dim_img, dim_txt, heads=8, mlp_ratio=4.0):
        super().__init__()
        self.q_proj = nn.Linear(dim_img, dim_img)
        self.k_proj = nn.Linear(dim_txt, dim_img)
        self.v_proj = nn.Linear(dim_txt, dim_img)
        self.attn = nn.MultiheadAttention(dim_img, heads, batch_first=True)
        self.ln_q = nn.LayerNorm(dim_img)
        self.ln = nn.LayerNorm(dim_img)
        self.mlp = nn.Sequential(
            nn.Linear(dim_img, int(dim_img * mlp_ratio)), nn.GELU(), nn.Linear(int(dim_img * mlp_ratio), dim_img)
        )

    def forward(self, x_img, x_txt):
        q = self.q_proj(self.ln_q(x_img))
        k = self.k_proj(x_txt)
        v = self.v_proj(x_txt)
        x = x_img + self.attn(q, k, v, need_weights=False)[0]
        x = x + self.mlp(self.ln(x))
        return x
Here, in the cross-attention block, the image supplies the query and the text supplies the key and value parts of the attention. The results are combined with the image input.
Putting this all together, you can see the evolution of the common diffusion baselines across both scale and steerability:
DDPM: Clean but slow. The baseline everything else improves on.
SD1-style (UNet + Epsilon + CFG): The first practical system. Good quality, reasonable speed, follows prompts well with CFG.
SD2-style (UNet + V-param + CFG): Slightly better contrast and stability, especially at high resolutions.
SD3-style (MMDiT + Flow): The current state-of-the-art. Fastest training, best prompt adherence, most efficient sampling.
Back to Qwen
The Qwen-Image model is a good, practical example of scaling this up. It uses an existing multimodal model2 to encode text and image inputs, a pretrained VAE3 to translate between pixel and latent space, and an MMDiT as its backbone. The use of strong (understanding) models for encoding really enhances the steerability of the results from the MMDiT.
In the MMDiT sketch above we just concatenate image and text tokens together. In real systems we first add the positional embeddings for the image tokens, then append the text tokens. This works, but makes it difficult to adapt to different image resolutions.
Seedream introduced Scaling RoPE4, which instead centers the image positional encoding in the middle of the image, treats the text tokens as 2D shapes [1, L], then applies 2D RoPE to both text and image tokens. This worked better, but had cases where positions were confusable between text and image latents, meaning the model couldn’t properly differentiate them. The Qwen team updated this by implementing positional encoding across both dimensions of the text tokens, and concatenating the text along the diagonal of the image:
This design allows MSRoPE to leverage resolution scaling advantages on the image side while maintaining functional equivalence to 1D-RoPE on the text side, thereby obviating the need to determine the optimal positional encoding for text.
The resolution independence is important for the training recipe. The model is progressively trained on images starting at 256×256 and increasing in steps up to 1328×1328, in a variety of aspect ratios. They follow that up with post-training consisting of SFT on curated, high-quality image-text pairs and DPO against preference pairs judged by human raters5. Finally, they do a GRPO stage with a “reward model”, though it isn’t clear if that’s based on the aforementioned preference data or some other secret sauce.
While we don’t know how GPT-image is trained, this recipe certainly gave some comparable results. I was surprised to learn that the combination of a strong text and image encoding model plus MMDiT6 gives this level of steerability and fidelity. As usual, it’s exciting to have open models and papers to bring these concepts together!
It’s w because the noise is a Wiener process, also known as standard Brownian motion. I am heavily conditioned to think of this as the motion in a cup of tea thanks to HHGTTG. ↩︎