Category: Modelling

  • LLMs are complicated now

    Back in 2022 and 2023 there were two big branches of machine learning happening at Meta1. The LLM work that led to Llama was a clean, smooth stack of repeated Transformer modules; the recommendation systems graphs were, by contrast, terrifying. Luckily, the industry has remedied that state of affairs by making LLMs a lot more complicated.

    Seb Raschka maintains an excellent gallery of model architectures. You can use it to diff two of the best open models of their respective eras, Llama 3 and Nemotron 3 Ultra.

    Attention might be all you need, but modern models certainly use a lot of different variants of it: query grouping, compressed, sparse, linear, sliding-window and more. Mixture-of-Experts added selective routing to feed-forward layers, and we have since started routing just about everything else too, from attention blocks to the residual stream. Vision and audio encoders have gone from bolted on to mixed-in, and models have scaled to run inference across multiple GPUs, which throws in comms ops that add extra boundaries in the middle of your model.

    This is not too different from what happened with recsys. The basic architecture of recommendation systems, for the best part of a decade, was a relatively straightforward two-tower sparse neural net. The complexity came from the tension between the need to continually increase capabilities and the need to stay efficient, particularly for inference.

    It’s tempting to assume that agents will Fix This: that you’ll hand your PyTorch or JAX definition to Claude Telenovela or whatever and have it generate optimally fused kernels2. To make that work you need a fixed, usable baseline to make sure that what is generated is… right.

    What happened with recsys was that the gap between performance being an optimization and performance being a necessity became very, very small. Conceptually you can keep a pure model definition that gives you a baseline; in practice, training and testing a model takes significant resources and performance improvements become load-bearing.

    If you want to swap attention variant A for variant B, you can afford for B to be ten percent slower. You probably can’t afford for it to be an order-of-magnitude worse. If A is fused and optimized, you need at least a partially fused and optimized version of B before you can even tell whether it’s worth exploring. The research iteration loop demands a different kind of flexibility than just “optimize this known quantity”. You can’t hand-fuse your way back without investing significant time that might not be worth it, and you can’t generate your way forward without a baseline to check. The only way out is to design for composability up front.

    One of my favorite kernel developments of the last few years was FlexAttention in PyTorch, which took a whole class of attention operations and allowed you to generate kernels for them, via Triton templates. It built on a huge body of work in attention kernels, and it was designed to be composable and verifiable up front: you can explore with only a very mild impact to performance.
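
    A minimal sketch of that composability idea, in plain Python rather than PyTorch. The predicate signature `(b, h, q_idx, kv_idx)` mirrors FlexAttention's mask convention; the helper names (`causal`, `sliding_window`, `both`, `reference_attention`) are hypothetical, and nothing here generates a kernel. It is only the slow, auditable baseline that a generated kernel could be checked against.

```python
import math

def causal(b, h, q_idx, kv_idx):
    # A query may only look at itself and earlier positions.
    return kv_idx <= q_idx

def sliding_window(width):
    # Factory: keep keys fewer than `width` positions away from the query.
    def mask(b, h, q_idx, kv_idx):
        return abs(q_idx - kv_idx) < width
    return mask

def both(*masks):
    # Compose predicates: a key is visible only if every mask allows it.
    def mask(b, h, q_idx, kv_idx):
        return all(m(b, h, q_idx, kv_idx) for m in masks)
    return mask

def reference_attention(q, k, v, mask):
    # Dense single-head attention over lists of vectors (hypothetical helper).
    # Assumes every query can see at least one key.
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, qi in enumerate(q):
        visible = [j for j in range(len(k)) if mask(0, 0, i, j)]
        scores = [scale * sum(a * c for a, c in zip(qi, k[j])) for j in visible]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[j][d] for w, j in zip(weights, visible)) / total
            for d in range(len(v[0]))
        ])
    return out

# Swapping attention variants is a one-line change to the mask.
local_causal = both(causal, sliding_window(2))
```

    Because the variant lives entirely in the mask, trying variant B against variant A means changing one line and re-running the same reference check.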

    Andrej Karpathy recently joined Anthropic, in part to develop richer auto-research-style loops at the frontier. As he has spent the last few years showing, though, being able to cut architectures to their essence and make them composable is as important as a clever agentic setup in climbing that kind of hill.

    1. And many smaller ones, shout outs to all my Content Understanding and integrity peeps ↩︎
    2. Like an automated Hazy Research ↩︎
  • FactWorld

    When we started building LLMs, we mostly focused on them knowing things. They had information encoded in their weights, and they could spit it out when given sufficient prompts. But an agent doesn’t just need to know things; it needs to combine several kinds of knowledge.

    A lot of that is still in the weights: facts that the model learned during training. But some knowledge is in the context window: tool results, documents, user instructions, intermediate observations, etc. And some knowledge is in the environment: a good agent should have a sense of the current state of the world. To be useful, an agent has to be able to combine these sources of knowledge appropriately.

    There are standard ways to test some of this. Associative recall benchmarks like MQAR ask whether a model can recover a value from a key in its context window. State tracking problems, like S5-style permutations, check whether a model can keep track of changes over time: the problems are a series of operations, and a model must identify the end state.
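
    To make the state-tracking side concrete, here is a small sketch of an S5-style word problem: a sequence of permutations of five items, with the end state as the label. The function names and the sampling scheme are assumptions for illustration, not taken from any particular benchmark implementation.

```python
from itertools import permutations
import random

def compose(p, q):
    # Apply p first, then q: item i maps to q[p[i]].
    return tuple(q[p[i]] for i in range(len(p)))

def final_state(ops, n=5):
    # Fold a sequence of permutations into the single end state.
    state = tuple(range(n))
    for op in ops:
        state = compose(state, op)
    return state

def sample_problem(length, n=5, seed=0):
    # A word problem: `length` random elements of S_n plus the oracle answer.
    rng = random.Random(seed)
    group = list(permutations(range(n)))
    ops = [rng.choice(group) for _ in range(length)]
    return ops, final_state(ops, n)

# Two illustrative operations: order matters when they are combined.
swap01 = (1, 0, 2, 3, 4)
cycle012 = (1, 2, 0, 3, 4)
```

    Applying `swap01` then `cycle012` gives a different result from the reverse order, which is why a model can't get away with counting operations: it has to track the running state.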

    Different architectures solve these problems in quite distinct ways. Transformers are good at recall; in the end that’s what attention is: look back into the context, copy the relevant things. They have an inductive bias for this kind of problem: the nature of their algorithm fits the nature of the problem. When it comes to state tracking, though, they’re brittle. They memorize the state-tracking mechanism for the lengths of problem they see in training: give them something longer, and they don’t degrade so much as collapse.

    Recurrent models, like RNNs and state-space models, have the opposite shape. They have a natural inductive bias towards maintaining state. They keep a compact representation of The Current Thing and update it as tokens come in. That makes them effective at tracking state across time, but the conventional wisdom is that it costs them recall: the representation is fuzzy, and copying exact references back out of it is harder.

    One current trend in LLMs1 is hybrid models, where regular attention is interleaved with linear attention or state-space style layers. This is, usually, framed around efficiency: the linear layers don’t need the large KV cache. I wondered whether the hybrid might also give you both capabilities: strong state tracking and strong recall, in the same model, for the same query.

    To test this, I vibed up a benchmark called FactWorld. It’s a small, synthetic world of agents, objects, roles, and facts. Everything is generated from a deterministic knowledge base, with labels computed by a symbolic oracle, so every answer is correct by construction and nothing leaks from the rendered text.

    The world looks like this: agents (g0, g1, …) each carry a static fact (“g3’s a0 is v42”), and objects get passed around over time (“give o3 to g1”). The queries cross the two capabilities:

    • Recall : “what is a0 of g3?” Look up a fact.
    • State tracking: “who holds o3?” Replay the give-history; last write wins.
    • Composition : “what is a0 of the holder of o3?” Determine who holds the object, then recall that agent’s fact, in one query.
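
    The three query types above can be sketched as a tiny symbolic oracle. This is an illustrative reconstruction, not FactWorld's actual code: the function names and the example facts and events are made up (only "g3's a0 is v42" comes from the description above).

```python
def recall(facts, agent, attr):
    # Static lookup: "what is a0 of g3?"
    return facts[(agent, attr)]

def holder(events, obj):
    # Replay the give-history; the last write for `obj` wins.
    current = None
    for given_obj, agent in events:
        if given_obj == obj:
            current = agent
    return current

def compose_query(facts, events, obj, attr):
    # Resolve the holder first, then recall that agent's fact.
    return recall(facts, holder(events, obj), attr)

# Hypothetical world: two agents with one fact each, three give events.
facts = {("g1", "a0"): "v7", ("g3", "a0"): "v42"}
events = [("o3", "g3"), ("o1", "g1"), ("o3", "g1")]
```

    Here `compose_query(facts, events, "o3", "a0")` resolves the holder of o3 to g1 and then returns g1's fact, so an error in the first step silently becomes a confident wrong answer in the second.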

    The facts that the model needs are either in the prompt or fixed across training so the model can memorize them. This separates “reading from context” from “knowing from the weights.” And event histories can be longer at test time than anything seen in training, which separates “learned the rule” from “learned a length-specific shortcut.”

    To make sure it was sane, I validated the known results from the literature first, at small scale (~45M params). They reproduced! A transformer fits the S5 word problem at the training length and then collapses to exactly zero beyond it. A recurrent/linear model with non-commuting state transitions2 extrapolates it; one attention layer over a recurrent backbone solves canonical one-hop recall, which is the Zoology result.

    This was not without surprises. FactWorld tested recall by asking for the value at a separated answer position, not as the next token after the key. This underperformed the expected result because it turned out to be a bit of a composition in itself: you need to know which place to look at. Moving it to one-hop recall did give the expected result though.

    Trying to test the composed problem introduced its own difficulties. I had a 6M param smoke test and… nothing worked at all, completely flooring the task. Luckily, at ~45M params, while a transformer still floors (zero for ten across an entire learning-rate sweep), the gated-delta recurrent hybrids could learn it. Sometimes.3 And we did get a quite interesting failure mode.

    When a converged model got the composite wrong, it was usually a routing failure. The model has genuinely learned the resolve-then-recall pipeline (resolve a holder, recall a fact about them); it just resolved the wrong holder, and then confidently reported that agent’s fact. Recall is conditioned on state; they are not independent legs the model runs in parallel. Which felt pretty familiar: an agent flawlessly doing the wrong thing.

    Because the binding in this composite is last-write-wins, the ordering subtlety wasn’t a particular problem. The plain Gated DeltaNet hybrid could compose it. But, in my test, only at exactly one learning rate. The Gated DeltaProduct hybrid learned it across a broad band of learning rates, and extrapolated past the training length on a majority of seeds where the single-delta variant mostly didn’t. The product structure wasn’t necessary here; it was just easier to train4.

    For current large models, scale can paper over all of this: learn enough patterns and you accumulate tricks that work well enough in practice. But if we want smaller, cheaper, longer-context, more reliable agentic models, getting the right architecture matters. FactWorld is hopefully a way to check, without requiring thousands of GB300s.

    1. I mean, at least the ones where we know how they work ↩︎
    2. Order tends to matter in these tasks, but the nature of the updates in most state-space models means it doesn’t track that order well. This specific variant, Gated DeltaProduct, handles order-specific, or non-commuting, transitions better ↩︎
    3. Quite a few seeds simply never form the recall-under-composition circuit; it seemed a bit all or nothing. ↩︎
    4. For completeness: state-tracking crossed with facts stored in the weights still floors at length for every architecture I tried. “Look it up in your weights, mid-pipeline”, I have no idea how to do. ↩︎
  • Somehow, more on distillation

    The capabilities in a large language model emerge, mysteriously, from the training data. Everyone agrees that you start with a big pile of data, add some compute, and at the end you can vibe code. Opinions differ on what that pile of data should look like.

    Microsoft AI recently released an incredibly in-depth technical report about the development of their first model, MAI-Thinking-1. Shortly after, Nvidia released their latest open model, Nemotron 3 Ultra, accompanied by another detailed deep dive. The two approach data from somewhat contrasting directions.

    Nemotron is maximally distillation-pilled. Almost every corpus in their post-training stack comes from someone else’s model: math and science from DeepSeek V4 Pro, code and kernels from GPT-OSS and DeepSeek R1, chat from GLM-5, terminal traces from DeepSeek V3.2, SWE from MiniMax and Qwen3-Coder. The general reasoning teacher is trained to match DeepSeek V4 on a mixture that DeepSeek V4 generated. Even the pre-training data is 22% synthetic web crawl, plus synthetic QA, legal and fact-seeking sets1.

    This approach to data is what you do when the capabilities are the feature, not the product. Nvidia is a GPU company. It wants the behaviors and intelligence to be as widely available as possible. Now you can use them in the original models, or, use them in an open, American-made package, which runs beautifully on Blackwell. The model is a vehicle for inference, and as a vehicle it is excellent: strong, remarkably open, and incredibly well-tuned.

    At the other end of the scale is Microsoft AI, who are working from rather different principles. They want capabilities that are learned, and can be predicted. They want inputs they can control, and can carefully ladder up.

    This does not mean they disavow synthetic data: MAI self-distill, generate synthetic SWE envs and tool-calling, and create synthetic instruction-following rubrics and guidelines. There is plenty of model-generated data in the mix, especially in post-training where they train a bunch of specialists then distill them into the final model2. What they largely avoid is data from third-party models, particularly in pre and mid training3.

    Their goal isn’t just to get intelligence out into the world, it’s to build frontier capabilities themselves, and to sell them to enterprises. For that, you need a reproducible ladder you can actually climb (the paper refers to their whole process as a ‘hill climbing machine’). They spend a lot of energy on provenance: the corpus is (human-generated) publicly available and licensed data, and they specifically strip out AI-generated and other questionable material. They put an intense amount of effort into fitting scaling laws to a ladder of small models, judging every change by how much more baseline compute you’d need to reach the same quality, and whether those gains persist as you scale.
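
    As a rough illustration of "how much more baseline compute you'd need", here is a sketch that fits a pure power law to a ladder of baseline runs and inverts it. This is an assumption-heavy toy, not the report's method: it ignores any irreducible-loss term, the ladder values are made up, and the function names are hypothetical.

```python
import math

def fit_power_law(points):
    # Least-squares fit of log(loss) = log(a) - b * log(compute).
    xs = [math.log(c) for c, _ in points]
    ys = [math.log(l) for _, l in points]
    n = len(points)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    a = math.exp(my - slope * mx)
    return a, -slope  # loss ~= a * compute ** (-b)

def compute_multiplier(a, b, compute, new_loss):
    # Baseline compute needed to match `new_loss`, relative to what was spent.
    baseline_compute = (new_loss / a) ** (-1.0 / b)
    return baseline_compute / compute

# Made-up ladder of (compute, loss) points following loss = 10 * C ** -0.1.
ladder = [(10.0 ** e, 10.0 * (10.0 ** e) ** -0.1) for e in (18, 19, 20, 21)]
a, b = fit_power_law(ladder)
```

    A change that reaches, at 1e20 FLOPs, the loss the baseline would only reach at 2e20 FLOPs scores a multiplier of about 2. Checking whether that number holds up at each rung of the ladder is the "does it persist with scale" question.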

    They have to prove their models are getting better, entirely on their own terms. They trust the model because they tested exhaustively, and because they verified their tests hold with scale.

    Nemotron has the opposite problem, and the opposite solution. They can see how their model is doing against the suite of models they are leveraging. Their risk is not hygiene, it is overfitting to their sources, so they spend effort on ensuring generalization. Evals like PinchBench (an OpenClaw based eval, naturally) and ProfBench are held back as gates: evaluated only after the final model and never used in development. Tasks are trained under some harnesses and then checked on ones the model hasn’t seen before. They trust the model because it clears bars it was only introduced to at test time.

    If your data is clean, and you can see all of it, you predict the model before you train it and confirm your forecast. If the distribution is one you can’t deeply inspect, you instead start trying to break the thing in novel ways.

    Both seem to work! There are surprisingly few apples-to-apples benchmarks between the papers, but LiveCodeBench has them in similar territory: 89.0 for Nemotron, 87.7 for MAI. Researcher decisions might shape the language model, but they are also shaped by the business model.

    1. That, to their credit, they released. ↩︎
    2. Nemotron does a similar thing with their MOPD (multi-teacher on-policy distillation) technique. DeepSeek v4 was the first place I recall seeing this idea of training a bunch of specialists and then distilling them into the final model, but it seems to be another of the emerging best practices. ↩︎
    3. Even post-training really only uses external models, mainly GPT 5, for grading. ↩︎
  • We can distill it for you wholesale

    There has been a lot of drama1 about distillation: how (closed) frontier models are being used by other labs to boost their own performance on particularly hard tasks.

    The drama is not fake, exactly. Anthropic, and recently OpenAI, have a notable lead in the agentic-coding domain, and some of that is from having data that other people don’t. Getting it is… not cheap:

    This is why there are huge efforts going on at certain companies2 to develop long form agentic trajectories. But! Not everyone has the money, or the engineers, to do that.

    So, there is an incentive to maybe, allegedly, copy some homework. It’s not clear though how exactly to do that: the frontier labs generally don’t share the chain-of-thought that their models are using while they reason, which means you only have a sparse signal to train your model on.

    One piece of the puzzle is in a paper from February this year, “Privileged Information Distillation for Language Models” by Emiliano Penaloza et al. at ServiceNow, which is probably not where most people are expecting the hot post-training discourse to come from. On-Policy Self-Distillation is spicy right now in post-training circles, and this is one of the earlier papers in the current zeitgeist3.

    The paper’s primary contribution is π-Distill: how do you do distillation when you have Privileged Information?

    “We ground our work in the task of distilling frontier models for complex multi-turn agentic settings. Typically, the industry standard for these tasks involves Supervised Fine-Tuning (SFT) on frontier model outputs followed by Reinforcement Learning (RL). Unfortunately, some model providers restrict important information, most notably the model’s full Chain-of-Thought (CoT) reasoning traces (OpenAI et al., 2024), providing only a summary alongside the action they intend to take. This opacity undermines standard distillation methods, as we can observe what successful agents do but not how they reason about it.”

    The rough idea is to not use the frontier model as a teacher, but to use it as a source of that privileged information:

    • You have one set of model weights, run in two modes: a privileged teacher, and an unprivileged student.
    • A frontier model solves a task in its tool-use harness. You may not see its chain-of-thought, but you can observe what it actually does: its action trajectory.
    • That action trajectory is converted into the privileged information: tool names, tool calls with arguments, or a compact hint.
    • The teacher-mode model sees the task/history plus this privileged trace in the prompt. The student-mode model only sees the task/history in its prompt.
    • The teacher rolls out a trajectory and gets an RL reward4.
    • The student is then trained with teacher forcing: calculating loss based on how likely it would be to predict the actual next token the teacher generated.
    • The teacher and student losses are combined and applied to the single shared set of weights.
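
    The steps above can be sketched in a few lines. This is a loose illustration under stated assumptions, not the paper's objective: the "[hint]" prompt format, the `alpha` mixing weight, and all function names are hypothetical, and the RL loss is treated as a number handed in from elsewhere.

```python
import math

def build_prompts(task, privileged_trace):
    # Same weights, two views: only the teacher prompt carries the trace.
    student_prompt = task
    teacher_prompt = task + "\n[hint] " + privileged_trace  # hypothetical format
    return teacher_prompt, student_prompt

def teacher_forced_nll(student_token_probs):
    # Mean negative log-likelihood the student assigns to the teacher's tokens.
    return -sum(math.log(p) for p in student_token_probs) / len(student_token_probs)

def combined_loss(teacher_rl_loss, student_nll, alpha=0.5):
    # Hypothetical mixing weight; both terms update the one shared model.
    return alpha * teacher_rl_loss + (1.0 - alpha) * student_nll
```

    The point of the shape is that the privileged trace only ever appears in a prompt. The student's loss is computed without it, so whatever the weights learn has to work at test time when the trace is gone.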

    As the authors continue, it doesn’t even require a closed model to distill from. Other kinds of privileged information can help you do the same trick, which is the second variant of their recipe. If you don’t have an outside source but you do know some bonus details (e.g. hints on how to solve it, or critiques on prior attempts) you can pass them into the teacher:

    • Let the student roll out, without the privileged information.
    • Then ask the informed teacher how compatible the student’s tokens were with what the teacher would have done.

    The discussion about distillation has focused on the idea of stealing some kind of secret knowledge. What this method really shows though is that distillation is about turning information that the model will not have at test time into behaviors it will have.

    Like any good teacher, having a sense of how to get to the answer is going to make it easier to help your student. The “on-policy” part here is that the student and teacher are the same, the difference is the teacher is reading ahead in the study guide.

    As tasks get longer, tool use gets richer, and agent traces get more valuable. The question is probably less “can labs hide the model’s reasoning?” and more “what clues can you train on?”

    1. And/or marketing. ↩︎
    2. Notably including the one I work at! ↩︎
    3. Other good reads are the Thinky Blog and “Self-Distilled Reasoner”, which was released a few days before this, and is where the name comes from! ↩︎
    4. With a KL penalty that keeps it from drifting too far from the student. ↩︎
  • Loss Exploded.

    If you want to see what a very painful couple of months looks like for an ML research team, FAIR’s logbook of the OPT-175B pretraining from 2021 should top your list. The first few runs are basically:

    • Loss exploded. 
    • Doesn’t learn.
    • Loss exploded.
    • Etc.

    At each point the team tweaks some of the hyperparameters: learning rates, weight decay, clipping and so on, as well as adjusting how the model is distributed with various parallelisms. They swapped parts of the architecture (GELU to ReLU for example), dealt with hardware failures, avoided bad data, and tried to debug what was going on.

    That was 2021 though; by now in 2026 we commonly train on tens of thousands of much more powerful GPUs. We have a broadly available body of knowledge on how to train massive models. The big labs are now full of serious people debating the moral meaning of perplexity with their in-house philosophers.

    But the big labs don’t share those details. Of the frontier-ish labs, DeepSeek continue to be unusually open about their work. Their new one, DeepSeek v4, is pretty great! 1M tokens of context and close to frontier performance across multiple domains. Training, however, was… not smooth:

    “Training trillion-parameter MoE models presents significant stability challenges, and DeepSeek-V4 series are no exception. We encountered notable instability challenges during training. While simple rollbacks could temporarily restore the training state, they proved inadequate as a long-term solution because they do not prevent the recurrence of loss spikes.”

    One helpful dynamic of model training is that you can often validate what will happen in a big model by training in a small model. But not always. It mostly works for architectural tweaks and allows much more rapid experimentation and testing. But when it doesn’t work you can be in trouble: things that appear to smooth out problems at small scale can mask others at large scale where models have the capacity to learn weirder things. Fixing them late in training, when you are already into the gigawatts, hurts.

    The techniques DeepSeek used (expert routing based on stale params, and clipping) did seem to get them through training a massive model on 30-odd trillion tokens, which is an incredible accomplishment. But they are arguably bandaids, as the team readily call out:

    “Although a comprehensive theoretical understanding of their underlying mechanisms remains an open question for now, we are sharing them openly to foster further exploration by the community”

    Which has echoes of Noam Shazeer’s similar observation for SwiGLU: 

    “We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence.”

  • Everything MoE

    There are two really good ways to learn the deep fundamentals of a field. One we could call the Carmack/Ilya method: get an expert to give you a list of the seminal papers, systematically work through them, and in the process develop a deep, grounded intuition. This seems to work. The second is: funny tweets.

    A case in point:

    Other than the fact you have to be in a very particular niche in order to understand all the acronyms in that tweet, the idea that everything is an MoE feels right? Pretty much every notable model release, and probably all the secret frontier models, are MoE.

    Like every other idea in deep learning this goes back to something Hinton did in the 90s, specifically the paper Adaptive Mixtures of Local Experts by Jacobs, Jordan, Nowlan and Hinton:

    If backpropagation is used to train a single, multilayer network to perform different subtasks on different occasions, there will generally be strong interference effects that lead to slow learning and poor generalization. If we know in advance that a set of training cases may be naturally divided into subsets that correspond to distinct subtasks, interference can be reduced by using a system composed of several different “expert” networks plus a gating network that decides which of the experts should be used for each training case. […] The idea behind such a system is that the gating network allocates a new case to one or a few experts, and, if the output is incorrect, the weight changes are localized to these experts (and the gating network).

    The idea is that if your data naturally clusters, then having separate networks avoids smearing understanding across the weights. A dataset with both German and English training data might produce a model that mixes up both languages. If we train two different experts and learn a gating network, we can get a clean “German-speaking” model, and a clean “English-speaking” model, in one.

    Also, like every other idea in deep learning, this was very clever, but painful to train. In particular, this was because the decision about which expert to choose was a bit of a cliff. If you choose the German expert when you needed the English expert then the German expert would get some loss, but the English expert would get none. This could lead to the awkward situation where the German expert performed better for both English and German: you ended up with a smaller, smeared model, and a dead expert.

    Noam Shazeer and co came to the rescue in 2017 with the excellently titled “Outrageously Large Neural Networks”. They introduced concepts that didn’t fundamentally change the approach, but did make it practical.

    The key trick was adding an auxiliary loss that penalized the model for using one expert over the others. By adding some noise to the gating decision they helped it be differentiable and ensure errors could flow back effectively. This gave the training process a much better chance of avoiding this kind of “winner-takes-all” collapse.
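
    A minimal sketch of those two ingredients: noisy top-k gating, and a balance penalty based on the squared coefficient of variation of each expert's total gate weight over a batch. The fixed `noise_std` is a simplifying assumption (a learned, per-expert noise scale is the more faithful version), and the `weight` value is arbitrary.

```python
import math
import random

def noisy_top_k_gate(logits, k, noise_std=1.0, rng=random):
    # Perturb the router logits, keep the top k, softmax over the survivors.
    noisy = [z + rng.gauss(0.0, noise_std) for z in logits]
    keep = sorted(range(len(noisy)), key=lambda i: noisy[i], reverse=True)[:k]
    top = max(noisy[i] for i in keep)
    exps = {i: math.exp(noisy[i] - top) for i in keep}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]

def importance_loss(batch_gates, weight=0.01):
    # Squared coefficient of variation of per-expert gate mass over a batch.
    n_experts = len(batch_gates[0])
    importance = [sum(g[e] for g in batch_gates) for e in range(n_experts)]
    mean = sum(importance) / n_experts
    var = sum((x - mean) ** 2 for x in importance) / n_experts
    return weight * var / (mean ** 2)
```

    A batch where every token lands on the same expert gets a positive penalty, while a perfectly even split gets zero, which is the pressure that keeps a "dead expert" from staying dead.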

    Over time these methods were refined. In a contemporary MoE like DeepSeek v3, sigmoid-based routing removed the noise from the gating, and the auxiliary loss is replaced by what they call bias updates: they just put their thumb on the scale during training if some experts aren’t getting enough samples, which seems to work great.
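
    The "thumb on the scale" can be sketched like this, assuming the bias is added only when choosing experts and not when weighting their outputs. The step size, the use of mean load as the threshold, and the function names are illustrative choices rather than confirmed details.

```python
import math

def route(logits, bias, k):
    # Sigmoid affinities; the bias only influences which experts are picked.
    scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    chosen = sorted(range(len(scores)),
                    key=lambda i: scores[i] + bias[i], reverse=True)[:k]
    total = sum(scores[i] for i in chosen)
    return {i: scores[i] / total for i in chosen}  # gate weights ignore the bias

def update_bias(bias, loads, step=0.001):
    # Nudge overloaded experts down and underloaded experts up.
    mean_load = sum(loads) / len(loads)
    return [
        b - step if load > mean_load else b + step if load < mean_load else b
        for b, load in zip(bias, loads)
    ]
```

    Since the bias never touches the gate weights, balancing doesn't add a gradient term that competes with the language-modelling loss. It only changes who gets picked.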

    All of that is about how we got MoEs to scale, but doesn’t really say… why? Intuitively, if you can train a model with X parameters, it seems like it would be better to have all of them doing something (a dense model), rather than some subset1?

    The main reason this has taken over the field is it is a way of decoupling capacity (how much can the network “know”) from compute (how much work does it do for each input).

    In a dense model when you add a new token to train you send it to all parts of the model: every bit of capacity touches it, each of which uses some compute to process. MoEs are a form of sparsity: a way of ignoring some of the parameters. They let you add capacity without adding compute2.
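
    A quick back-of-the-envelope version of that decoupling, counting only feed-forward expert weights. The layer sizes are made up, and the sketch ignores the router, biases, and everything outside the feed-forward block.

```python
def moe_ffn_params(d_model, d_ff, n_experts, top_k):
    # Each expert is a two-matrix feed-forward block (biases ignored).
    per_expert = 2 * d_model * d_ff
    total = n_experts * per_expert   # capacity: parameters that must be stored
    active = top_k * per_expert      # compute: parameters touched per token
    return total, active

# Hypothetical layer: 64 experts, 4 active per token.
total, active = moe_ffn_params(d_model=1024, d_ff=4096, n_experts=64, top_k=4)
```

    With these numbers the layer stores 16 times more parameters than any single token touches. Adding experts grows `total` while `active` stays put.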

    There are other ways of achieving the same result, but the MoE approach is very hardware friendly. You’re still mostly doing dense matmuls, just split between experts. In parallelism terms, Expert Parallelism is efficient because you’re moving tokens between devices: it needs an all-to-all, but the data volumes are manageable.

    The tweet calls out NSA, Engram and mHC, all recent papers from DeepSeek. But underneath it calls out the design pattern: make a few alternative compute or memory paths, then use a learned gate to pick (or mix) a subset of them, per token. You get sparsity at the routing level, decoupling formerly coupled aspects, while each path can remain fairly dense and hardware-friendly.

    Engram makes the argument that language models have to do two things: reasoning and looking stuff up. The reasoning works great with stacks of Transformers, but the looking-stuff-up part is approximated through computation rather than just… looking stuff up.

    This process essentially amounts to an expensive runtime reconstruction of a static lookup table, wasting valuable sequential depth on trivial operations that could otherwise be allocated to higher-level reasoning.

    Classically, Natural Language Processing used a lot of N-grams: representations of more than one token at a time, but language models pretty much dropped that in favor of a fixed vocabulary. DeepSeek is bringing it back. These extra embeddings are retrieved for subsets3 of the tokens in the context window, the resulting vectors are summed4, then the model gates how much to incorporate the information based on the current state.

    It’s the same move of decoupling compute and capacity. Here they are adding a bunch of extra storage parameters but letting the model learn whether or not to use them. Because the retrieval is based on tokens the table doesn’t have to live in VRAM but can be loaded with the input5.
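
    A toy sketch of the lookup-then-gate shape. The CRC-based hashing, the table size, and the scalar sigmoid gate are all assumptions made for illustration; the paper's actual hashing and gating are not reproduced here.

```python
import math
import zlib

def ngram_slots(tokens, n, table_size):
    # Hash each n-gram ending at a position to a row of a fixed-size table.
    slots = []
    for i in range(n - 1, len(tokens)):
        key = " ".join(tokens[i - n + 1 : i + 1]).encode("utf-8")
        slots.append(zlib.crc32(key) % table_size)
    return slots

def gated_add(hidden, retrieved):
    # Scalar gate from the current state decides how much memory to mix in.
    gate = 1.0 / (1.0 + math.exp(-sum(h * r for h, r in zip(hidden, retrieved))))
    return [h + gate * r for h, r in zip(hidden, retrieved)]
```

    Note that `ngram_slots` depends only on the token ids, never on activations. That is what makes it possible to work out which rows are needed before the forward pass starts and fetch them from somewhere cheaper than VRAM.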

    The second paper, Manifold-Constrained Hyper-Connections, is the most math-heavy of the recent releases, and it builds on one of the most cited papers in ML: ResNet.

    In the bad old days, the “Deep” in Deep Neural Nets didn’t really exist: you could theorize, but if you tried to train one you’d get into a place where the early layers received basically no useful loss signal. ResNets fixed this in the simplest way possible: as well as sending through the “output” of a layer, you sent through the input too. This gave an efficient highway for loss gradients to flow back and enabled successfully training much, much deeper models.

    mHC builds on an observation that ResNets hard-code another compute/capacity tradeoff: the size of the residual channel. If you think of a layer of a transformer: it has an input of C tokens, and an output the same size. The residual connection works by summing the input tokens and the output tokens. That’s assigning as much information capacity to the residual channel as you do to the processing channel. E.g.

    • Layer 0 gets raw tokens, and outputs a sum of raw+contextualized tokens
    • Layer 1 gets layer 0 tokens and outputs a sum of layer0+contextualized tokens
    • Etc.
    • At the end you get a cake recipe

    But maybe that cake recipe would be better if Layer 2 had access not just to the layer0 tokens, but also to the raw tokens? We don’t really have a way to express that outside of adding extra skip connections. Hyper Connections widen the ResNet channel into multiple lanes, and mHC lets the model decide what to put in each: so you could have layer 1 putting layer0 context in one lane, and raw tokens in another lane6. If MoE lets you take a bunch of parameters and selectively route tokens to a subset, then mHC lets you take a bunch of residual bandwidth and selectively mix the information flow from your module to a subset of it.
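
    The Sinkhorn-Knopp part is easy to show on its own: alternately normalise rows and columns of a positive matrix until both sum to one. How mHC parameterises and applies the resulting matrix is not reproduced here; `mix_lanes` is a hypothetical stand-in to show why a doubly stochastic mix can't blow up the residual stream.

```python
def sinkhorn_knopp(matrix, iterations=50):
    # Alternately normalise rows and columns of a positive square matrix.
    m = [row[:] for row in matrix]
    n = len(m)
    for _ in range(iterations):
        for row in m:
            total = sum(row)
            for j in range(n):
                row[j] /= total
        for j in range(n):
            total = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= total
    return m

def mix_lanes(mixing, lanes):
    # Each output lane is a weighted blend of the input residual lanes.
    width = len(lanes[0])
    return [
        [sum(mixing[i][j] * lanes[j][d] for j in range(len(lanes)))
         for d in range(width)]
        for i in range(len(mixing))
    ]
```

    Because every column of the normalised matrix sums to one, the total across lanes is the same before and after mixing. Information gets moved between lanes but not amplified layer after layer.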

    Finally, Native Sparse Attention follows the classic DeepSeek move of throwing a bunch of engineering wins together. Instead of assuming the amount of attention compute for each token is the same, they scale it dynamically based on the content itself. They mix the outputs of three branches: a pooled version of the context window that gives a compressed representation, a MoE-style gated selection from the full context window7, and a classic sliding window attention.
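
    A toy sketch of which positions the three branches would look at for the latest token. Mean pooling, dot-product block scoring, and the parameter names are assumptions for illustration, and the learned gate that mixes the three branch outputs is left out entirely.

```python
def select_positions(query, keys, block_size, top_blocks, window):
    # Returns pooled block keys plus the positions attended to in full.
    n = len(keys)
    blocks = [range(s, min(s + block_size, n)) for s in range(0, n, block_size)]
    # Compressed branch: one mean-pooled key per block.
    pooled = [
        [sum(keys[i][d] for i in blk) / len(blk) for d in range(len(query))]
        for blk in blocks
    ]
    # Selection branch: score blocks via their pooled key, keep the best few.
    scores = [sum(q * p for q, p in zip(query, vec)) for vec in pooled]
    best = sorted(range(len(blocks)), key=lambda b: scores[b], reverse=True)[:top_blocks]
    selected = {i for b in best for i in blocks[b]}
    # Sliding-window branch: always keep the most recent positions.
    recent = set(range(max(0, n - window), n))
    return pooled, sorted(selected | recent)
```

    The cost of full attention now scales with `top_blocks * block_size + window` rather than with the whole context, and which blocks earn that cost is decided per query.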

    This is the pattern MoE exemplified:

    • look at what is constrained
    • add more of it, but make it conditional to avoid scaling other things at the same time

    It’s a thread that runs through an awful lot of the industry right now. Understanding that is useful when anticipating where things are going to go next.

    Or, you could have saved yourself a lot of time and just liked the tweet.

    1. MoEs do have some inference advantages: if you have a 100bn parameters model where just 20bn are active for a given token you simply have to do less work than a 100bn param dense model. That’s a win for latency! But, you still have to store all those 100bn parameters, meaning you need quite a lot of memory kicking around. ↩︎
    2. More specifically, they make the ratio of adding capacity and adding compute very flexible: modern MoEs often have many experts and activate several at a time. ↩︎
    3. In this case Deepseek uses 2-grams and 3-grams ↩︎
    4. A weighted sum ↩︎
    5. In practice they inject the ngram embeddings at a couple of different points later in the model, where empirically there seemed to be enough context for the model to make useful mixing decisions ↩︎
    6. The specific clever thing the Deepseek folks added was a constraint to stop this from exploding, using the wonderfully named Sinkhorn-Knopp algorithm (apparently) ↩︎
    7. Based on those pooled tokens. Effectively it’s taking the “summarized” context window, and using runtime gating to decide which bits of the context window to add in full. ↩︎
  • Attention, Compression & Predicting the next token

    Language modelling is one of the great ideas in ML: if you train a model to accurately predict the next word in a sequence of text1, you are forcing it to learn a deep structure for human language. Because language is how we map reality, hopefully then you can do many useful things. This turned out to be right!

    The challenge with actually, you know, doing this is that text is messy. It’s sequential, variable length, and has structure, but the structure is kind of weird: the phrase “the cat, a mellow long-haired persian, sat on the mat” very clearly associates “sat” with “cat”, but the actual words are quite far away2.

    Dealing with sequential, variable length data with a fixed network is a bit of an inherent mismatch. In training you often know the sizes you’re dealing with, but at inference time it’s variable. One elegant solution to that was the Recurrent Neural Net (RNN): start at the beginning, read one word at a time and keep a “hidden state” as a scratch pad to provide memory of what has come before.

    Training RNNs was painful, because now you have to backpropagate over multiple steps, and it was a minefield of vanishing and exploding gradients. The hidden state was used for two different things: the long-term memory of the whole sequence and as the key to the next word.

    Getting to Attention

    The architecture that really addressed this was the LSTM: instead of a single memory they split short and long-term memory and added activation functions to keep the gradient updates sane. They also made updating the memory a function of the input rather than of the weights by adding learnable gates that let the model decide which parts of the input to remember, and what information from the memory to forget. This unlocked real sequence-to-sequence models, which proved immediately useful in areas like machine translation: one model reads a sequence and compresses it to a hidden state (the encoder), another generates new output based on it (the decoder).

    This solved the training stability bottleneck, and introduced a new one: compression. The entire sequence got compressed to a single hidden state, which limited how much complexity could be captured.

    Bahdanau et al. addressed that with the idea of attention in 2014. The hidden state gets updated in the encoder with each new word, so why not keep all the hidden states around? Then, have a small network score which hidden states are relevant to the current decoder state, and make a new contextualized input to the decoder that is a weighted sum of the encoder states. This was called “attention” as it allowed the model to put different amounts of focus on different parts of the input sequence.
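
    A minimal sketch of that idea in plain Python. One loud assumption: it swaps Bahdanau's small scoring network for a plain dot product (closer to the Luong simplification mentioned below) to keep things short. The function names and toy vectors are illustrative only.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Score each encoder state against the decoder state, then blend them."""
    # Dot-product scoring (an assumption; the original used a small network).
    scores = [sum(d * h for d, h in zip(decoder_state, hs)) for hs in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # The context vector is the attention-weighted sum of all encoder states.
    context = [sum(w * hs[j] for w, hs in zip(weights, encoder_states))
               for j in range(dim)]
    return weights, context


encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend([0.0, 5.0], encoder_states)
```

    The decoder state here lines up with the second and third encoder states, so they get almost all of the weight and the first is largely ignored.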

    The new bottleneck though was throughput: to generate hidden state n, you first needed hidden state n-1. That made it hard to parallelize, which made it hard to take advantage of emerging accelerators. Luong et al first showed that you could simplify the state scoring to make it more hardware friendly, then Attention Is All You Need in 2017 stripped away the recurrent part entirely. In the Transformer architecture they got rid of the RNN and hidden state, replacing it with another version of the attention mechanism: self-attention.

    Rather than a stack of hidden states that progressively encode the state of the sequence, each incoming word is transformed at once into a contextualized representation that carries information about it and its surroundings. This was really parallelizable; you don’t need to care about previous time steps to make decisions, so you can scale the computation on GPUs and other accelerators.

    In regular attention you can think of the current decoder3 state as a query, and the various encoder hidden states as keys: the scoring function would generate a value for each pair of key and query. In self-attention, all the tokens were projected through key and query networks, and the query for each token was compared to the key of all the others. The transformer also added a value projection: in the older attention the “key” from the hidden state was both “what makes a good match” and “what information the token provides”, but in the transformer the two were decoupled.
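
    To make the query/key/value split concrete, here is a single-head self-attention sketch in plain Python. The projection matrices are identity matrices purely for illustration, and the 1/sqrt(d) scaling follows the Transformer paper; there is no masking or multi-head logic.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens, w_q, w_k, w_v):
    """Every token queries every token's key, then mixes the matching values."""
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]      # transpose of k
    scores = matmul(q, k_t)                   # (n, n): one score per token pair
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights


identity = [[1.0, 0.0], [0.0, 1.0]]           # toy projections (assumption)
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = self_attention(tokens, identity, identity, identity)
```

    The `(n, n)` score matrix is the part that becomes a problem as sequences get longer.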

    The new bottleneck that emerged was performance, particularly during inference. Comparing everything to everything else is an O(n²) operation. During training you can ameliorate some of that through batching, but you’re directly exposed in inference. And, unlike an RNN, increasing the sequence length (aka context length) gives you a quadratic increase in time, not linear.

    There were various attempts at addressing this one too. In “Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention” back in 2020, Katharopoulos et al showed that the quadratic aspect of self-attention comes from having to materialize a big matrix to calculate the softmax for scoring. If you replace the softmax with a feature map applied separately to the queries and keys, you can chunk the computation and get linear time performance. This was mathematically elegant, but didn’t actually work very well, so more engineering-oriented approaches like KV caching and FlashAttention were the mainstay for tackling the bottleneck.
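
    The "Transformers are RNNs" claim can be checked in a few lines. This is a sketch under assumptions: it uses the elu(x) + 1 feature map from the paper, works on toy Python lists, and the function names are made up. The quadratic version scores each query against every earlier key; the recurrent version keeps a running state and gives the same answer.

```python
import math


def phi(x):
    """Feature map elu(x) + 1; keeps every entry positive."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def quadratic_causal(qs, ks, vs):
    """Score every query against every earlier key: O(n^2) in sequence length."""
    outs = []
    for i, q in enumerate(qs):
        sims = [dot(phi(q), phi(ks[j])) for j in range(i + 1)]
        total = sum(sims)
        outs.append([sum(s * vs[j][d] for j, s in enumerate(sims)) / total
                     for d in range(len(vs[0]))])
    return outs


def recurrent_causal(qs, ks, vs):
    """Same outputs from a running state: one fixed-size update per token."""
    dk, dv = len(ks[0]), len(vs[0])
    state = [[0.0] * dv for _ in range(dk)]   # running sum of phi(k) v^T
    norm = [0.0] * dk                         # running sum of phi(k)
    outs = []
    for q, k, v in zip(qs, ks, vs):
        fq, fk = phi(q), phi(k)
        for a in range(dk):
            norm[a] += fk[a]
            for d in range(dv):
                state[a][d] += fk[a] * v[d]
        denom = dot(fq, norm)
        outs.append([sum(fq[a] * state[a][d] for a in range(dk)) / denom
                     for d in range(dv)])
    return outs


qs = [[0.5, -1.0], [1.0, 0.2], [-0.3, 0.8]]
ks = [[0.1, 0.4], [-0.7, 1.2], [0.9, -0.5]]
vs = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.5]]
slow, fast = quadratic_causal(qs, ks, vs), recurrent_causal(qs, ks, vs)
```

    The `state` matrix is the hidden state of an RNN in all but name, which is the thread the rest of this post follows.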

    So why talk about this now? Because of Moonshot AI, and their excellent Kimi models. Moonshot are perhaps the frontier-est of the Chinese tiger labs, and their recent releases include Kimi Linear: An Expressive, Efficient Attention Architecture

    The architecture mixes regular self-attention layers with Kimi Delta Attention. And Kimi Delta Attention is just the latest in a thread of evolution which goes back (sorta!) to RNNs.

    State space models

    For a long time, folks modelled control systems using state-space models. These return both an output and a state, and have a linear update function. RNNs such as LSTMs weren’t strictly state-space models in part because of their use of non-linearities: when updating the memory LSTMs used a tanh activation, for example. If you hand-wave a bit and ignore that, you’re looking at a very similar process.

    But there is a gap between hand-waving and science, and luckily someone crossed it. The benefit of that activation function was that it squashed the state into a known range and avoided the vanishing gradient issue that plagued RNNs. The key realization was that you can drop the non-linearity entirely4 as long as the weight matrix that multiplies the hidden state is well behaved (specifically, has eigenvalues close to, but less than, one).
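
    A toy illustration of why the "well behaved" condition matters. Assumption: a single scalar `a` stands in for the eigenvalues of the state matrix, and the function names are invented for this sketch. With no non-linearity in the loop the recurrence unrolls into a plain weighted sum, which is what makes it possible to compute in parallel.

```python
def run_linear_state(a, b, inputs, h0=0.0):
    """Step-by-step linear state update: h_t = a * h_{t-1} + b * x_t."""
    h = h0
    for x in inputs:
        h = a * h + b * x
    return h


def unrolled(a, b, inputs):
    """The same final state written as one sum, with no sequential dependency."""
    n = len(inputs)
    return sum(a ** (n - 1 - t) * b * x for t, x in enumerate(inputs))


ones = [1.0] * 200
stable = run_linear_state(0.9, 1.0, ones)     # settles just below 10
unstable = run_linear_state(1.1, 1.0, ones)   # grows without bound
```

    With `a` just under one the state remembers a long history without blowing up; nudge it over one and it explodes, no tanh required to see the problem.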

    Much of this is in the HiPPO and S4 papers, with Albert Gu, Chris Ré and Tri Dao. This was another neat idea, which included a clever bit of linear algebra with a technique called Diagonal+Low Rank to make the state updates relatively efficient, but didn’t perform as well as regular transformer models. Gu and Dao identified the challenge as those well-behaved weights that update the hidden state. Much like with RNNs prior to LSTMs they were adding a fixed amount of information from the input to the state. In Mamba they reused the same kind of trick: adding a small network to gate the updates so the model can learn to remember more, or less, depending on the specific input5.

    Then, in the Mamba 2 paper from 2024, Gu and Dao brought everything together. They showed that the 2020 style linear attention, with a decay mask, was the same as a structured state space model like Mamba 1. That means they could apply the same chunking tricks in linear attention and get much better scaling and training, but with the ability to handle long sequences Mamba had.

    The slow recreation of LSTM features in more scalable forms continued with Gated DeltaNet. The Mamba approach ‘faded’ old memories via a decay, but it couldn’t explicitly subtract information like the LSTM forget gate. Gated DeltaNet also calculated the difference (the delta) between the expected and actual state, allowing it to effectively edit the memory rather than just overwriting it6.
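
    The difference between adding to memory and editing it shows up in a tiny example. This is a sketch of the classical delta rule only, assuming unit-norm keys and a write strength of 1; it leaves out the gating and decay that Gated DeltaNet layers on top, and the function names are made up.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def read(state, key):
    """Look up the value currently stored under a key: S k."""
    return [dot(row, key) for row in state]


def write_additive(state, key, value):
    """Linear-attention style write: S += v k^T, so old values pile up."""
    return [[s + v * k for s, k in zip(row, key)] for row, v in zip(state, value)]


def write_delta(state, key, value, beta=1.0):
    """Delta-rule write: S += beta * (v - S k) k^T, correcting the stored value."""
    predicted = read(state, key)
    return [[s + beta * (v - p) * k for s, k in zip(row, key)]
            for row, v, p in zip(state, value, predicted)]


empty = [[0.0, 0.0], [0.0, 0.0]]
key = [1.0, 0.0]

# Write two different values under the same key with each rule.
added = write_additive(write_additive(empty, key, [1.0, 2.0]), key, [5.0, 5.0])
edited = write_delta(write_delta(empty, key, [1.0, 2.0]), key, [5.0, 5.0])
```

    Reading `key` back gives the sum of both values from the additive memory, but just the newer value from the delta-rule memory.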

    Kimi Linear sped this up, and improved the fading mechanism to be per-dimension rather than a single rate across the memory:

    “Crucially, KDA parameterizes its transition dynamics with a specialized variant of the Diagonal-Plus-Low-Rank (DPLR) matrices [30, 71], enabling a bespoke chunkwise-parallel algorithm that substantially reduces computation relative to general DPLR formulations while remaining consistent with the classical delta rule. Kimi Linear interleaves KDA with periodic full attention layers in a uniform 3:1 ratio.”

    They manage to kill two birds with one piece of linear algebra. The DPLR trick they reused from S4 lets you take a diagonal vector for the update rate and apply it across the matrix product of a low-rank approximation for the state transition. Moonshot realized that you could replace the approximation with the K and V matrices directly, which is much more efficient, and that you could have the diagonal come from a vector of the same dimension, so you get per-channel forgetting.
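
    Per-channel forgetting is simple to picture on a toy memory matrix. This sketch only contrasts the two kinds of decay; the layout (one column per channel) and the function names are assumptions for illustration, and it is not the chunkwise KDA algorithm from the paper.

```python
def decay_scalar(state, alpha):
    """One forgetting rate for the whole memory."""
    return [[alpha * s for s in row] for row in state]


def decay_per_channel(state, alphas):
    """A separate forgetting rate per channel (one per column in this layout)."""
    return [[a * s for a, s in zip(alphas, row)] for row in state]


memory = [[1.0, 1.0], [1.0, 1.0]]
uniform = decay_scalar(memory, 0.5)                 # everything fades equally
selective = decay_per_channel(memory, [1.0, 0.5])   # keep channel 0, fade channel 1
```

    With a scalar rate the model has to forget everything at the same speed; with a vector it can hold on to some channels while letting others fade.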

    Compression & Recall

    It seems likely we will see more sophisticated mixing of different types of attention in models as labs continue improving architectures. We started with recurrent models as a natural expression of the problem, moved to transformers for scale, and have been slowly integrating the two expressions together. We are still just trying to predict the next word, but it turns out the best way to do it is to remember some things, forget most things, and accept that the map is not the territory.

    Reading through the papers on this journey really highlighted how the field moves between compression and breadth of recall. Sometimes researchers get a bad rap from their engineering brethren for being disconnected from reality, but this chain of evolutions is a pragmatic one.

    You want to get as much intelligence into the model as possible. That’s done by compressing the training data into efficient, useful and general representations, but finding those representations is hard! If you hit a limit in finding them, then one approach is to simply add more knowledge: add more parameters, consider more training data, and build more of the imperfect representations to give you more options to choose from.

    MoEs, synthetic data, and various other aspects of modern model training are playing with this same trade off: represent better or represent more. After his recent HotChips talk, Noam Shazeer was asked how we can find more efficient ways of encoding knowledge into parameters, closer to how the brain does it. He responded first by asking the questioner: “why are you limited on parameters?”

    1. The idea dates back to Jeff Elman, I think, who showed that training a network on this objective caused the network to learn grammar categories and other features of English. ↩︎
    2. This kind of thing is even hard for humans at sufficient lengths of text: there is a version of War & Peace in English that is largely the original (translated, natch), but normalizes all the character names as they were such a common point of confusion ↩︎
    3. In the original paper they kept the same encoder/decoder set up as with earlier models, as it’s eminently sensible for translation tasks. The GPT models and others demonstrated you could go decoder-only effectively. What we tend to call “prefill” these days is effectively a (causal) encoder step within the decoder model that contextualizes the input, then the “decoder” is the autoregressive generation process after. ↩︎
    4. There actually still is non-linearity, as you need it for neural networks in general, but rather than doing it in the loop memory update, it happens in projection MLPs after the layer. Then in Mamba it moved into the gating, so it’s only dependent on input, not the h_{t-1} state! ↩︎
    5. And it was Orvieto and the DeepMind folks that showed that you can get the same results in an RNN without the non-linearities if you can set up the matrix right. ↩︎
    6. Part of this reason was recall, which Jamba addressed. Because the RNN approach is inherently compression based it was harder to just cut and paste sections of the context when they were relevant. Jamba mixed regular attention layers with Mamba layers, giving back the global context while still providing better scaling. The specific recall problem is really emphasized by the fact that one of the standard long context evals is the “needle in a haystack” task, where a relevant fact is hidden in a long doc and needs to be pulled out. ↩︎
  • Qwen-Image

    GPT-4o’s image generation was a remarkable event, beyond the brief Ghiblification of all social media. GPT-4o offered significantly more steerability than earlier image generation models, while offering image quality in the ballpark of the best diffusion models. Qwen-Image gives a similar level of fidelity and accuracy and is an open-weights model with a pretty decent technical report: QwenLM/Qwen-Image.

    While I was fairly familiar with diffusion models, I wasn’t really familiar with the backbone of this model, the multimodal diffusion transformer (MMDiT). Rather than just look at it, I vibed up a repo with Claude Code that went step by step through the architectures, training on good old MNIST. ianbarber/diffusion-edu — which spat out this:

    This ended up being a helpful way to go step by step through the evolution of diffusion models. 

    Loss/Target

    Modern image generation really kicked off with GANs. GANs were a clever idea that exploited the fact that we are better at building classifiers than generators by using one to bootstrap the other. A generator would generate an image against a reference, the discriminator would be given the generated image and the reference and have to predict which was the real one, and both networks were scored on how well they did on their tasks. This was effective, but was challenging to train. The generator also had to start from somewhere and what it effectively started from was noise: the generator would start with fairly random output and the discriminator would learn to identify noise vs the real image.

    The clever idea Jonathan Ho and co had with DDPM was to focus on that noise: what if instead of learning to generate images we learned to remove noise from images? In the snippet below we:

    • Pick a timestep between 0 and 1000
    • Generate some noise
    • Add an amount of noise to the training image proportional to the timestep
    • Get the model to predict the noise, given the time step
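    • Calculate the loss as the mean squared error between the known noise and the predicted noise

    # Body of the loss method; assumes torch and torch.nn.functional as F are imported
    # Sample a random timestep for each image in the batch
    t = torch.randint(0, 1000, (B,), device=device)
    
    # Add noise to the image (alpha_t is the fraction of signal kept at step t)
    eps = torch.randn_like(x0)
    alpha_t = self.alpha_schedule(t).view(-1, 1, 1, 1)  # broadcast over C, H, W
    xt = torch.sqrt(alpha_t) * x0 + torch.sqrt(1 - alpha_t) * eps
    
    # Predict the noise we just added
    eps_pred = self.model(xt, t, cond)
    
    return F.mse_loss(eps_pred, eps)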

    This pretty much worked! You needed to use quite a few timesteps (around 1000), but the model would learn to discriminate noise from data. Then, you can reverse the process to generate: given a noisy starting point, predict the noise at the current timestep, remove it, add back a little fresh noise, move to the next timestep, then repeat.
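
    The `alpha_schedule` called in the training snippet isn't shown in the post. A minimal sketch of what it could look like, assuming the linear beta schedule from the DDPM paper (1e-4 to 0.02 over 1000 steps); the function names here are hypothetical and the repo may well do something different.

```python
import math


def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Per-step noise amounts, rising linearly (the DDPM paper's defaults)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar[t] = product of (1 - beta) up to t: the share of signal left."""
    out, running = [], 1.0
    for b in betas:
        running *= 1.0 - b
        out.append(running)
    return out


def noise_sample(x0, eps, alpha_bar_t):
    """Blend clean data and noise exactly as the training snippet does."""
    keep, add = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [keep * x + add * e for x, e in zip(x0, eps)]


alpha_bars = cumulative_alphas(linear_beta_schedule())
early = noise_sample([1.0, -1.0], [0.3, 0.3], alpha_bars[0])    # almost clean
late = noise_sample([1.0, -1.0], [0.3, 0.3], alpha_bars[-1])    # almost pure noise
```

    At the first timestep nearly all of the image survives; by the last one the sample is essentially just the noise, which is why generation can start from a random tensor.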

    Song et al. followed this up with DDIM, identifying that one of the reasons you need so many steps is that you are injecting new noise at each one. If you start with the noise up front when sampling you have a much more deterministic process, and can generate in more like 50 steps than 1000: 

    x = torch.randn(*x_shape)  # Start with pure noise; none is injected later
    B = x_shape[0]
    dt = 1.0 / steps  # Assumed uniform step size (not defined in the original snippet)
    
    for i in reversed(range(steps)):
      t = torch.full((B,), i / steps)
      if target == TargetMode.EPS:
        eps = model(x, t, cond)
        # Simplified deterministic update: strip one step's worth of predicted noise
        x = (x - eps * dt) / (1 - dt) ** 0.5

    The next step, in 2021, was Classifier-Free Guidance from Ho and Salimans. The clever idea was to pass a conditioning variable through to the model: for example in our MNIST example it could be the digit we are learning from. However, during training we would sometimes zero it out. This means the model learns to generate conditionally (for the specific digit) and unconditionally (just in whichever direction looks the best). 

    if cond is not None and self.cfg_dropout_prob > 0:
      mask = torch.rand(B, 1, 1) < self.cfg_dropout_prob
    
      cond = cond * ~mask  # Zero out conditioning randomly
    
    return F.mse_loss(self.model(xt, t, cond), target)

    This gets useful at generation time. When sampling, we can sample both conditionally and unconditionally, and diff out the unconditioned part: 

    # Run model twice: with and without conditioning
    cond_pred = model(x, t, cond)
    uncond_pred = model(x, t, None)
    
    # Amplify the difference
    return uncond_pred + cfg_scale * (cond_pred - uncond_pred)

    If you imagine the sampling process as denoising, this is saying there is the “best” direction, given the condition, and the “best direction” overall. By reducing the influence of the overall best direction, we get clearer steerability, and effectively the model serves as its own iterative classifier. 
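
    A worked example of that last line with plain lists standing in for tensors (the numbers are made up). A scale of 0 returns the unconditional prediction, 1 returns the conditional one, and anything above 1 extrapolates past it.

```python
def cfg_combine(cond_pred, uncond_pred, cfg_scale):
    """Push the prediction away from the unconditional direction."""
    return [u + cfg_scale * (c - u) for c, u in zip(cond_pred, uncond_pred)]


cond_pred = [1.0, 0.0]      # toy "with the digit label" prediction
uncond_pred = [0.5, 0.5]    # toy "no label" prediction

unconditional = cfg_combine(cond_pred, uncond_pred, 0.0)
conditional = cfg_combine(cond_pred, uncond_pred, 1.0)
amplified = cfg_combine(cond_pred, uncond_pred, 3.0)
```

    Here `amplified` comes out as `[2.0, -1.0]`: each component has moved three times as far from the unconditional prediction as the conditional one did.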

    Also in 2021, Song et al published Score-Based Generative Modeling through Stochastic Differential Equations. They framed the diffusion problem as a Stochastic Differential Equation (SDE), effectively a regular differential equation dx = f(x, t)dt with an additional noise term1: dx = f(x, t)dt + g(t)dw. The g(t) in that latter term controls how much random noise is involved.

    The contribution from the paper is that they worked out how to reframe this without that dw noise, i.e. they turned it into an “Ordinary” Differential Equation (ODE) without the random component. Then the model can be viewed as a velocity field that ends up having a similar shape to the one modelled by the random noise version, but is deterministic.
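
    For reference, here are the two forms side by side as I understand them from the paper. The score term (the gradient of log p_t) and the reverse-time Wiener process (w with a bar) are the paper's notation rather than anything defined in this post; the score is the quantity the network learns to approximate.

```latex
% Reverse-time SDE: still has a random component
dx = \left[ f(x, t) - g(t)^2 \, \nabla_x \log p_t(x) \right] dt + g(t) \, d\bar{w}

% Probability flow ODE: same marginals p_t(x), no random component
dx = \left[ f(x, t) - \tfrac{1}{2} \, g(t)^2 \, \nabla_x \log p_t(x) \right] dt
```

    The only differences are the factor of one half on the score term and the missing noise term, which is what makes the second one deterministic.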

    Salimans & Ho were not done, and proposed another improvement to the loss, v-parameterization, in their Progressive Distillation paper (it was later used in Imagen Video). One of the challenges with predicting the noise (eps above) is that when you get pretty close to a finished image there isn’t much noise, so the prediction isn’t particularly good. Similarly, when you are starting with pure noise it’s predicting almost everything, so also doesn’t give much information. Predicting the noise involves estimating both the clean sample and the noise. Some reordering lets you predict a single value, the velocity field (or gradients), which combines the clean sample (x0), the noise (eps), the time step and the current (noised) sample. By having the model predict that we can balance between predicting the image and the noise, giving better results at the extremes.

    # Here alpha_t and sigma_t are the signal and noise scales: xt = alpha_t * x0 + sigma_t * eps
    v_target = alpha_t * eps - sigma_t * x0
    v_pred = self.model(xt, t, cond)
    
    return F.mse_loss(v_pred, v_target)
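
    One way to see why v is a balanced target: given the noised sample and v you can recover both the clean image and the noise. A small check, assuming the usual variance-preserving convention where alpha² + sigma² = 1 (here written as a cosine and sine of an angle); the numbers are arbitrary.

```python
import math

angle = 0.3                                   # any point along the schedule
alpha, sigma = math.cos(angle), math.sin(angle)

x0 = [0.5, -1.0]                              # toy clean sample
eps = [0.2, 0.7]                              # toy noise

xt = [alpha * a + sigma * e for a, e in zip(x0, eps)]
v = [alpha * e - sigma * a for a, e in zip(x0, eps)]

# Both quantities fall out of (xt, v) by a simple rotation.
x0_recovered = [alpha * x - sigma * vi for x, vi in zip(xt, v)]
eps_recovered = [sigma * x + alpha * vi for x, vi in zip(xt, v)]
```

    When alpha is near 1 (almost clean) v is mostly noise, and when sigma is near 1 (almost pure noise) v is mostly the negated image, so the target stays informative at both ends.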

    Finally (on the loss) we get to flow matching from folks at Meta FAIR (Flow matching) and UT Austin (Rectified Flow). Rather than making the target a blend of start and noise, why don’t we just predict the straight path to the data? Compare the v_target below to the one above: 

    t = torch.rand(B, 1, 1, 1)
    z = torch.randn_like(x0)
    
    # Straight line: xt = (1-t)*x0 + t*z
    xt = (1 - t) * x0 + t * z
    
    # Learn the velocity field pointing from noise to data
    v_target = x0 - z  # The straight path direction
    v_pred = self.model(xt, t.squeeze(), cond)
    
    return F.mse_loss(v_pred, v_target)

    Flow matching models often converge faster during training and can generate good samples with fewer steps. They also tend to have more consistent quality across different sampling step counts.
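
    Sampling from a flow matching model is just following the velocity from noise back to data. A sketch using the same convention as the training snippet (t = 1 is noise, t = 0 is data, velocity points from noise to data). The `oracle_velocity` function is a stand-in for a trained model: it cheats by knowing the single target point, purely to show the Euler loop working.

```python
def euler_sample(velocity, z, steps=4):
    """Walk from noise (t=1) to data (t=0) in equal Euler steps."""
    x, dt = list(z), 1.0 / steps
    for i in range(steps, 0, -1):
        t = i / steps
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


x0 = [2.0, -1.0]                              # the one "data point" in this toy


def oracle_velocity(x, t):
    """Ideal velocity for a single target: on the straight path it equals x0 - z."""
    return [(a - xi) / t for a, xi in zip(x0, x)]


one_step = euler_sample(oracle_velocity, [0.3, 0.9], steps=1)
four_steps = euler_sample(oracle_velocity, [0.3, 0.9], steps=4)
```

    Because the path is a straight line, one step lands in the same place as four. Real models only approximate the velocity, but this is the intuition for why they hold up at low step counts.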

    Architecture

    All of that evolution was about the loss function and sampling, and we haven’t really discussed the model architecture itself. The original diffusion models used an approach called UNets: a series of convolutions that compressed the (latent) visual information into fewer dimensions, then expanded it back up (giving a sort of U shape). But post-ChatGPT the Transformer was ascendant, so in 2023 Peebles and Xie proposed swapping out the UNet for a stack of transformers in the Diffusion Transformers (DiT) paper. 

    class DiTTiny(nn.Module):
        def __init__(self, embed_dim=256, depth=6):
            super().__init__()
            # Patchify the image (like ViT)
            self.patch_embed = PatchEmbed(patch_size=2)
    
            # Stack of transformer blocks
            self.blocks = nn.ModuleList([
                TransformerBlock(embed_dim) for _ in range(depth)
            ])
            # pos_embed, time_embed and unpatchify are omitted here for brevity
    
        def forward(self, x, t, cond=None):
            # Convert image to patches
            x = self.patch_embed(x)  # (B, num_patches, embed_dim)
    
            # Add positional encoding
            x = x + self.pos_embed
    
            # Embed the timestep so each block can condition on it
            t_emb = self.time_embed(t)  # time_embed is an assumed helper, not shown
    
            # Transform through attention layers
            for block in self.blocks:
                x = block(x, t_emb)
    
            # Reshape back to image
            return self.unpatchify(x)

    This looks like a regular transformer but with patches (segments of the image) rather than text tokens, as in ViT understanding models. The transformer block will also look pretty familiar:

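    class TransformerBlock(nn.Module):
      def __init__(self, dim, heads=8, mlp_ratio=4.0):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
          nn.Linear(dim, int(dim*mlp_ratio)), nn.GELU(), nn.Linear(int(dim*mlp_ratio), dim)
        )
    
      def forward(self, x, t_emb=None):
        # Simplest possible conditioning (an assumption for this sketch):
        # add the (B, dim) timestep embedding to every token. Real DiTs use adaLN.
        if t_emb is not None:
          x = x + t_emb.unsqueeze(1)
        # Pre-norm self-attention, then pre-norm MLP, each with a residual
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.ln2(x))
        
        return x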

    They got good results and more importantly it was easier to scale up to more compute and larger inputs. For what it’s worth, I found DiTs a bit tricky for training on small data sets (like the MNIST example), but didn’t spend much time on it, since: 

    MMDiTs emerged in 2024, and were used for Stable Diffusion 3 and Flux, largely setting the standard in terms of image quality. The idea is to process images and text in parallel with the ability to attend across each other, reminiscent of cross-encoder models.

    class MMDiTTiny(nn.Module):
        def __init__(self, patch_dim, img_dim=256, txt_dim=256, depth=6):
            super().__init__()
            # Separate encoders for each modality
            # (patch_dim = flattened size of one image patch)
            self.img_encoder = nn.Linear(patch_dim, img_dim)
            self.txt_encoder = nn.Linear(txt_dim, txt_dim)
    
            # Joint transformer blocks
            self.blocks = nn.ModuleList([
                CrossTransformerBlock(img_dim, txt_dim) for _ in range(depth)
            ])
    
        def forward(self, img, t, txt=None):
            # Process both modalities
            img_tokens = self.img_encoder(patchify(img))
            txt_tokens = self.txt_encoder(txt) if txt is not None else None
    
            # Bidirectional attention between modalities
            for block in self.blocks:
                img_tokens, txt_tokens = block(img_tokens, txt_tokens, t)
    
            return unpatchify(img_tokens)

    MMDiT models demonstrate great prompt adherence and can handle complex requests. The bidirectional flow means text understanding improves alongside image generation.

    class CrossTransformerBlock(nn.Module):
        """Cross-attention: query = image tokens, key/value = text tokens."""
    
        def __init__(self, dim_img, dim_txt, heads=8, mlp_ratio=4.0):
            super().__init__()
            # Project text into the image dimension so attention shapes line up
            self.q_proj = nn.Linear(dim_img, dim_img)
            self.k_proj = nn.Linear(dim_txt, dim_img)
            self.v_proj = nn.Linear(dim_txt, dim_img)
    
            self.attn = nn.MultiheadAttention(dim_img, heads, batch_first=True)
    
            self.ln_q = nn.LayerNorm(dim_img)
            self.ln = nn.LayerNorm(dim_img)
            self.mlp = nn.Sequential(
                nn.Linear(dim_img, int(dim_img*mlp_ratio)), nn.GELU(), nn.Linear(int(dim_img*mlp_ratio), dim_img)
            )
    
        def forward(self, x_img, x_txt):
            q = self.q_proj(self.ln_q(x_img))
            k = self.k_proj(x_txt)
            v = self.v_proj(x_txt)
    
            # Residual connections keep the image tokens as the main stream
            x = x_img + self.attn(q, k, v, need_weights=False)[0]
            x = x + self.mlp(self.ln(x))
    
            return x

    Here, in the cross attention block the image is used for the Query part and the text for the K and V parts of the attention. The results are combined with the image input. 

    Putting this all together, you can see the evolution of the common diffusion baselines across both scale and steerability:

    1. DDPM: Clean but slow. The baseline everything else improves on.
    2. SD1-style (UNet + Epsilon + CFG): The first practical system. Good quality, reasonable speed, follows prompts well with CFG.
    3. SD2-style (UNet + V-param + CFG): Slightly better contrast and stability, especially at high resolutions.
    4. SD3-style (MMDiT + Flow): The current state-of-the-art. Fastest training, best prompt adherence, most efficient sampling.

    Back to Qwen

    The Qwen-Image model is a good, practical example of scaling this up. It uses an existing multimodal model2 to encode text and image inputs, a pretrained VAE3 for translating between pixel and latent space, and then as its backbone an MMDiT. The use of strong (understanding) models for encoding helps really enhance the steerability of the results in the MMDiT. 

    In the MMDiT sketch above we just concatenate image and text together. In real systems we first add the positional embeddings for the image tokens, then add on text tokens. This works, but makes it difficult to adapt to different image resolutions.

    Seedream introduced Scaling RoPE4, which instead puts the image positional encoding in the middle of the image, treats the text tokens as 2D shapes [1,L], then applies 2D RoPE to both text and image tokens. This worked better, but had some problems where the positions were confusable between text and image latents, meaning the model couldn’t properly differentiate in some cases. The Qwen team updates this by implementing positional encoding across both dimensions of the text tokens, and concatenating the text along the diagonal of the image:

    This design allows MSRoPE to leverage resolution scaling advantages on the image side while maintaining functional equivalence to 1D-RoPE on the text side, thereby obviating the need to determine the optimal positional encoding for text.

    The resolution independence is important for the training recipe. The model is progressively trained with images starting at 256×256 and increasing in steps up to 1328x, in a variety of aspect ratios. They follow it up with post-training consisting of SFT on organized, high quality image-text pairs and DPO against preference pairs judged by human raters5. Finally, they do a GRPO stage with a “reward model”, though it isn’t clear if that’s based on the aforementioned preference data or is some other secret sauce. 

    While we don’t know how GPT-image is trained, this recipe certainly gave some comparable results. I was surprised to learn that the combination of a strong text and image encoding model plus MMDiT6 gives this level of steerability and fidelity. As usual, it’s exciting to have open models and papers to bring these concepts together! 

    1. It’s w because the noise is a Wiener process, also known as standard Brownian motion. I am heavily conditioned to think of this as the motion in a cup of tea thanks to HHGTTG ↩︎
    2. Qwen 2.5-VL ↩︎
    3. Interestingly, a video one from Wan-2.1 ↩︎
    4. Roughly the same idea as Column-wise position encoding, as I understand it. ↩︎
    5. The same prompt with two different seeds, and (if present) a reference image ↩︎
    6. And a lot of very carefully curated and programmatically generated data, to be fair ↩︎
  • Rubrics

    Pre-training is about making AI correct, post-training is about making AI helpful1. That helpfulness is (primarily) shaped by reinforcement learning. RL for LLMs really took off with RLHF (RL from Human Feedback), which trained based on the score from a reward model.

    The reward model was designed to score responses based on how well they met certain preferences, and the preferences were inferred from a set of human ratings: the graders were told what to look for in pairs of responses, and the reward model was trained to predict what they would pick. This worked, but was gated on how much signal you could get into the reward model and hence how many humans you had to generate preference data.
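
    The training signal for that kind of reward model is usually a pairwise loss. The post doesn't say which formulation was used, so treat this as a sketch of the common Bradley-Terry version; the function name and the toy scores are made up.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """-log sigmoid(r_chosen - r_rejected), written in a numerically safe form."""
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))


agrees = preference_loss(2.0, 0.0)      # model scores the human's pick higher
undecided = preference_loss(0.0, 0.0)   # model can't tell them apart
disagrees = preference_loss(0.0, 2.0)   # model prefers the rejected response
```

    The loss only depends on the gap between the two scores, which is why the reward model needs lots of rated pairs rather than absolute grades.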

    RLAIF (RL from AI Feedback) naturally extended this to using an LLM to make the preference picks rather than humans2. Folks also started to use LLMs in an LLM-as-Judge pattern for evaluation after training: give the model a list of criteria, and ask it to rate how well the responses meet them. 

    The next notable step was RLVR (RL with Verifiable Rewards), which uses ground-truth data to provide reward scores instead of a model. For example, a math problem might have a defined numeric answer, or a generated proof could be verified by a dedicated theorem prover program. This turned out to work very well for code and math and led to the o-series of OpenAI models3 and many open reasoners, particularly Deepseek R1. 

    It’s a pretty natural idea to take a verifiable reward pipeline and plug in AI scoring directly: rather than having a model generate preference pairs and training a separate reward model, give the model criteria and ask it how well the response satisfies them. This means instead of letting a model work out what “good code” looks like from pairs of different (but similar!) solutions to a problem, you have a model working through a checklist, asking things like “Does it have types? Does it have comments? Would your coworkers hate you if you landed this?”
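
    The simplest kind of verifiable reward is a plain function, with no model in the loop. A minimal sketch for the numeric-answer case: the "take the last number in the response" rule and the function names are assumptions for illustration, and real pipelines are stricter about answer formats.

```python
import re


def extract_final_number(text):
    """Pull the last number out of a response, or None if there isn't one."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text)
    return float(matches[-1]) if matches else None


def verifiable_reward(response, expected, tol=1e-6):
    """1.0 if the final number matches the known answer, else 0.0."""
    answer = extract_final_number(response)
    if answer is None:
        return 0.0
    return 1.0 if abs(answer - expected) <= tol else 0.0


right = verifiable_reward("12 * 4 = 48, so the answer is 48", 48)
wrong = verifiable_reward("I think it's 50", 48)
missing = verifiable_reward("No idea, sorry", 48)
```

    There's nothing to train and nothing to argue with, which is a big part of why this worked so well for math and code.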

    These checklists are referred to as rubrics, and Snorkel have started an interesting looking blog series introducing rubrics, which offers a definition: 

    A rubric is a structured guide that spells out what “good” looks like for each response from an AI system. 

    A rubric consists of:

    • A list of criteria: Does the code compile? Does it have comments?
    • How the model performed on each criterion: “Compiles” could be yes/no. It could also be more nuanced: yes/yes with warnings/no.
    • Scoring rules that turn performance into numbers: Clean = 0. Warnings = 1. No = 2.

    In Nathan Lambert’s recent interview with Ross Taylor, Taylor calls rubrics out as an underappreciated research opportunity, particularly for agentic training:

    Rubrics are underhyped on social media – they were driving force behind projects like DeepResearch – and GenRMs are interesting but perhaps slightly overhyped.

    This caught my eye, as Moonshot leveraged rubric-based rewards heavily in Kimi K2, notably using the model they were training as the judge of itself: 

    The framework operates using a Self-Critique Rubric Reward mechanism, where the model evaluates its own outputs to generate preference signals. To bootstrap K2 as a competent judge, we curated a mixture of open-source and in-house preference datasets and initialize its critic capability in the SFT stage.

    One of the core values of rubrics is that they work for both LLMs and humans. You can iterate on rubrics with people, scale them with LLMs, and spot-check LLM results with human raters to ensure reliability. 

    The paper [2507.17746] Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains formalizes them as a full peer to Verifiable Rewards. The paper sets up rubrics so each criterion is a simple pass/fail and each has a predefined importance weight. They normalize everything so the system can’t get gamed by just adding more criteria4, and then plug the resulting score into the RL loop5.
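
    As a rough illustration of that normalized scoring, here is a minimal sketch that assumes pass/fail criteria with positive weights and divides by the total weight. The `Criterion` class, the `rubric_reward` function and the example weights are all hypothetical, not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    description: str
    weight: float  # predefined importance weight (values below are made up)
    passed: bool   # pass/fail verdict, e.g. from an LLM judge


def rubric_reward(criteria: list[Criterion]) -> float:
    """Weighted fraction of criteria passed, normalized to [0, 1]."""
    total_weight = sum(c.weight for c in criteria)
    if total_weight == 0:
        return 0.0
    earned = sum(c.weight for c in criteria if c.passed)
    # Dividing by the total weight keeps the score in the same range
    # no matter how many criteria a prompt's rubric has.
    return earned / total_weight


rubric = [
    Criterion("Code compiles", weight=1.0, passed=True),
    Criterion("Has type annotations", weight=0.7, passed=True),
    Criterion("Has comments", weight=0.3, passed=False),
]
reward = rubric_reward(rubric)  # scalar that would be fed to the RL loop
```

    Because the score is a ratio, a prompt with twenty criteria and a prompt with seven both produce rewards on the same scale.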

    Of course, you actually have to write the rubrics, which leads to a specificity versus generality tradeoff: take more time to write more rubrics or rely on fewer, more general ones. The RaR paper makes it clear that more is better:

    predefined generic rubrics substantially underperform compared to prompt-specific ones, underscoring the importance of contextualization. Rubrics that include a broader range of criteria—both positive and negative—consistently outperform those limited to essential checks, suggesting that richer evaluation signals lead to better learning.

    As you might have guessed, the solution was more LLM: use a model to generate prompt-specific rubrics:  

    For each domain, the prompt (included in Appendix H) instructs the LLM to generate 7–20 rubric items based on the complexity of the input question. Each item is assigned a categorical weight (e.g., Essential Criteria, Important Criteria) to determine its importance to a correct answer. The rubrics are designed to be fully self-contained which means that non-expert readers should be able to evaluate response quality using only the rubric. 

    This particularly benefited from having a reference answer attached to the prompt. The models do a much better job of coming up with a good rubric if provided with a (human generated) “good” answer to judge against rather than just the question/prompt. This really opens the door to 1:1 rubrics: given questions and reference answers, you can generate a scoring checklist for each one and mix it with verifiable rewards during post-training. 

    The field continues to be turtles all the way down: using LLMs to write rubrics to have LLM judges evaluate LLM training outputs. At some point, someone’s going to suggest we use rubrics to evaluate how good our rubrics are, and honestly, I’m surprised that paper doesn’t already exist6.

    1. Correct in predicting the next token, and helpful, honest and harmless, specifically. ↩︎
    2. With humans still looped in to validate that the ratings were reasonable. The human graders went from generating ratings to rating the raters. ↩︎
    3. This is the part where everyone pretends they know exactly how O1 works, but actually we’re all just pattern-matching from breadcrumbs ↩︎
    4. Else we’d risk giving more focus to problems with more rubrics, and end up with something unthinkable like a coding model that liberally sprinkles emojis everywhere ↩︎
    5. In practice, they also tried a single LLM judge that took in all criteria and weights and generated a scalar reward, which seemed to work fine. ↩︎
    6. It probably does, I’m just scared to look ↩︎
  • Overthinking Everything

    Yann was taking laps on Threads a few weeks back over a recent paper, one of several recently that have explored how autoregressive models fare as the amount of information they are dealing with grows. His general complaint is that each token they generate can either push them towards the right answer or further away from it, and that the models are inherently bad at recovering if they end up too far outside the correct trajectory.

    This “more might be worse” idea shows up anywhere folks are leveraging large context windows, and one of those1 is in agentic tasks. This post summarizes some research trying to measure the fall-off in chances of succeeding as task length2 increases.

    Is there a Half-Life for the Success Rates of AI Agents? — Toby Ord

    It provides indirect evidence that what really is going on under the hood is that tasks are made up of many sequential subtasks and the chance of succeeding at the whole requires succeeding at every individual component. Moreover, this suggests that the current AI agents are not very good at recovering from earlier mistakes.

    The framing they use is a constant hazard rate: each subtask is another roll of the dice, and if you roll a failure you don’t have much chance of recovering. So more (or longer) is pretty much always worse.
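
    A constant hazard rate implies exponential decay in success probability, which is where the “half-life” in the title comes from. A small sketch, with a made-up 60-minute half-life and hypothetical function names:

```python
import math


def success_probability(task_minutes: float, half_life_minutes: float) -> float:
    """Survival under a constant hazard rate: S(t) = 0.5 ** (t / half_life)."""
    return 0.5 ** (task_minutes / half_life_minutes)


def hazard_rate(half_life_minutes: float) -> float:
    """Per-minute hazard rate lambda, so that S(t) = exp(-lambda * t)."""
    return math.log(2) / half_life_minutes


half_life = 60.0  # hypothetical agent: 50% success on one-hour tasks
for minutes in (15, 60, 240):
    p = success_probability(minutes, half_life)
    print(f"{minutes:>4} min task: {p:.3f} chance of success")
```

    Doubling the task length squares the success probability, so an agent at 50% on one-hour tasks would be at 25% on two-hour tasks under this model.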

    One interesting aspect is that they also investigate the human failure rate, which increases over time, but much more slowly:

    This could indicate a different scaling behaviour of success rate with time horizon for humans compared to AI agents, which would be well worth investigating and may suggest important underlying mechanisms (e.g. that the humans were better at correcting earlier failed subtasks). If human performance scales differently with task length than AI agent performance, that would be an important result, suggesting that there is a notable inefficiency in the current AI paradigm.

    They’re testing with multiple runs, so these aren’t just models hitting problems they can’t do: it’s models hitting problems they can’t do given the specific tokens they have already generated.

    Agentic use cases aren’t the only situation where a model is generating responses that add to its context window. There were a lot of early observations after the release of O1 last year that thinking for longer on easy problems does not add value. This recent paper goes further, suggesting that there is an inverse scaling law: more time thinking makes the model worse.

    [2507.14417] Inverse Scaling in Test-Time Compute

    More specifically, they devised some stress tests: things like counting problems in the presence of distracting information, performing a regression where there is easy-to-understand but spurious data, and so on. The performance drops as the trace length increases. Different models are more susceptible to some failure modes than others, but performance consistently drops:

    Our experiments reveal distinct failure modes across model families—Claude models are particularly vulnerable to distraction from irrelevant information, while OpenAI o-series models show greater resistance but overfit to familiar problem framings. Extended reasoning amplifies different weaknesses: models overthink simple problems, shift attention to spurious correlations, and lose focus during Deduction tasks with constraint tracking.

    In contrast, Chroma’s recent technical report investigates how models do on single prompts with increasingly long contexts.

    Context Rot: How Increasing Input Tokens Impacts LLM Performance | Chroma Research

    Unlike in the agentic case, here all of the context is passed in at once, so the model isn’t poisoning its own context window through bad choices. It is still dealing with a large amount of content where it needs to choose which parts to attend to. Traditionally the test of long context has been needle-in-a-haystack evaluations: a relevant fact is hidden at different points in a long prompt and the test evaluates whether the model can effectively pull it out.

    The Chroma folks make the test a lot more nuanced — adding distractors3 and irrelevant content in both the broader context and the question. They find that performance consistently degrades as context increases.

    More broadly, our findings point to the importance of context engineering: the careful construction and management of a model’s context window. Where and how information is presented in a model’s context strongly influences task performance

    All of these papers rhyme with LeCun’s gripe about autoregressive transformers, which is (roughly!) that they (also) have a constant hazard rate on generating the “right” token.

    This is a very active area of research though. Process-based rewards in RL training make updates on each step vs only at the end. Multi-token prediction reduces the effective generation length or number of chances of misprediction. Summarizing context effectively compresses existing tokens, also reducing error rate.

    Similarly, if you have good verifiers4 you can use beam or tree searches to explore multiple reasoning paths during generation, which can reduce the error rate at the cost of more compute.
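
    To make the verifier-guided search idea concrete, here is a toy sketch of beam search where a scoring function stands in for the verifier. The “reasoning steps” are just integers that need to sum to a target, and every name here (`verifier_beam_search`, `expand`, `verifier`) is hypothetical.

```python
from typing import Callable, Iterable

Path = tuple[int, ...]


def verifier_beam_search(
    start: Path,
    expand: Callable[[Path], Iterable[Path]],
    verifier: Callable[[Path], float],
    beam_width: int,
    max_steps: int,
) -> Path:
    """Keep the `beam_width` partial paths the verifier likes best at each step."""
    beam = [start]
    for _ in range(max_steps):
        candidates = [nxt for path in beam for nxt in expand(path)]
        if not candidates:
            break
        candidates.sort(key=verifier, reverse=True)
        beam = candidates[:beam_width]
    return max(beam, key=verifier)


# Toy stand-in for a reasoning task: pick three steps that sum to TARGET.
TARGET = 11
STEP_SIZES = (1, 3, 5)


def expand(path: Path) -> list[Path]:
    return [path + (step,) for step in STEP_SIZES]


def verifier(path: Path) -> float:
    # Higher is better; 0 means the path hits the target exactly.
    return -abs(TARGET - sum(path))


best = verifier_beam_search((), expand, verifier, beam_width=2, max_steps=3)
```

    The extra compute shows up as `beam_width` times more candidates scored per step; with a real model, each candidate would be a partial generation and each score a verifier call.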

    The closest (LLMish) techniques to LeCun’s vision are things like the recent Hierarchical Reasoning Model that has a layer of persisting hidden state, but it’s still pretty experimental!

    As agentic and reasoning traces get longer, I’m sure we’ll see more entries documenting failure modes, and proposals for techniques to scale around them.

    1. And the one being referenced in the post! ↩︎
    2. In time — they characterize tasks based on how long it takes humans to do them, which is a good control factor ↩︎
    3. As in additional content related to the question, but that doesn’t give the answer. ↩︎
    4. Similar to process-based rewards this is somewhat pushing the problem to the ability to judge how well you are doing during the generation ↩︎
  • The Tools Are Made Up

    It has been hard to keep up with the flurry of strong agentic open-source models coming out of Chinese labs recently, including Moonshot’s Kimi K2, Z.ai’s GLM 4.5, and Qwen3-Coder1.

    Each of them has a mix of clever pre-training recipes and verifiable-rewards post-training. Notably, Kimi and GLM both use the Muon optimizer, which seems to be gaining ground among the OSS labs at least. GLM’s description of the recipe is as follows:

    Our base model undergoes several training stages. During pre-training, the model is first trained on 15T tokens of a general pre-training corpus, followed by 7T tokens of a code & reasoning corpus. After pre-training, we introduce additional stages to further enhance the model’s performance on key downstream domains. Unlike the earlier pre-training stage on large-scale universal documents, these stages leverage medium-sized domain-specific datasets, including instruction data.

    The additional stages, which they refer to as mid-training, extend the context window and help grow capabilities in specific domains. They then move to post-training, with SFT over reasoning and agentic traces followed by RL with Verifiable Rewards2.

    The Kimi-K2 technical report goes into more detail about how to actually train for tool use. Unlike the others, Kimi is not a reasoning model, so it doesn’t use much in the way of extended thinking. The fact that this wasn’t required to get to strong levels of tool use/agentic capability feels pretty notable to me: most of the recent3 agentic models have been built on a reasoning foundation.

    What I really found interesting from the Kimi report was the level of synthetic data that the team used. This starts in pretraining: to extend high quality data sources they rewrite them with another LLM, giving the same facts with new phrasing, instead of looping over the same “good” data for multiple epochs.

    Their approach to tool training takes this kind of idea even further:

    We construct a comprehensive tool repository through two complementary approaches. First, we directly fetch over 3,000 real MCP (Model Context Protocol) tools from GitHub repositories, leveraging existing high-quality tool specifications. Second, we systematically evolve 82 synthetic tools through a hierarchical domain generation process. We begin with key categories (e.g., financial trading, software applications, robot control), then evolve multiple specific application domains within each category. Specialized tools are then synthesized for each domain, with clear interfaces, descriptions, and operational semantics. This evolution process produces over 20,000 synthetic tools.

    They analyze a set of real tools, generate some novel (but derivative) ones, then domain-specialize them for a lot of use cases.

    Once they have this tool zoo, the actual training loop involves:

    1. Randomly sample a subset of tools and give it to a new agent with a fresh system prompt. Generate tool-appropriate tasks with explicit success rubrics.
    2. Run an LLM-driven user simulator to drive the agent, while running the tools in a sandbox that keeps state.
    3. Filter trajectories using another LLM as judge to keep only successful ones for SFT.

    They’re using models at every stage to generate data and evaluate options. For the actual RL stages, they rely on verifiable rewards wherever possible. Both they and the Qwen folks talk about their simulator setup for code4: thousands of sandbox environments.

    For software engineering tasks, we collect a vast amount of pull requests and issues from GitHub to build software development environment that consists of user prompts/issues and executable unit tests. This environment was built on a robust sandbox infrastructure, powered by Kubernetes for scalability and security. It supports over 10,000 concurrent sandbox instances with stable performance, making it ideal for both competitive coding and software engineering tasks

    The combination of very sophisticated synthetic data and operationally intense sandboxes seems like table stakes for the current agentic game, and one which a lot of labs have figured out. It feels very promising for growth in the capabilities of these models over time, particularly as we work out how best to distill them down to smaller sizes for inference.

    1. Which seems a very solid model, but they haven’t released a lot of extra details about how they got there. One interesting component of the release though was that they forked Gemini CLI to make a qwen-code tool that works with any OpenAI compatible API, and I had some success locally plugging it into the smaller Qwen3 (non-coder) releases in case you were looking for some offline agentic capabilities! ↩︎
    2. Then GLM is distilled between the RL and base version of the model, which apparently helps generalize. This seems like a fun and relatively simple way of smoothing out the learning. ↩︎
    3. Though Claude 3.5 wasn’t, and that is really the trend-setter here I guess! ↩︎
    4. And other tasks that allow fully verifiable rewards. They use other models to score softer domains like creative writing. ↩︎

  • Post-Training & Elicitation

    Nathan Lambert of the Allen Institute writes about their (very strong) Olmo 2 32B release, and the just released Gemma 3 model from Google. One of the many interesting points:

    Comparing Gemma 3 27B to OLMo 32B, the pretraining evaluations for both are super similar, but Gemma 3 scores are way better after post-training. The ceiling on post-training expectations has been shifting extremely fast among open models.

    Given that Google have about the best crawling infrastructure in the world, and that Ai2 have published the complete pretraining dataset used for Olmo, I think this is slightly surprising. You can see the benchmarks in the blog and technical report: for example, Gemma 3 27B gets 78.8 on WinoGrande from pretraining (a little below Gemma 2 as it happens) while Olmo 2 32B gets 78.7.

    The vibes have definitely shifted to post-training for where model differentiation is coming from, opening the question of what exactly is happening there. Nathan also posted about that recently, linking to this post by Mohit Raghavendra of Scale and Georgia Tech:

    The post looks at The Superficial Alignment Hypothesis, which is (largely) that post-training is just about preference tuning for behaviors the base model can already do

    […]

    It initially seems like “Less Is More” in the sense that the LIMA model response was highly preferred by the GPT-4 evaluator for Math prompts (in-line with the work’s original claim). However, these model responses were also largely incorrect – the accuracy of models fine-tuned specifically for Math was substantially better, with the same data budget. If we went by subjective win-rate comparisons, we would have picked a model that was significantly worse.

    In the post (and the two linked papers) Mohit breaks down how post-training actually helps. Starting with SFT, the work shows that mimicking style happens quickly, with relatively few samples.

    with just a hundred finetuning examples, the model’s formatting mistakes were virtually solved – the model was perfect at mimicking the expected style. However, the model took a lot more supervised finetuning data to get better at reasoning – the substance of the task.

    They find though that, largely, more-is-more when it comes to SFT, but that there is a power-law style scaling curve: big gains initially followed by slower, marginal gains. Adding in RL doesn’t change the fundamental curve, but it does shift it, leading more efficiently to the model gaining the reasoning capabilities they were training towards:

    Preference data offers a weaker signal compared to supervised finetuning data. So, running DPO directly on the base model on reasoning tasks, is asking the model to learn a completely different response style from its reference model, with a weaker signal, while penalizing for being different from the reference model. Small amount of SFT on the base model teaches it the reasoning style and PFT can use the reward signal to focus on reasoning within the required response space.

    I did wonder when reading this whether the results would look different with an online process (like PPO), rather than an offline one. Luckily, Mohit links to another recent paper on this topic:

    We prove that under idealized assumptions, online and offline PFT techniques should return policies of equivalent quality

    but also

    we observe that despite the lack of information-theoretic separation, online PFT out-performs offline PFT across different sampling distributions, labelers, and model sizes. Furthermore, it appears to be “easier” to learn a global RM than it is to learn a local RM, leading to higher validation likelihood.

    The result here seems to be that the reward model is simply easier to model, and it helps “translate” the problem of the distribution.

    This all feels like a continuum: at some level the superficial alignment hypothesis is directionally correct but it’s not that “superficial”: the base models have a lot of capabilities that are hard to elicit, and fine tuning/post training can juice them effectively, while adding some learning of its own (as more data is better!).

    The best way of performing that elicitation turns out to be solving different problems at different levels: SFT for format, then RL for the deeper capability, and having a reward model effectively simplifies the learning process again.

  • Byte-Latent Transformers

    Who needs a tokenizer anyway!

    [2412.09871] Byte Latent Transformer: Patches Scale Better Than Tokens

    This paper, from back in December last year, presents an interesting approach to handling raw byte sequences in LLMs without relying on tokenization.

    Vocab sizes for tokenizers have gone up over the last couple of years with attendant gains in usefulness, but this remains a particularly hand-tuned number in the training process. BLT proposes a method that processes raw UTF-8 byte sequences directly, leveraging a dynamic patching mechanism to group bytes into variable-length patches based on entropy.

    Higher-entropy regions receive more attention and shorter patches, while lower-entropy regions can be processed more efficiently.
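
    To give a feel for entropy-based patching, here is a minimal sketch that starts a new patch whenever the estimated next-byte entropy crosses a threshold. A bigram count model fit on the input itself stands in for a learned entropy model, and the function names and threshold are hypothetical, so treat this as an illustration of the idea rather than the paper’s method.

```python
import math
from collections import Counter, defaultdict


def next_byte_entropies(data: bytes) -> list[float]:
    """Estimated entropy (bits) of the byte at each position, given the previous byte.

    Stand-in estimator: bigram counts over `data` itself. Position 0 has no
    context, so it gets the unigram entropy instead.
    """
    if not data:
        return []
    following: dict[int, Counter] = defaultdict(Counter)
    for prev, nxt in zip(data, data[1:]):
        following[prev][nxt] += 1

    def entropy(counts: Counter) -> float:
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    return [entropy(Counter(data))] + [entropy(following[prev]) for prev in data[:-1]]


def entropy_patches(data: bytes, threshold: float) -> list[bytes]:
    """Split `data` into variable-length patches at high-entropy positions."""
    entropies = next_byte_entropies(data)
    patches, start = [], 0
    for i in range(1, len(data)):
        if entropies[i] > threshold:  # hard-to-predict byte: begin a new patch
            patches.append(data[start:i])
            start = i
    if data:
        patches.append(data[start:])
    return patches


text = b"the cat sat on the mat. the cat sat on the hat."
patches = entropy_patches(text, threshold=1.0)
```

    Predictable runs end up in longer patches, while positions where the next byte is uncertain (for example after a space) tend to start new ones.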

    There are conceptually three levels of processing:

    • Local Encoder: A small transformer stack encodes raw byte sequences into higher level representations, which are then structured into patches.
    • Latent Global Transformer: A standard large transformer model operating on patch-level representations
    • Local Decoder: The encoded patches are decoded back into byte sequences, using a cross-attention mechanism to reconstruct text.

    In the paper they show they can achieve parity in pretraining with a traditional tokenized approach in Llama for similar parameter count, while being more robust and offering some inference time performance gains. The patching approach allows for allocating compute where needed most.

    Retrofitting existing models

    One of the ideas I found most interesting is starting with a traditionally pretrained model. The paper discusses using the main transformer layers from Llama and training the byte latent approach successfully.

    I gave the approach a go with a simplified local encoder, entropy and patching approach, and took the transformer layers from Qwen 2.5 3B, a strong model that could still be trained locally (no corporate resources were harmed, etc).

    The basic approach was replacing the tokenizer, adding a small transformer and patch pooling based on a local entropy measure to generate patches, then cross-attending in some of the Qwen layers. It’s training a new encoder while leveraging Qwen for the backbone of the global transformer and adding new cross-attention params to make it also the decoder, with the embedding layers at each end chopped off, so a significant domain shift. For inference I leverage the same patch generation process to try and generate effective tokens.

    You can find my Torchtune recipe on GitHub, running through the Alpaca dataset. Thus far I’m still training so while loss is improving, I have no idea whether it will turn into something useful. The fact that there is something trainable is fun though, and I have hopes that this kind of technique will lead to some breakthroughs in tokenizer-free models in the future!

  • GRPO & Verifiable Rewards

    GRPO (Group Relative Policy Optimization) is an RL technique originally proposed in the DeepSeekMath paper. Instead of using a full-blown value network like PPO does, GRPO samples a group of completions for a given prompt and then computes a relative (normalized) reward for each output. The rewards are “verifiable” because they come from checking the response against ground truth: for example, does it follow the expected format (i.e. a <think>…</think> block for reasoning and an <answer>…</answer> block for the solution), and is the answer accurate against a predetermined fact? Not every problem fits this model, but there are a bunch that do, including math reasoning with the GSM8K dataset of grade-school math word problems. These look like this:

    “Julie is reading a 120-page book. Yesterday, she was able to read 12 pages and today, she read twice as many pages as yesterday. If she wants to read half of the remaining pages tomorrow, how many pages should she read?”
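
    For that problem the ground truth is 42: 12 + 24 = 36 pages read, 84 remaining, half of which is 42. A verifiable reward for it might look like the sketch below, which checks the format and then the answer. The function name and the 0.2/0.8 split between format and accuracy are arbitrary assumptions for illustration.

```python
import re

# Expect exactly one <think> block followed by one <answer> block.
RESPONSE_PATTERN = re.compile(
    r"^\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)


def verifiable_reward(response: str, ground_truth: str) -> float:
    """Score in [0, 1]: partial credit for format, the rest for a correct answer."""
    match = RESPONSE_PATTERN.match(response)
    if match is None:
        return 0.0  # wrong format: nothing to verify
    reward = 0.2    # format followed (weight is an arbitrary choice)
    if match.group(2).strip() == ground_truth:
        reward += 0.8  # answer matches the predetermined fact
    return reward


good = "<think>12 + 24 = 36 read, 84 remain, half is 42.</think><answer>42</answer>"
wrong = "<think>Half of 120 is 60.</think><answer>60</answer>"
unformatted = "She should read 42 pages."

scores = [verifiable_reward(r, "42") for r in (good, wrong, unformatted)]
```

    Note that the unformatted response gets no credit even though it contains the right number, which is exactly the pressure that teaches the model the expected structure.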

    How Does the Training Work?

    1. Sampling Completions: For each prompt, the model generates a group of candidate completions. These are produced in inference mode (gradients aren’t collected) using a KV cache for speed (or a dedicated inference engine like vLLM).
    2. Verifiable Reward Calculation: Each completion is scored between 0 and 1—rewarding outputs that follow the prescribed format and yield the correct answer.
    3. Forward Pass for Gradients: Both the “policy” (the model being tuned) and a reference (typically the base, unmodified model) are used for a forward pass with the prompt and completions to compute per-token logits and log-probabilities.
    4. Loss and Backwards: The loss is then calculated as a combination of the (group-averaged) reward and a KL divergence term between the tuned model and the baseline, to constrain learning to similar responses. This loss is backpropagated through the policy model based on the earlier forward pass.
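
    The arithmetic in steps 2 to 4 can be sketched with plain floats. This is a simplified illustration, not the recipe’s code: it omits the clipped probability ratio, uses the population standard deviation for normalization, and the names (`group_relative_advantages`, `kl_estimate`, `grpo_loss`) and the `kl_coeff` value are assumptions. A real implementation would do the same thing on tensors so the loss can be backpropagated.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the other completions for the same prompt."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def kl_estimate(policy_logprob: float, ref_logprob: float) -> float:
    """Per-token KL estimator exp(d) - d - 1 with d = ref - policy; never negative."""
    d = ref_logprob - policy_logprob
    return math.exp(d) - d - 1.0


def grpo_loss(
    rewards: list[float],
    policy_logprobs: list[list[float]],  # per-token log-probs, one list per completion
    ref_logprobs: list[list[float]],     # same shape, from the reference model
    kl_coeff: float = 0.04,              # placeholder value
) -> float:
    advantages = group_relative_advantages(rewards)
    per_completion = []
    for adv, pol, ref in zip(advantages, policy_logprobs, ref_logprobs):
        # Advantage-weighted log-prob minus a KL penalty, averaged over tokens.
        terms = [adv * lp - kl_coeff * kl_estimate(lp, rlp) for lp, rlp in zip(pol, ref)]
        per_completion.append(mean(terms))
    return -mean(per_completion)  # negate: we minimize the loss


advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

    If every completion in a group gets the same reward, all advantages are zero and only the KL term contributes, which is why prompts that are always solved (or never solved) provide little learning signal.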

    Getting it going in TorchTune

    Over last weekend I hacked up a quick and dirty version of the training loop in TorchTune, and over a couple of bus rides to Menlo Park cleaned it up into something that could work as a more general recipe (PR). Most of the work goes into the recipe and getting the dataset shaped properly to generate completions. This version, tested on a smaller model (the 1B Llama 3.2 variant, with LoRA), showed some promising improvements, but I didn’t get to the point of having something converge enough to be confident in the overall recipe. In the DeepSeek R1 paper they had discussed trying a smaller model, but found 3B was the lowest they were able to get results on with some of their fine-tuning approaches.

    Luckily for everyone, at around the same time Ariel Kwiatkowski also put together a version that included distributed device support, making it easier to experiment on bigger models. This PR is more modular, and I’m excited to see it refined and landed so the recipe is widely available!

    There’s a growing energy around tools like torchtune, and it’s exciting to see how easy it is to “hack on” these ideas. It’s also great to see the techniques show up in other libraries, like HuggingFace’s TRL, which is being used as part of the OpenR1 replication effort!