Blog

  • We can distill it for you wholesale

    There has been a lot of drama1 about distillation: how (closed) frontier models are being used by other labs to boost their own performance on particularly hard tasks.

    The drama is not fake, exactly. Anthropic, and recently OpenAI, have a notable lead in the agentic-coding domain, and some of that is from having data that other people don’t. Getting it is… not cheap.

    This is why there are huge efforts going on at certain companies2 to develop long-form agentic trajectories. But! Not everyone has the money, or the engineers, to do that.

    So, there is an incentive to maybe, allegedly, copy some homework. It’s not clear though how exactly to do that: the frontier labs generally don’t share the chain-of-thought that their models are using while they reason, which means you only have a sparse signal to train your model on.

    One piece of the puzzle is in a paper from February this year, “Privileged Information Distillation for Language Models” by Emiliano Penaloza et al. at ServiceNow, which is probably not where most people are expecting the hot post-training discourse to come from. On-Policy Self-Distillation is spicy right now in post-training circles, and this is one of the earlier papers in the current zeitgeist3.

    The paper’s primary contribution is π-Distill: how do you do distillation when you have Privileged Information?

    “We ground our work in the task of distilling frontier models for complex multi-turn agentic settings. Typically, the industry standard for these tasks involves Supervised Fine-Tuning (SFT) on frontier model outputs followed by Reinforcement Learning (RL). Unfortunately, some model providers restrict important information, most notably the model’s full Chain-of-Thought (CoT) reasoning traces (OpenAI et al., 2024), providing only a summary alongside the action they intend to take. This opacity undermines standard distillation methods, as we can observe what successful agents do but not how they reason about it.”

    The rough idea is to not use the frontier model as a teacher, but to use it as a source of that privileged information:

    • You have one set of model weights, run in two modes: a privileged teacher, and an unprivileged student.
    • A frontier model solves a task in its tool-use harness. You may not see its chain-of-thought, but you can observe what it actually does: its action trajectory.
    • That action trajectory is converted into the privileged information: tool names, tool calls with arguments, or a compact hint.
    • The teacher-mode model sees the task/history plus this privileged trace in the prompt. The student-mode model only sees the task/history in its prompt.
    • The teacher rolls out a trajectory and gets an RL reward4.
    • The student is then trained with teacher forcing: calculating loss based on how likely it would be to predict the actual next token the teacher generated.
    • The teacher and student losses are combined and applied to the single shared set of weights.
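
    To make the teacher-forcing step a bit more concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the token names, the probability tables and the mixing weight alpha are all made up for illustration, and the teacher's RL loss is just passed in as a number.

```python
import math


def teacher_forced_nll(student_probs, teacher_tokens):
    """Mean negative log-likelihood of the teacher's tokens under the student.

    student_probs[t] maps token -> probability assigned by the student-mode
    model (no privileged info in its prompt) at step t, conditioned on the
    teacher's earlier tokens as the prefix. That conditioning is the
    "teacher forcing" part.
    """
    total = 0.0
    for probs, token in zip(student_probs, teacher_tokens):
        total += -math.log(probs[token])
    return total / len(teacher_tokens)


def combined_loss(teacher_rl_loss, student_nll, alpha=0.5):
    # alpha is a hypothetical mixing weight, not a value from the paper.
    return alpha * teacher_rl_loss + (1.0 - alpha) * student_nll


# Two steps of a toy teacher rollout, and what the student thought of them.
teacher_tokens = ["grep", "edit"]
student_probs = [
    {"grep": 0.5, "ls": 0.5},
    {"edit": 0.25, "cat": 0.75},
]

nll = teacher_forced_nll(student_probs, teacher_tokens)
loss = combined_loss(teacher_rl_loss=1.0, student_nll=nll)
```

    Both terms end up as gradients on the same weights, which is what lets the unprivileged mode absorb behavior the privileged mode found.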

    As the authors continue, it doesn’t even require a closed model to distill from. Other kinds of privileged information can help you do the same trick, which is the second variant of their recipe. If you don’t have an outside source but you do know some bonus details (e.g. hints on how to solve it, or critiques on prior attempts) you can pass them into the teacher:

    • Let the student roll out, without the privileged information.
    • Then ask the informed teacher how compatible the student’s tokens were with what the teacher would have done.

    The discussion about distillation has focused on the idea of stealing some kind of secret knowledge. What this method really shows though is that distillation is about turning information that the model will not have at test time into behaviors it will have.

    Like any good teacher, having a sense of how to get to the answer is going to make it easier to help your student. The “on-policy” part here is that the student and teacher are the same model; the difference is that the teacher is reading ahead in the study guide.

    As tasks get longer and tool use gets richer, agent traces get more valuable. The question is probably less “can labs hide the model’s reasoning?” and more “what clues can you train on?”

    1. And/or marketing. ↩︎
    2. Notably including the one I work at! ↩︎
    3. Another good read is “Self-Distilled Reasoner”, which was released a few days before this, and is where the name comes from! ↩︎
    4. With a KL penalty that keeps it from drifting too far from the student. ↩︎
  • Maybe the agents shouldn’t write the kernels

    A thing you can do is take the most performance and correctness sensitive part of your stack and just ask a chatbot to write it for you. They will sometimes get it right!

    Back towards the end of 2024 Ouyang et al at Stanford attempted to benchmark how often that happened with KernelBench. DeepSeek R1 could one-shot 12% of simple ops, 36% of fused operators, and 2% of whole architectures. Still, things have moved on a bit in the past 18 months, and Han, Zhang et al.1 extended the idea in KernelBench-X. They found:

    1. Writing correct kernels and performant kernels is somewhat decoupled. You can refine kernels and that mostly helps with correctness: refinement gets more of the tasks compiling, but drops the average speedup along the way.
    2. What you ask for matters more than how you ask. The category of the task explained 3x more of the variance than switching between different agents or other method varieties.

    “Together, these results indicate that the capability boundary of current LLM-based kernel generation is not a single wall but a sequence of distinct barriers – compilability, semantic correctness, hardware efficiency and performance portability – each requiring different mechanisms to clear”

    In one particular area they tried getting the models to write quantization kernels, an area known for needing numerical precision. They got 0 out of 30. The models produced running kernels, just not kernels that were, you know, right.

    One thing that did stand out to me was that a lot of the baselines were eager PyTorch, so I decided to run an experiment myself. How do these models do against a compiler, not just eager?

    I took a popular model (Qwen, naturally), ran it through torch.compile with minimal settings on my DGX Spark and identified three kernels that were eating big chunks of the wall time: SwiGLU, residual+RMSNorm and the SDPA prelude. I then had ChatGPT, Claude and Kimi2 take a run at writing those kernels for the hardware.

    The results were an absolute blowout: SwiGLU 1.06x, RMSNorm 1.21x, SDPA prelude 3.91x. For the latter, Kimi stacked up three different weight matrices into one and fused multiple matmuls together. It was very impressive stuff.

    Then, another suspiciously well-timed paper arrived, FASTKERNELS from Snowflake. Rather than benchmarking against PT Eager or against single-operator references, they wanted to test how the models did on real model-serving problems, with a focus on the end-to-end speedups. Their takeaway:

    “agents learn to generate kernels that score well in sandboxes but introduce interface incompatibilities, compilation-stack conflicts, and silent correctness degradation when integrated into real systems.”

    All of those issues hurt end-to-end performance when you put them in a real model-serving context. Of the three strong kernel-generation agents3 they tried, none beat the production baselines

    “in contrast to the supra-unity numbers these agents have reported on operator-level benchmarks whose reference is PyTorch eager.”

    Taking another look at my vibed-up experiment results, as the FASTKERNELS folks may have suggested, there was a catch. Several catches.

    The baseline, it turned out, spent an awful lot of time doing… kernel dispatch. Even getting 3.91x speedup on SDPA prelude led to an end-to-end model speedup of… 1.007x. Not quite as exciting.
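
    As a back-of-the-envelope check, you can run those two numbers through Amdahl's law to see how small a slice of wall time the kernel must have been. This is a sketch that assumes nothing else changed between the two runs; the function names are mine.

```python
def amdahl(fraction, kernel_speedup):
    """End-to-end speedup when `fraction` of the runtime gets `kernel_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup)


def implied_fraction(kernel_speedup, end_to_end_speedup):
    """Invert Amdahl's law: what share of runtime did the kernel account for?"""
    return (1.0 - 1.0 / end_to_end_speedup) / (1.0 - 1.0 / kernel_speedup)


# Numbers from the experiment: 3.91x on the SDPA prelude, 1.007x end to end.
p = implied_fraction(kernel_speedup=3.91, end_to_end_speedup=1.007)
print(f"implied share of wall time: {p:.2%}")
```

    Under that assumption the prelude comes out at roughly 1% of the run, so even an infinitely fast kernel could not have moved the end-to-end number much.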

    You also had to be very, very careful about how the agents were getting speedups.

    For example, the initial correctness check accepted any output with cos_sim >= 0.95 against the reference kernel. Codex “won” the SwiGLU round by replacing sigmoid(x) with clamp(0.21*x + 0.5, 0, 1), a straight line which diverges from sigmoid everywhere except a narrow band near zero.
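
    It is easy to see why a cosine-similarity gate lets this through. The sketch below is plain Python rather than the real harness, and it assumes the sigmoid in question is the one inside the SiLU gate of SwiGLU; the input grid is arbitrary.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def silu_exact(x):
    return x * sigmoid(x)


def silu_linear(x):
    # The "optimized" gate: a clamped straight line standing in for sigmoid.
    return x * min(max(0.21 * x + 0.5, 0.0), 1.0)


def cosine_similarity(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b)


xs = [i / 100.0 for i in range(-600, 601)]  # arbitrary grid over [-6, 6]
ref = [silu_exact(x) for x in xs]
approx = [silu_linear(x) for x in xs]

cos = cosine_similarity(ref, approx)
max_err = max(abs(r - a) for r, a in zip(ref, approx))
passes_cosine_gate = cos >= 0.95
passes_elementwise = max_err <= 1e-2  # a tighter, per-element tolerance
```

    Cosine similarity is dominated by the large-magnitude outputs, where both versions agree, so the vector-level score stays high while individual elements are visibly off. An elementwise tolerance catches it.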

    It turns out this kind of thing is pretty common. The FASTKERNELS folks found a case where an agent needed to write an all-reduce kernel for cross-GPU synchronization. The test harness they were using was single GPU so the agent just no-op’d it, replacing the all-reduce with a straight tensor copy. This “passed its checker but produces the wrong sum on every scenario of our 4-rank NCCL+Gloo harness.”

    Even when the kernels are right, and fast, it doesn’t mean they are… good? Several of the generated kernels in my experiment were somewhat unshippable due to hardcoded shapes or silent global mutations. FASTKERNELS found similar things:

    “L2 failures are dominated by syntactically valid kernels that respect the per-tensor signature but violate the surrounding production contract.”

    Which I think is the academic way of saying they wouldn’t ship those either.

    If you get your verification of the problem wrong in the harness, you will get a kernel optimized for the harness. Use the wrong contract and your kernel will be wrong in exactly the shape of your wrongness.

    Still, a small win is a win, right! My original run had agents outperforming torch.compile by about 2.6%. At that point I had a friend take a look who immediately pointed out that I had hampered the compiler unrealistically, and suggested running on max-autotune. This was especially unfair since the agents each got several cracks at the problem. Turns out, with that baseline the agents lost by 4.6%.

    And, that’s pretty similar to what FASTKERNELS found. Across 88 tasks and three agents, the best of theirs landed at about 0.94x the performance of the production stack4.

    Fairly late in the day I decided to replicate the experiments I had run on the Spark on a 3090. That’s Ampere, sm_86, an elder statesman of consumer GPUs at this point. It turns out that once again, some of the wins were just worse baselines. For example, Kimi tried the same SDPA-prelude matrix stacking as on the GB10, but on Ampere the 3.91x speedup turned into a 0.74x loss. The difference was cuBLAS: it was simply better tuned for the 3090 than the GB10, and did a much better job of utilizing all 82 SMs. The baseline Kimi had to beat was (relatively) higher.

    The question of “do agents beat compilers” is hard to answer because what we are (roughly!) measuring is compiler maturity. Agents are most useful in exactly the window where a compiler is weakest: new silicon, untuned heuristics, and libraries that are still evolving5.

    As libraries improve, hardware is better understood, and compilers mature, the value of exploratory search diminishes: there are “right ways” and it’s better to just use them than create custom solutions. If an agent is identifying patterns reliably and repeatably, it may as well author a compiler pass and spend more tokens on the areas that can’t be as cleanly captured.

    1. I think these folks are associated with Tsinghua, but to be honest I am not entirely sure! ↩︎
    2. Each model ran in its respective coding harness. One fun takeaway was that the wall-time for generating the kernels was a factor too. Kimi took at least 3x longer than the other agents, spending a lot of tokens on the way, but also generated the most performant kernels of the three on every task on Blackwell, which was not what I was expecting. ↩︎
    3. Codex, KernelAgent and Dr. Kernel, the latter of which I hadn’t heard of but has by far the best name. ↩︎
    4. Codex landed at 0.94x. KernelAgent at 0.78x. Dr. Kernel got 0.53x, but still billed my insurance. ↩︎
    5. I suspect this is particularly pointed for the GB10, which is an unusual piece of hardware, and in particular has a lowish number of SMs. ↩︎
  • The elusive order of things

    SIMT offered a fantastic bargain. You write a straight-line program, the machine runs a lot of copies of it, and when one waits for memory the hardware swaps in others. You look with disdain on the less enlightened thread programmers dealing with deadlocks and concurrency etc. etc.

    Choosing what to run where and when is a scheduling problem, and there have been three effective approaches to that so far.

    You can schedule statically: decide ahead of time what all the units should do each tick. You can schedule temporally: swapping in different phases of workers via a pipeline. Or you can schedule spatially: divide the resources of the machine into different roles.

    Which one you pick tends to be determined by the hardware. A chip like a TPU spends most of its silicon on math, and fairly little on orchestrating work. That means static scheduling, and a compiler that can build you that schedule.

    Ampere and before1, and all the modern AMD chips, encourage temporal pipelining. The hardware will swap in warps (or waves) when one stalls, and by structuring your kernels into phases you can hide memory latency and keep the chips busy.

    Hopper and beyond are where spatial scheduling started mattering, in the form of warp specialization. Nvidia GPUs let you assign different register footprints to different warp groups. When you introduce warp-group scoped MMA for compute and TMA for executing data moves from a single thread you have the ingredients to divide the pipeline between groups. Instead of the same worker doing load -> compute -> store you have different workers exclusively working on different parts of the pipeline. Blackwell made this… much harder. TMEM and UMMA added new operator and memory types, so you now need to schedule movement between shared memory, tensor memory, registers, global memory, and a variety of compute units.

    The problem is: how do you do that?

    To stick with Nvidia for a moment, at the bottom of the stack are barriers. An mbarrier is a phase switch for a specific number of arrivals: one side waits, the other increases the arrival count. When the counter matches the expected number, the phase flips. It’s elegant and straightforward, and easy to get wrong. A classic example is the phase parity bug: if you screw up the wraparound the kernel can work perfectly at first, but then deadlock waiting on the wrong phase.
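
    The phase parity bug is easier to see in a toy model. The class below is a plain-Python stand-in, not the PTX semantics: one arrival per phase, a ring of stages, and a consumer that has to say which parity it is waiting on. All names are made up for the sketch.

```python
class ToyMBarrier:
    """Toy phase-flipping arrival barrier (an illustration, not real PTX)."""

    def __init__(self, expected_arrivals):
        self.expected = expected_arrivals
        self.pending = expected_arrivals
        self.phase = 0  # parity of the phase currently in flight

    def arrive(self):
        self.pending -= 1
        if self.pending == 0:
            self.phase ^= 1               # phase completes: flip parity
            self.pending = self.expected  # re-arm for the next phase

    def try_wait(self, parity):
        # True once the phase with this parity has completed, i.e. the
        # barrier has moved on to the opposite parity.
        return self.phase != parity


def correct_parity(i, num_stages):
    return (i // num_stages) & 1  # flips every time the ring wraps around


def buggy_parity(i, num_stages):
    return 0  # forgets the wraparound


def first_stuck_iteration(num_stages, num_iters, parity_fn):
    """Return the first iteration where the consumer would wait forever."""
    barriers = [ToyMBarrier(expected_arrivals=1) for _ in range(num_stages)]
    for i in range(num_iters):
        stage = i % num_stages
        barriers[stage].arrive()  # producer signals that tile i is ready
        if not barriers[stage].try_wait(parity_fn(i, num_stages)):
            return i
    return None


ok = first_stuck_iteration(4, 32, correct_parity)
stuck = first_stuck_iteration(4, 32, buggy_parity)
```

    With four stages the buggy version sails through the first four tiles and hangs on the fifth, which is exactly the "works at first" flavor of the bug: small test shapes never wrap the ring.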

    Next up, libraries like CUTLASS, and newer ones like ThunderKittens, package the patterns you tend to write. The CUTLASS Pipeline combines buffers and synchronization into a unit and makes it easy to compose common structures. This is where much of the expert-kernel-writing time goes, but that time encodes a lot of hardware-specific behavior. Hopper wants one set of patterns, Blackwell another, and even within a generation there can be differences between variants of the hardware. The more explicit the schedule is for the developer, the more they own the portability problem.

    The subsequent step is to make the schedule less explicit, while still keeping the roles visible. AsyncGraphene’s ARef is a good example of this. An ARef is a reference to asynchronously produced data. Basically, a channel, with synchronization attached. A producer writes, a consumer reads, and both sides can know when the other is done. A compiler can then plan a schedule. Nvidia’s TAWA work does this explicitly for Triton, tagging producers and consumers and lowering to ARefs. TLX on the other hand, as well as systems like PipeThreader, allow defining subtasks in a kernel that a compiler can schedule.
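
    As a loose analogy for the "channel with synchronization attached" idea, here is a single-slot channel in plain Python threads. It is not the ARef API, and the class and method names are invented; it only illustrates the handshake where each side learns when the other is done with the buffer.

```python
import threading


class ToyARef:
    """Single-slot channel: put blocks until empty, get blocks until full."""

    def __init__(self):
        self._slot = None
        self._full = threading.Semaphore(0)   # signalled when data is ready
        self._empty = threading.Semaphore(1)  # signalled when slot is free

    def put(self, value):
        self._empty.acquire()  # wait for the consumer to release the slot
        self._slot = value
        self._full.release()   # tell the consumer the data has landed

    def get(self):
        self._full.acquire()   # wait for the producer's data
        value = self._slot
        self._empty.release()  # tell the producer the slot is reusable
        return value


def run_pipeline(n_tiles):
    aref = ToyARef()
    results = []

    def producer():
        for i in range(n_tiles):
            aref.put(i * i)  # stand-in for an asynchronous load

    def consumer():
        for _ in range(n_tiles):
            results.append(aref.get() + 1)  # stand-in for compute

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

    Neither role mentions barriers or phases. That is the point: once the dependency is expressed as a channel, the waiting and signalling can be derived by a compiler instead of written by hand.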

    TileIR and CuTile also enable building an explicit graph, but through focusing on the data itself. Attaching usage information on how data is read or written gives the compiler room to bundle work into tasks and reschedule.

    Getting the graph is the starting point, but then you need to identify what the right schedule actually is. In practice this involves exploring different shapes and combinations to work out which is best. You can either do that explicitly through heuristics and cost models of the hardware, or do it via searching across many different possible schedules to find the ones that work best. Most systems do both.

    But what do we need in a kernel DSL?

    If you are building a DSL for writing kernels, the starting point is to reflect whatever the hardware does. This is not only direct, but also a necessary option because there are always smart people operating at the frontier who have a strong intuition around how to drive the most performance. They’re often targeting very new hardware which is not yet well understood (sometimes, even by the people that made it).

    Beyond that, deciding what else should be on offer means answering three questions:

    1. How do you think about portability?

    Portability doesn’t mean “write one generic kernel and get peak performance everywhere”. But it can mean: what’s the minimal amount I can express to get correctness and a particular level of performance across hardware. Projects like Helion are explicitly operating at a high level to enable rapid research iteration. Regardless of your view on where the line for “high performance” is, you need something to define what “correct” means.

    Having a good concept of a “task” seems to offer the flexibility to schedule statically, temporally or spatially, but there are a lot of edge-cases to consider.

    2. What do agents change?

    Humans are not going to be writing every, or even most, kernels. We have to figure out how much of that portability or performance is a deterministic search, versus how much is agentic loops exploring the space somewhat probabilistically. Agents make generating code massively cheaper. They can create candidates, run profiling on real hardware, test hypotheses and explore options.

    But we also need a sense of where and how the agents fail, particularly when it differs from the patterns of humans. That includes things like verbosity: more lines (generally!) means more bugs. Performance can be both spiky and somewhat subjective; sometimes small changes can reshape the kernel’s performance, and a faster kernel might only be “correct” within specific numerical accuracy bounds.

    3. How do you think about kernel boundaries?

    A lot of discussion focused on GEMMs, which is understandable. But almost all real-world kernel work is across operator boundaries. FlashAttention wasn’t making the matmuls in attention fast, it was fusing them despite a reduction in the middle.

    When we are writing programs we are expressing intent and providing direction. We mix the “what” and the “how”. This reflects a search vs expressiveness divide; search-oriented approaches want you to focus more on the what, expressiveness leans more into the how. The more the units inside kernels can compose across kernel boundaries, the more we can optimize across models and discover patterns automatically2.

    The way I think about compilers is that they encode knowledge (in the form of rules and heuristics) about hardware. The more we can move that out of our heads, or the model’s parametric knowledge, the more we can focus our time or tokens on the parts we don’t yet understand.

    1. Mostly! cp.async was introduced in Ampere and it was very impactful in making temporal pipelining work, as it let the mechanism largely hide HBM latency. ↩︎
    2. Whether via compiler, or agent! These problems tend to recurse, so you could have pretty much this whole discussion at the IR level too. ↩︎

  • Loss Exploded.

    If you want to see what a very painful couple of months looks like for an ML research team, FAIR’s logbook of the OPT-175B pretraining from 2021 should top your list. The first few runs are basically:

    • Loss exploded. 
    • Doesn’t learn.
    • Loss exploded.
    • Etc.

    At each point the team tweaks some of the hyperparameters: learning rates, weight decay, clipping and so on, as well as adjusting how the model is distributed with various parallelisms. They swapped parts of the architecture (GELU to ReLU for example), dealt with hardware failures, avoided bad data, and tried to debug what was going on.

    That was 2021 though; by now in 2026 we commonly train on tens-of-thousands of much more powerful GPUs. We have a broadly available body of knowledge on how to train massive models. The big labs are now full of serious people debating the moral meaning of perplexity with their in-house philosophers.  

    But the big labs don’t share those details. Of the frontier-ish labs, DeepSeek continue to be unusually open about their work. Their new one, DeepSeek v4, is pretty great! 1M tokens of context and close to frontier performance across multiple domains. Training, however, was… not smooth:

    “Training trillion-parameter MoE models presents significant stability challenges, and DeepSeek-V4 series are no exception. We encountered notable instability challenges during training. While simple rollbacks could temporarily restore the training state, they proved inadequate as a long-term solution because they do not prevent the recurrence of loss spikes.”

    One helpful dynamic of model training is that you can often validate what will happen in a big model by training in a small model. But not always. It mostly works for architectural tweaks and allows much more rapid experimentation and testing. But when it doesn’t work you can be in trouble: things that appear to smooth out problems at small scale can mask others at large scale where models have the capacity to learn weirder things. Fixing them late in training, when you are already into the gigawatts, hurts.

    The techniques DeepSeek used (expert routing based on stale params, and clipping) did seem to get them through training a massive model on 30-odd trillion tokens, which is an incredible accomplishment. But they are arguably bandaids, as the team readily call out:

    “Although a comprehensive theoretical understanding of their underlying mechanisms remains an open question for now, we are sharing them openly to foster further exploration by the community”

    Which has echoes of Noam Shazeer’s similar observation for SwiGLU: 

    “We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence.”

  • Unbundling Work

    Recently I had a conversation with an infrastructure team supporting an ML modeling group. The two orgs used to collaborate to ship experiments: the modeling team would come up with ideas, the infra team would augment their frameworks and build out tooling to make those ideas scalable. Together, they would ship an experiment every couple of weeks. Now the modeling team is largely making the framework changes and performance improvements themselves, thanks to coding agents, and are shipping a few experiments every single week. The infra team are still busy, but they are firefighting and debugging when the agents get stuck. The modeling team are much more productive, undeniably, and all the humans are busy, but the work for the infra team has ended up somewhat worse.

    If you are a tech CEO who has recently returned to coding, you could look at the team doing the lower-scale firefighting and think “do I need these people?” If you keep taking that question to its conclusion you eventually ask… do I need anyone to do anything at all?

    This question, helpfully, predates the term AI1. Back in the ’30s, Coase wrote his theory of the firm on why companies do some things in-house, and buy others from the market. For a brief period in the early 00s it looked like software jobs would go to the market, thanks to outsourcing. This largely didn’t happen, because, as Coase predicted, specifying a project is tough. Creating software is an iterative process; you don’t know exactly what you’re building until you start, so you need people with technical taste to be making decisions in a consistent way.

    There are a lot of Steve Jobs stories with this flavor. For one, Jobs wasn’t happy with the jiggling when holding down icons on the iPhone to remove them. The team built a UI with sliders so he could adjust the jiggle rate until satisfied. Once perfected, copying it was easy, but assembling a group that cares about those kinds of details is hard.

    One way to find those people is to train them. Gary Becker wrote about human capital back in the 60s, and in his framing some training imparts skills which are marketable, and some imparts skills which are firm-specific. Companies will pay for firm-specific training but are less keen on paying for marketable skills because rivals can free-ride on it by poaching employees once they are trained.

    From The AI Becker Problem:

    “If Company A invests time and money to turn a raw college graduate into an expert, Company B can hire that person after five years of experience for a higher salary, collecting the benefits of skills Company A paid to build. In the past, firms tolerated this risk because juniors were producing valuable work along the way. Without that value, the economic foundation of apprenticeship collapses entirely.”

    This shows up in the L3-L5 progression in big tech companies. I’ve seen many hiring managers be hesitant to hire an “industry four” as they don’t yet have the rounded, marketable skills the manager wants. But within the companies they have (effectively) apprenticed at, L4s contribute a huge amount of value. Is AI blowing up that trade-off?

    The author of the AI Becker note, Luis Garicano, recently put out another paper on AI disruption asking when AI actually displaces jobs. In Garicano’s framing jobs are bundles of tasks and responsibilities; AI’s impact depends on how tightly these tasks are tied together.

    “In a strong bundle, breaking the job destroys enough value that the job survives as a whole: AI assists but the human still sells the full service and retains a large share of revenue. In a weak bundle, the cost of splitting is small: AI replaces some tasks, the human role narrows, and the labor share falls.”

    Software engineering involves writing code, operating services, decomposing problems, and aligning with others (both project-wise and culturally). Current AI coding agents attack part of this bundle, but humans comparatively excel at social dynamics and maintaining the larger world view necessary to know which problems to focus on.

    At the senior levels the ties seem strong: you can take the coding and task breakdown out of it, but that wasn’t the main thrust of your L7-9 engineers anyway. At less vaunted levels, companies will need many fewer software engineers to churn out code than they have doing it now. But as the ML infra example earlier showed, that doesn’t necessarily mean you don’t need some of the other things they can do.

    This opens a risk for the business: if you need senior folks but don’t have enough valuable work to justify training them yourself, you are stuck paying market-rate for increasingly rare talent. Right now if you happen to have, say, scaled LLM post-training experience you can command a very significant salary. Or just start your own company.

    Hiring is hard even for deep-pocketed executives when key skills are firm-specific rather than marketable. Apple can’t go out and hire the kind of people with the taste it develops internally (generally). But how much are firms willing to roll the dice on developing the next Jeff Dean, and how much are they willing to risk someone else hiring them away?

    For a similar dynamic, look at investing. Over the past decades, much of the junior analyst work that undergirded investment firms has been replaced by automation. The structure that emerged was the pod shop, or more formally a multi-strategy hedge fund. They operate more like a platform that hosts “pods”, each led by a portfolio manager who is supported by analysts, data scientists and traders. Each pod has its own domain of speciality, and its own profit and loss. The firm centrally manages risk and allocates capital to pods. Successful portfolio managers earn a healthy percentage of the profits they generate, while unsuccessful pods are taken out behind the woodshed and shot. This both gives a talent development pipeline and a rigorous performance standard, albeit not a very collaborative one.

    This works, in part, because there is a very clear score card, measured in dollars. We might be able to copy the structure in engineering teams, but actually evaluating how well things are going is hard!

    Firms that have the highest dependence on people you can’t easily hire are exactly the ones who are at risk of struggling in this transition: they have the most need to grow their own people, and the least economic reason to do so. Apple can’t buy another Apple, and neither can anyone else.

    1. Back when people still used the term Cybernetics. AI researcher drama is literally as old as AI. ↩︎

  • Native DSL Ops in PyTorch

    You may have noticed that FlashAttention 4 was supported in PyTorch really quickly. That required a bit of new infrastructure: torch._native by Simon Layton. Prior versions of FlashAttention were written in CUTLASS/C++, but for FA4 the team implemented the kernel in CuteDSL.

    Edit: Simon kindly pointed out that the FA4 integration work predated his formalization of this pattern and was the impetus for it: long-time SDPA maintainer Driss landed the change. As always PyTorch takes a village and I am glad for everyone’s contribution!

    You wouldn’t think that using an embedded Python DSL in a Python-based ML framework would be a challenge, except that almost all of the stuff that does ML in PyTorch is in fact… not written in Python. Replacing a PyTorch operator meant shipping a new native kernel and dealing with the build and dispatch pipeline.

    Layton’s change opened the door to overriding default ops with ones authored in an embedded DSL, initially Triton or CuteDSL.

    To be clear, this is not a replacement for custom ops, which most of the time are the best way of adding a new operator. torch.library.triton_op already lets you register a custom Triton kernel, for example. But FA4 is the kind of situation where we needed an alternative: it’s the right path for newer GPUs, it’s written in CuteDSL, and the PyTorch team wanted it to be available quickly to all PyTorch users without modifying their models.

    To give an example, we can replace the built-in aten::_fused_rms_norm with a Triton version1:

    """
    Triton kernel for fused RMS normalization.
    RMSNorm(x) = x / sqrt(mean(x^2) + eps) * weight
    """
    import triton
    import triton.language as tl
    import torch

    @triton.jit
    def _rms_norm_fwd_kernel(
        X_ptr,
        W_ptr,
        Y_ptr,
        RRMS_ptr,  # reciprocal RMS, saved for backward
        stride_x_row,
        N_COLS: tl.constexpr,
        eps: tl.constexpr,
        BLOCK_SIZE: tl.constexpr,
        HAS_WEIGHT: tl.constexpr,
    ):
        # [...]
        tl.store(RRMS_ptr + row_idx, rrms)

    def triton_rms_norm_forward(
        x: torch.Tensor,
        normalized_shape: list[int],
        weight: torch.Tensor | None,
        eps: float | None,
    ) -> tuple[torch.Tensor, torch.Tensor]:
        """Fused RMSNorm forward pass using Triton."""
        # [...]
        return y.reshape(orig_shape), rrms
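
    Since the kernel body is elided here, a tiny reference for a single row can be handy when checking whatever you drop in its place. This is a plain-Python sketch of the formula in the docstring, not the Triton code; the function name and the default eps are my own choices.

```python
import math


def rms_norm_reference(row, weight=None, eps=1e-6):
    """RMSNorm for one row: x / sqrt(mean(x^2) + eps) * weight.

    Returns (y, rrms) to mirror the (output, reciprocal RMS) pair that the
    forward function above returns. The default eps is an assumption.
    """
    mean_sq = sum(v * v for v in row) / len(row)
    rrms = 1.0 / math.sqrt(mean_sq + eps)
    if weight is None:
        weight = [1.0] * len(row)  # no learned scale
    y = [v * rrms * w for v, w in zip(row, weight)]
    return y, rrms


y, rrms = rms_norm_reference([3.0, 4.0], eps=0.0)
```

    For the row [3, 4] the mean square is 12.5, so the reciprocal RMS is 1/sqrt(12.5) and every element is scaled by that before the weight is applied.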

    Actually hooking it up requires calling a DSL-specific op override function, in this case triton_utils.register_op_override. This goes directly into the dispatch architecture, which means it works with autograd, torch.compile and so on.2

    """
    Register a Triton-based RMSNorm as a native op override using torch._native.
    """
    from torch._native import triton_utils

    def _triton_fused_rms_norm(dispatch_keys, x, normalized_shape, weight, eps):
        """
        Wrapper that lazily imports the Triton kernel on first call.
        """
        from triton_kernels import triton_rms_norm_forward
        return triton_rms_norm_forward(x, normalized_shape, weight, eps)

    def register():
        """Register the Triton RMSNorm override."""
        triton_utils.register_op_override(
            "aten",  # lib_symbol: override an aten op
            "_fused_rms_norm",  # op_symbol: the specific op
            "CUDA",  # dispatch_key: only on CUDA
            _triton_fused_rms_norm,  # impl: our wrapper
            unconditional_override=False,  # receives dispatch_keys as first arg
        )

    Now when we call torch.ops.aten._fused_rms_norm(x, shape, weight, eps) PyTorch will automatically use our Triton override!3

    The unconditional_override param in the registration call is a helpful one: if false the function receives torch.DispatchKeySet as its first argument. This allows overriding only in specific circumstances. For example, our Triton kernel is faster than the C++ one only for larger shapes, so we could gate the decision on that:

    def _smart_rms_norm(dispatch_keys, x, normalized_shape, weight, eps):
        n_rows = x.numel() // normalized_shape[-1]
        if n_rows < 4096:
            # Fall back to default C++ kernel for small shapes.
            return torch.ops.aten._fused_rms_norm.default(x, normalized_shape, weight, eps)
        from triton_kernels import triton_rms_norm_forward
        return triton_rms_norm_forward(x, normalized_shape, weight, eps)

    Going back to FlashAttention 4, this overrides aten::_scaled_dot_product_flash_attention so that any code using torch.nn.functional.scaled_dot_product_attention will transparently get the FA4 implementation.

    torch._native fundamentally lowers the barriers to entry for bringing new kernel implementations into PyTorch. That’s good for mainline PyTorch, and it also allows ML infrastructure teams to ship optimized kernels for new hardware without waiting for the PyTorch release cycle.

    1. Trimmed for length; ask your favorite coding agent to write you an RMSNorm kernel, or look at the gist. ↩︎
    2. To avoid hammering the import latency, all DSL runtimes are lazily loaded when the kernel is first called ↩︎
    3. In this case we only override for CUDA tensors, so CPU ops will continue to be handled by the default implementation. ↩︎
  • Programming The Loop

    When you have enough AI, what do programmers… do? When it was smart autocomplete (e.g. Copilot), that was pretty clear: everything! The AI handles some typing. When it was interactive IDEs (e.g. Cursor) it was still a lot: pair programming, designing, writing the hardest parts. Now that it’s an independent agent (e.g. Claude Code), it’s guiding, reviewing code, and setting guardrails.

    But, you know, we want to move faster than that! That means either we have the agent running in a loop without needing us, or we have lots of agents doing things at the same time1. Or both.

    Throwing agents at a problem doesn’t automatically solve it2. Which leads back to the question: “what do we do?” The answer seems to be not so much being in the loop but designing the loop itself.

    The most viral agent loop right now is Karpathy’s Autoresearch, which finds verifiable training improvements to his nanochat project. Running Autoresearch is straightforward: A human writes program.md with workflow guidelines, the agent runs in a loop trying ideas. Karpathy’s workflow allocates a fixed compute budget and constrains edits to a single training file to ensure the experiments are valid3. The agent generates ideas, verifies them, then refines: keeping the new baseline and discarding failed ideas.

    While Karpathy’s agent-in-a-loop is responsible for both generating and implementing ideas, PyTorch’s KernelAgent4 goes multi-agent, giving each agent a specialized role and toolset for improving GPU kernel performance. A profiling worker identifies opportunities, an analyzer agent suggests potential fixes, and so on. The actual execution is best-of-N sampling as an agent loop: it spawns N workers, lets them race, then plans a strategy for the next round.

    “Optimization agents reflect on what succeeded and failed in each round, summarizing insights into a shared memory that guides subsequent iterations and prevents repeated dead ends.”

    KernelAgent – PyTorch Blog

    The pattern that seems to work is to set up agents in a generate-verify-refine loop, following a pre-defined work approach, with guardrails. If you need more parallelism, add multiple agents, but keep state central to avoid communication overhead.
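    As a rough illustration of that generate-verify-refine shape, here is a toy sketch. Everything in it is made up for the example: run_loop, propose and evaluate are hypothetical names, the "idea" is just a learning rate, and the "loss" is a bowl centred on 0.3. It is not how Autoresearch or KernelAgent are implemented, only the skeleton of the loop.

```python
import random


def run_loop(baseline, propose, evaluate, budget):
    """Generate-verify-refine: keep a candidate only if it verifiably beats the baseline."""
    best, best_score = baseline, evaluate(baseline)
    history = []  # central state: what was tried and whether it helped
    for _ in range(budget):  # guardrail: fixed iteration budget
        candidate = propose(best, history)  # generate
        score = evaluate(candidate)  # verify
        improved = score < best_score
        history.append((candidate, score, improved))
        if improved:  # refine: promote to the new baseline, otherwise discard
            best, best_score = candidate, score
    return best, best_score, history


# Toy stand-ins for the agent and the training run (purely illustrative).
rng = random.Random(0)


def propose(current, history):
    return max(1e-4, current + rng.uniform(-0.1, 0.1))


def evaluate(lr):
    return (lr - 0.3) ** 2


best, loss, history = run_loop(1.0, propose, evaluate, budget=200)
```

    The design work is all in propose, evaluate and the budget: the loop itself is trivial, which is rather the point.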

    An example of the latter is OpenAI’s Symphony. This moves state into a task tracker then spawns5 individual Codex agents with a fixed budget of iterations. Individual agents write back to the tracker to save state. This type of agent usage is also known as a “Ralph loop”: agents that start fresh for each iteration of the loop, with necessary context injected each time rather than accumulating organically in the context window.

    Much like with Karpathy’s program.md you “program” the WORKFLOW.md with how you want the loop to run, then it executes autonomously.
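    A minimal sketch of that "start fresh, inject context" idea, under heavy assumptions: the tracker is a plain list of dicts, fake_agent stands in for a real agent call, and none of the names (ralph_loop, build_context) come from Symphony. The only thing it tries to show is that the prompt is rebuilt from the tracker on every iteration, and that the tracker is the only place state survives.

```python
import json


def build_context(workflow, task):
    """Fresh prompt each iteration: workflow rules plus saved task state, nothing else."""
    return workflow + "\n\nTask state:\n" + json.dumps(task, sort_keys=True)


def ralph_loop(tracker, workflow, agent_step, max_iters):
    for task in tracker:
        for _ in range(max_iters):  # fixed iteration budget per task
            if task["status"] == "done":
                break
            context = build_context(workflow, task)  # no carried-over conversation
            update = agent_step(context)  # hypothetical agent call
            task["notes"].append(update["note"])  # write state back to the tracker
            task["status"] = update["status"]
    return tracker


def fake_agent(context):
    # Stand-in for a real agent: it "finishes" once two earlier notes exist.
    state = json.loads(context.split("Task state:\n", 1)[1])
    done = len(state["notes"]) >= 2
    return {
        "note": f"attempt {len(state['notes']) + 1}",
        "status": "done" if done else "open",
    }


tracker = [{"id": 1, "status": "open", "notes": []}]
ralph_loop(tracker, "WORKFLOW: make the tests pass.", fake_agent, max_iters=5)
```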

    Designing the workflow feels like a genuinely different skill. It’s not writing the code, the agent does that. It’s not specifying the solution either; in many cases the agent does that too! It’s about designing an approach: how can the agent make progress with each turn of the crank? How can the environment give clean validation signal to the agent about its approach? Not easy, and not quite what we used to do either.

    1. AKA agent teams, swarms, or whichever Mad Max movie Yegge is on today. Google’s new “Towards a Science of Scaling Agent Systems” paper is not keen on multi-agent systems though: “on tasks requiring strict sequential reasoning […] every multi-agent variant we tested degraded performance by 39-70%”. ↩︎
    2. Condolences if your executives are currently pushing that as company strategy A. ↩︎
    3. Mostly: it did engage in a bit of seed hacking, so has achieved postgrad status successfully. ↩︎
    4. Disclaimer: folks in my team worked on this! ↩︎
    5. It also uses Elixir for coordination in the sample code, which brings warmth to my heart. Hello Joe. ↩︎
  • Cutie Fly

    The FlashAttention 4 paper is out and is fascinating, you should read it! One of the things that Tri called out on Twitter was that the experience of using a Python-based language (CuteDSL) significantly improved the dev loop, not just for him, but for Claude:

    CuTe’s layout algebra plus the quick iteration cycle of a Python DSL are a nice combination. Hence, it’s not too surprising that late last month, AMD dropped FlyDSL, which is, largely, CuteDSL for AMD. This is not a knock on FlyDSL! The project is very open about acknowledging CuTe and its provenance.

    To help navigate, here is a handy translation guide:

    • CuteDSL: cute.make_layout.
      FlyDSL: flir.make_layout.
    • CuteDSL: cute.composition.
      FlyDSL: flir.composition.
    • CuteDSL: cute.zipped_divide.
      FlyDSL: flir.zipped_divide.
    • CuteDSL: cute.make_tiled_copy_tv.
      FlyDSL: flir.make_tiled_copy_tv.

    FlyDSL also calls out Colfax’s paper from earlier this year: Categorical Foundations for CuTe Layouts. This paper, along with the Integer Set Relations one from Nvidia last year, really started to establish a mathematical formalization of what had been going on in CuTe layouts. This kind of foundation enables verifying the approaches taken in fresh implementations, like FlyDSL’s.

    We can actually go see that, as the whole compiler is open source. This lets you compare the composition_impl in FlyDSL to the diagrammatic version (section 4.1.3) in the Colfax paper to understand why it works!1

    Given the blistering pace of layout algebra, we shouldn’t be surprised that just a few days after, Cris Cecka of Nvidia dropped a beastly preprint: CuTe Layout Representation and Algebra:

    Colfax Research [19] analyzes CUTE layouts and some operations on them in the context of category theory. In this paper, we intend to provide a more definitive and formal treatment of CUTE concepts and their applications.

    Sometimes with this kind of thing it doesn’t matter if you have the idea: it’s often specific implementations of it that end up defining the standard. I read this paper as Cecka planting his flag and saying “y’all, this is the real CuTe”. And he cuts no corners.2

    Cecka reframes layout algebra as a theory of loop transformations, showing that the objects you are transforming (Shapes Strides) and the things you are transforming them with (Shapes Strides) are the same.3

    One of the cleverest results of this is in Section 2.3.1. Cecka demonstrates that strides don’t have to be just… regular strides. If your stride is in fact a coordinate then each “step” in the stride moves in the coordinate dimension.

    This is, for example, what you need for TMA on Hopper or Blackwell: you tell it the logical position in the tensor and it figures out the physical address internally, handling tiling, swizzling and bank conflict avoidance in hardware. If you stride over coordinates, you can use exactly the same layout algebra as for your computations.
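    To make the coordinate-stride idea concrete, here is a small sketch in plain Python. It treats a layout as the function coord ↦ Σ coordᵢ · strideᵢ and lets a stride be either an integer or a tuple. The layout helper is a made-up illustration for flat (non-nested) shapes, not the CuTe or FlyDSL API, and it says nothing about how TMA descriptors are actually built.

```python
def layout(shape, stride):
    """Return f(coord) = sum_i coord[i] * stride[i]; strides may be ints or tuples."""

    def scale(c, d):
        return tuple(c * x for x in d) if isinstance(d, tuple) else c * d

    def add(a, b):
        return tuple(x + y for x, y in zip(a, b)) if isinstance(a, tuple) else a + b

    def f(coord):
        assert all(0 <= c < s for c, s in zip(coord, shape)), "coordinate out of range"
        terms = [scale(c, d) for c, d in zip(coord, stride)]
        total = terms[0]
        for term in terms[1:]:
            total = add(total, term)
        return total

    return f


# Ordinary integer strides: a 4x8 row-major tile maps (row, col) to a linear offset.
row_major = layout((4, 8), (8, 1))

# Coordinate strides: each step moves along a basis vector of the tensor's
# coordinate space, so the layout emits a logical (row, col), not an address.
coords = layout((4, 8), ((1, 0), (0, 1)))

# Same machinery, coarser steps: origins of 2x4 tiles in that coordinate space.
tile_origins = layout((2, 2), ((2, 0), (0, 4)))
```

    Here row_major((2, 3)) is the offset 19, coords((2, 3)) is the coordinate (2, 3), and tile_origins((1, 1)) is (2, 4): one function shape, two kinds of codomain.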

    Another example was that if a Stride is a bitmask you get something very like Triton’s LinearLayouts!3 That lets you compose layouts with swizzling, using the same composition operators again.

    The paper is full of these interesting, but also practical, results. Cecka gives guidance on optimizations like ‘avoid ranged slicing’ (e.g. a[2:4, 1:3]), as it mixes up tile size (an optimization knob) and thread ID (a runtime index)4, or use layouts to algebraically work out how to auto-vectorize loads and stores rather than hard-coding5.

    There is something satisfying about a paper on composition that itself pulls together ideas floating around CUTLASS internals, preprints, and alternative implementations, then shows they are all views of the same object. This will help projects like FlyDSL, Triton, and any number of other authoring libraries ground their management of one of the most painful aspects of kernel dev in a way that should make life easier, for everyone.

    1. I think! My understanding of category theory is similar to my understanding of Skibidi Toilet: I get the idea, but I have so many questions. ↩︎
    2. As an example, Cecka provides a wider generalization than the Colfax paper, demonstrating that CuTe layouts are not strictly closed under group composition: you can’t always compose layouts however you want. But! The failures correspond to real errors, which is the kind of restriction you actually want. ↩︎
    3. Actually, you do a bit better: being strictly in F₂ means Linear Layouts are limited to powers of 2, which it turns out is a bit limiting. ↩︎
    4. This makes it harder for compilers to separate static and dynamic elements. CuTe, and Fly, do this in two stages: zipped_divide to tile, then slice by a dynamic bid, allowing the compiler to optimize (e.g. constant-fold) the static tile parameter. ↩︎
    5. By composing the layout with the right-inverse of the other, apparently! Or calling max_common_vector(src_layout, dst_layout) ↩︎
  • Perplexed

    The normal loss when pre-training a language model is Cross-Entropy, which sounds more complicated than it is. As it generates a token, the model doesn’t just predict a token, it predicts a probability distribution across all possible tokens. Cross Entropy loss is -log(probability of the correct token) from that distribution.

    • If p(correct) = 0.99 → CE ≈ 0.01
    • If p(correct) = 0.5 (unsure between two tokens) → CE ≈ 0.693
    • If p(correct) = 1/100_000 (e.g. guessing uniformly) → CE ≈ 11.5

    If you average the CE over a whole bunch of tokens (say in your validation set) and take e^(ave CE), you get the perplexity, or PPL.

    The number gives you an idea of how many choices the model was considering. Perplexity of 1 means the model was always 100% sure and 100% right (a feat only Elon can achieve). PPL 2 means the model was flipping a coin between two tokens most of the time. PPL 50 means the model was uncertain between 50 plausible next tokens. Because you’re already calculating the loss, PPL is very cheap to compute, so it gets used a lot.
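    In code the whole thing is a couple of lines. This is a minimal sketch using only the standard library; cross_entropy and perplexity are names chosen for the example, and the inputs are the probabilities the model gave to the correct token at each position.

```python
import math


def cross_entropy(p_correct):
    """Per-token loss: -log of the probability assigned to the correct token."""
    return -math.log(p_correct)


def perplexity(p_corrects):
    """exp of the mean per-token cross-entropy."""
    mean_ce = sum(cross_entropy(p) for p in p_corrects) / len(p_corrects)
    return math.exp(mean_ce)


coin_flip = perplexity([0.5] * 10)  # always torn between two tokens
fifty_way = perplexity([1 / 50] * 10)  # always torn between fifty tokens
```

    coin_flip comes out at 2 and fifty_way at 50 (up to floating point), which is the "how many choices was it weighing" reading of the number.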

    Prior to pre-training you’ll typically run a sweep of experiments of different architecture tweaks, and see which lower perplexity. During pre-training you’ll want to check whether the model is successfully learning, whether you should nuke a run rather than continuing: improvements in perplexity are a good guide to that. You can also score perplexity on fresh data using a well-trained model: data with a surprisingly high perplexity might be garbage, or a counting subreddit.

    Still, you can have too much of a good thing. A new paper from Veličković et al., “Perplexity cannot always tell right from wrong”, makes the argument that, much like with humans, it’s very easy to select for confidently wrong rather than uncertainly right.

    We prove that, for a wide class of decoder-only Transformer-based language models, should the model be highly confident and correct on a sufficiently long input sequence, this must imply existence of another input where the model’s prediction is wrong, yet the log-perplexity of that prediction approaches *zero*

    The basic idea is that when the model is confident, you can construct a different sequence that the model would be equally confident on but also… wrong.

    This particularly shows up when contexts get longer, because all tokens are not equal. To give a trivial example:

    In the word "strawberry," there are 8 Rs.

    This is correct for every single token, except ‘8’. A highly confident model may have a lower perplexity for that sequence, as a whole, than a more correct but less confident one.
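    A quick worked version of that, with invented numbers: assume the sentence is 11 tokens long, a confident model gives 0.99 to ten of them and still a healthy 0.30 to the wrong ‘8’, while a more hesitant model gives 0.70 to every token of the correct sentence. The token count and probabilities are assumptions for illustration, not measurements from any model.

```python
import math


def perplexity(probs):
    """exp of the mean negative log-probability over the sequence."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))


# Hypothetical per-token probabilities (made up for this example).
confident_wrong = [0.99] * 10 + [0.30]  # ten near-certain tokens, one wrong digit
hedged_right = [0.70] * 11  # correct sentence, but the model hedges throughout

ppl_wrong = perplexity(confident_wrong)
ppl_right = perplexity(hedged_right)
```

    ppl_wrong is roughly 1.13 and ppl_right roughly 1.43, so ranking by perplexity prefers the sentence with the wrong answer: one bad token is averaged away by ten easy ones.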

  • The model can probably write the code

    The current vibes in software engineering are a mix of crushing despair at years of accumulated personal skills being displaced by the CEO prompting some stuff, and crushing despair at years of corporate investment in an existing codebase that isn’t vibe-y enough. People worry whether the models will be effective in their programming language of choice, not on some general benchmarks.

    One angle to approach that is to ask how well the language is covered by the distribution of the training data1. An interesting paper the other day gave a pretty clear idea of how to check: 1-shot some prompts against the base model and see if they ever get it right. Getting access to base models is not always possible, but you can certainly call the post-trained models with roughly the same idea: no tools, no iterations, just generate this program.

    To try this, I2 wrote up 20 Project Euler-like3 puzzles of varying difficulties and had a few different models YOLO solutions in several languages. These ranged from common ones like Python to fairly rare ones like Zig and Hack.

    After validating all the solutions, we can calculate some stats using pass@k: in k trials, how often did the model solve the problem at least once. Here are some stats for pass@1: what % of the time can you expect the model to one-shot the solution:

    Lang        GPT-4.1 Mini  Gemini 3 Flash  OLMo 3.1  Kimi K2.5  GLM-5
    Python      .93           .99             .72       .97        .98
    TypeScript  .94           1.00            .43       .95        .95
    Go          .95           .91             .46       .86        .86
    Rust        .89           .94             .43       .95        .95
    Kotlin      .90           .99             .29       .91        .93
    OCaml       .76           .86             .08       .94        .90
    Zig         .14           .55             .00       .79        .88
    Hack        .43           .76             .05       .47        .68

    And here is the same thing for pass@128: what is the chance it is right at least once in 128 samples:

    Lang        GPT-4.1 Mini  Gemini 3 Flash  OLMo 3.1  Kimi K2.5  GLM-5
    Python      1.00          1.00            .95       1.00       1.00
    TypeScript  1.00          1.00            .90       1.00       1.00
    Go          1.00          1.00            .85       1.00       1.00
    Rust        .95           1.00            .88       1.00       1.00
    Kotlin      1.00          1.00            .59       1.00       1.00
    OCaml       .98           1.00            .38       1.00       1.00
    Zig         .49           1.00            .05       1.00       1.00
    Hack        .99           1.00            .46       1.00       1.00
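    For anyone wanting to reproduce numbers like these, here is one common way to compute pass@k from n samples per problem, of which c were correct: the unbiased estimator 1 − C(n−c, k) / C(n, k), averaged over problems. This is an assumption about method; the tables above may have been computed differently, and pass_at_k and mean_pass_at_k are names picked for the sketch.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate P(at least one of k samples is correct), given c correct out of n drawn."""
    if n - c < k:
        return 1.0  # too few failures to fill k slots: every k-subset has a success
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average over problems; results is a list of (n_samples, n_correct) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical per-problem tallies for one model in one language.
results = [(128, 128), (128, 64), (128, 1), (128, 0)]
p1 = mean_pass_at_k(results, 1)
p128 = mean_pass_at_k(results, 128)
```

    The gap between p1 and p128 is the interesting part: a problem solved once in 128 tries barely moves pass@1 but counts fully toward pass@128.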

    To make that a bit more visual, here is a per-language chart for GPT-4.1-mini:

    Line graph showing pass@k curves for various programming languages with k (number of attempts) on the x-axis and pass rate averaged across problems on the y-axis. Languages include Python, TypeScript, Go, Rust, Kotlin, OCaml, Zig, and Hack.

    Given enough chances GPT-4.1 mini solves all the problems, in almost all the languages. Of course, we don’t actually know what GPT-4.1 was trained on, but we do know what OLMo 3.1 was trained on, thanks to the wonderful folks at AI2. That means we can see how much code-specific data for each language there was4:

    Language    Code Corpus (GB)  Est. Tokens (B)  Category
    Python      60.40             17.3             High-resource
    TypeScript  26.52             7.6              High-resource
    Go          23.78             6.8              High-resource
    Rust        9.11              2.6              Medium-resource
    Kotlin      5.68              1.6              Medium-resource
    OCaml       1.03              0.29             Low-resource
    Zig         0.18              0.05             Low-resource
    Hack        0.00              0.00             Very-low-resource

    There is a pretty decent correlation between the presence of training data and the pass@k rates. But, importantly, it’s not 1: despite Hack having no StarCoder data and Zig negligible, the model clearly does know at least something about them. Given enough chances it has a decent chance at coming up with the correct answer for Hack, and a non-zero one for Zig:

    Line graph depicting the relationship between training data volume and average pass@k scores for various programming languages, including Python, Rust, Go, and Zig, with different markers representing pass@1, pass@10, and pass@128 metrics.
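    One way to put a number on "pretty decent correlation" is a rank correlation between corpus size and pass rate. The sketch below uses the code-corpus sizes and the OLMo 3.1 pass@1 column from the tables above, and computes Spearman’s rho by hand (Pearson on average ranks). Choosing Spearman is my assumption here; it is not necessarily the statistic behind the chart.

```python
def ranks(values):
    """Average ranks (1 = smallest); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def spearman(xs, ys):
    return pearson(ranks(xs), ranks(ys))


# Python, TypeScript, Go, Rust, Kotlin, OCaml, Zig, Hack (values from the tables above)
corpus_gb = [60.40, 26.52, 23.78, 9.11, 5.68, 1.03, 0.18, 0.00]
olmo_pass1 = [0.72, 0.43, 0.46, 0.43, 0.29, 0.08, 0.00, 0.05]
rho = spearman(corpus_gb, olmo_pass1)
```

    With eight languages this is a sanity check rather than a statistic, but rho lands at roughly 0.93: strong, and visibly short of 1.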

    We have seen for human language that models learn a language substrate, enabling them to perform strongly even on tasks they haven’t seen such as translating between unseen language pairs. I suspect something similar happens with code: despite the language differences there is a logical programming substrate, and the model doesn’t need much exposure to the language in order to generalize to it.

    Once you start giving the model multiple attempts, it gets into the right region quickly for the high-resource languages: with GPT-4.1 mini, Python, TypeScript, Go and Kotlin saturate at k=10. The less-common languages continue to rise: the model can write valid OCaml or Zig or Hack but needs more attempts to stumble into the right region.

    Thinking models flatten the curve substantially. Kimi K2.5 and GLM 5 both use high effort by default5, and that appears to give them multiple bites at the apple from internally exploring and self-correcting. By k=10 the models saturate all problems on all languages, though at the cost of a remarkable number of tokens6!

    It’s also instructive to see the ways in which the models get it wrong. There were four patterns that showed up:

    1. Ecosystem: One problem involved a sum of very large digits. GPT-4.1 Mini regularly used num::BigUint. This is a crate, not a standard language feature, and in an agentic loop would probably be a very valid choice but doesn’t strictly work. In contrast, GLM-5, a thinking model, implements digit-by-digit multiplication from scratch with Vec<u32>.
    2. API confusion: The model knows roughly what the code should look like, but chooses the wrong API. For example, OLMo generated while ... do ... in, mixing OCaml’s while...do...done loop with Haskell’s do notation and OCaml’s let...in binding.
    3. Surface-form invention: The model has a sense of how things stylistically look in the language, but doesn’t know the real API. GLM occasionally writes Zig with invented functions: std.mem.Allocator.alloc(usize, limit) (Allocator is a type, not a callable) or @intCast(usize, limit), which actually was valid syntax in earlier versions.
    4. Systematic convention gaps: Models would regularly put in <?hh for the Hack samples, which breaks in modern Hack.

    My takeaway from this is that models learn to code, not just to reproduce syntax. That means you can almost certainly post-train or prompt your way out of most programming language problems with any frontier model: while some models were still pretty poor at Zig even with a lot of tries, Gemini most certainly was not. I doubt the folks at GDM spent a whole lot of time on Zig evals7.

    A well pre-trained model has broad capabilities in programming, and it’s mostly a case of eliciting them rather than having to teach them.

    1. I’m going to take as a given that models are good at generalizing within the distribution of their training data, and poor at generalizing outside it. This is not settled! Reasonable people can disagree! But, it’s a decent starting point. ↩︎
    2. Claude. ↩︎
    3. Not actually Project Euler. I confirmed that the models never respond with an actual Euler puzzle answer in the incorrect ones, so I’m fairly (this is not good science) sure it wasn’t memorization. ↩︎
    4. OLMo’s full training corpus (Dolma v1.7) includes a massive web crawl in addition to code-specific data from StarCoder, so the 0.00 GB for Hack means “absent from code-specific training data” not “absent from all training data”. Hack documentation and other content are almost certainly present in the web crawl portion. ↩︎
    5. Gemini also reasons, but the 2.5 Flash model was doing minimal reasoning when answering. ↩︎
    6. Somehow averaging over 3k per sample for GLM, I say while ruefully staring at my OpenRouter bill. ↩︎
    7. By posting this on the internet I am guaranteed to be corrected, at length, by a Googler ↩︎
  • TileIR

    There are a lot of things folks do on GPUs (including, sometimes, graphics) so I have an approximately-correct taxonomy of operations to group them into:

    1. Dense compute: A matmul or a convolution.
    2. Map: Elementwise/pointwise work on each value of a tensor.
    3. Reduce: Process a tensor into fewer dimensions, like a sum.
    4. Transforms: Change data structure. Easy ones like transpose, annoying ones like scatter/gather.
    5. Synchronize / Communicate: Move data, or wait for it (copies, collectives, fences/barriers).

    At the moment people are pouring billions of dollars into hardware that primarily does 1. And, at the same time, many of the greatest minds of our generation are attempting to ensure that the hardware spends as much time doing 1 as possible.

    The biggest barrier to doing a lot of dense compute is 5: moving things in and out of memory. Launching kernels, transferring data between host and device (or between devices), moving data between global memory and registers and so on. It’s like there’s a box defined by Data × Footprint 1 × Time, and everyone is trying to keep it as full as possible.

    This involves tradeoffs. You want to mul as many mats as you can, but you only have so much room to store accumulators. Fetching new data from memory also takes a while. You can keep many in-flight fetches around, but each one expands the kernel Footprint, lowering occupancy.

    There are 3 tricks that we can use to help fill up the box by stitching different operations together:

    • Epilogue fusion: take an elementwise op and fuse it onto the end of a dense op, so that when the MMA produces output, the elementwise op can be run while the output data is still in registers. A classic example: fuse the activation after the dense compute in a feed-forward net.
    • Vertical fusion: take two subsequent operations and chain them together to avoid running a loop for one, writing it back, then running a loop for the other2. A classic example is Fused LayerNorm: normally you’d need to add elements, then collect stats for the normalization. You can fuse the two to collect the stats as you add the residual.
    • Horizontal fusion: doing different things over the same data, in parallel. The Q, K, and V projections in a transformer all need the exact same input, so are good candidates to fuse horizontally.
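    To see what vertical fusion buys you, here is the residual-add-plus-stats example written as plain Python loops. It is only a sketch of the loop structure (lists standing in for global memory, unfused and fused are invented names), not a kernel: the point is that the fused version touches each element once instead of re-reading the intermediate.

```python
def unfused(x, residual):
    h = [a + b for a, b in zip(x, residual)]  # pass 1: add, write h back out
    mean = sum(h) / len(h)  # pass 2: re-read h for the mean
    var = sum((v - mean) ** 2 for v in h) / len(h)  # pass 3: re-read h again
    return h, mean, var


def fused(x, residual):
    h, total, total_sq = [], 0.0, 0.0
    for a, b in zip(x, residual):  # single pass over the inputs
        v = a + b
        h.append(v)
        total += v  # collect the stats while v is still "in registers"
        total_sq += v * v
    mean = total / len(h)
    var = total_sq / len(h) - mean * mean  # E[v^2] - E[v]^2
    return h, mean, var


x = [0.5, -1.0, 2.0, 3.5]
residual = [1.0, 1.0, -0.5, 0.25]
h1, mean1, var1 = unfused(x, residual)
h2, mean2, var2 = fused(x, residual)
```

    One caveat with this particular shortcut: E[v²] − E[v]² is less numerically stable than the two-pass form, so real kernels tend to use something like Welford’s update instead.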

    You rely on the design of the hardware to enable some of this. For example, an epilogue fusion is beneficial because it’s one kernel launch instead of two, and because the work doesn’t need to be written back to global memory, but also because the epilogue can overlap with other work.

    It’s not always obvious how to put these fusions together. Flash Attention was such a breakthrough because it made dense op fusion possible. The naive attention block has a softmax in the middle: Softmax(QK^T / √d) · V. That softmax is a reduction op, which means it needs all of QK^T to be computed first, a pretty large matrix. Tri Dao and colleagues realized that if you used online softmax you could calculate the softmax for subsets of the QK matrix, and avoid materializing the whole thing. They enabled fusing the QK into the softmax and the V in one kernel, at the tile level.
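    The online softmax trick is easier to believe once you see it. Below is a sketch for a single query row in plain Python: attention_naive materializes every score before normalizing, attention_online walks over K and V a tile at a time, keeping only a running max, a running denominator and a running weighted sum. The function names and the tile size are made up for the example, and it ignores everything that makes the real kernel fast.

```python
import math


def attention_naive(q, K, V):
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(wi * v[j] for wi, v in zip(w, V)) / z for j in range(len(V[0]))]


def attention_online(q, K, V, tile=2):
    d = len(q)
    m, denom = float("-inf"), 0.0  # running max and running softmax denominator
    acc = [0.0] * len(V[0])  # running (unnormalized) weighted sum of V rows
    for start in range(0, len(K), tile):
        k_tile, v_tile = K[start:start + tile], V[start:start + tile]
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in k_tile]
        m_new = max(m, max(scores))
        scale = math.exp(m - m_new)  # rescale old state; exp(-inf) is 0.0 on tile 1
        w = [math.exp(s - m_new) for s in scores]
        denom = denom * scale + sum(w)
        acc = [
            a * scale + sum(wi * v[j] for wi, v in zip(w, v_tile))
            for j, a in enumerate(acc)
        ]
        m = m_new
    return [a / denom for a in acc]


q = [1.0, 0.5]
K = [[0.2, 0.1], [1.5, -0.3], [0.0, 2.0], [-1.0, 0.4], [0.7, 0.7]]
V = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.5], [3.0, 1.0, 0.0], [0.5, 0.5, 0.5], [2.0, 2.0, 1.0]]
out_naive = attention_naive(q, K, V)
out_online = attention_online(q, K, V, tile=2)
```

    The two outputs agree to floating-point tolerance, and the online version never holds more than one tile of scores at a time.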

    Tiles are the subsection of a matrix you’re working on at any given time. In a matmul, tiles from both input matrices are loaded and multiplied, to produce an output tile. There’s a useful image of this in the Nvidia blog post on cuTile, Nvidia’s most recent entrant into the kernel-development landscape. To side-step concerns of plagiarism, I had nanobanana plagiarize it for me:

    Illustration depicting three matrices labeled A, B, and C, showing a looping process with arrows indicating active rows and columns in a computational context.
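    The same picture as loops, in case that is easier to read than arrows. This is a plain-Python sketch of a tiled matmul: the three outer loops pick an output tile and a step along K, and the inner loops are the work done on one pair of input tiles. matmul_tiled and the tile size are illustrative only; a real kernel maps the outer loops to thread blocks and the inner ones to MMA instructions.

```python
def matmul_tiled(A, B, tile=2):
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, tile):  # which row-band of A / output tile row
        for j0 in range(0, N, tile):  # which column-band of B / output tile column
            for k0 in range(0, K, tile):  # march along the shared K dimension
                # Multiply one A tile by one B tile, accumulating into the C tile.
                for i in range(i0, min(i0 + tile, M)):
                    for j in range(j0, min(j0 + tile, N)):
                        acc = C[i][j]
                        for k in range(k0, min(k0 + tile, K)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C


def matmul_naive(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


A = [[1.0, 2.0, 3.0, 4.0, 5.0], [0.0, 1.0, 0.0, 1.0, 0.0], [2.0, 2.0, 2.0, 2.0, 2.0]]
B = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 0.0], [3.0, 1.0, 0.0, 2.0],
     [1.0, 1.0, 1.0, 1.0], [0.0, 2.0, 0.0, 3.0]]
C_tiled = matmul_tiled(A, B, tile=2)
```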

    cuTile is built on a well-specified intermediate representation called TileIR. There’s an experimental backend for Triton that lowers to TileIR too. While Triton is block-oriented rather than tile-oriented, in practice what you mostly work on in a thread-block is… a tile. TileIR elevates the tile to a first-class concept.

    You can see this by generating the same kernel against the regular backend and the TileIR backend. Triton’s intermediate representation (TTIR) uses pointer arithmetic: generating offsets, computing masks, loading from explicit addresses. Here’s a bit of an inner loop of a matmul. It groups up which data it wants, loads the tiles a and b by pointer, and computes the dot product:

    %offs_m = tt.make_range {end = 128, start = 0} : tensor<128xi32>
    %a_ptrs = tt.expand_dims %offs_m {axis = 1} : tensor<128xi32> -> tensor<128x1xi32>
    %a_ptrs_1 = tt.splat %stride_am : i32 -> tensor<128x1xi32>
    %a_ptrs_2 = arith.muli %a_ptrs, %a_ptrs_1 : tensor<128x1xi32>
    ...
    scf.for %k = %c0 to %num_k step %c1 iter_args(%acc = %zero) -> tensor<128x128xf32>:
      %a = tt.load %a_ptrs, %mask : tensor<128x64x!tt.ptr<f16>>
      %b = tt.load %b_ptrs, %mask : tensor<64x128x!tt.ptr<f16>>
      %acc_new = tt.dot %a, %b, %acc : tensor<128x64xf16> * tensor<64x128xf16> -> tensor<128x128xf32>

    TileIR on the other hand preserves the tile as a semantic object. This snippet is doing exactly the same thing, but this representation elides the pointer math and masking:

    $33: Tile[float32,(128,128)] = typed_const(value=0)
    accumulator.2: Tile[float32,(128,128)] = for k.1 in $36 (with accumulator.0 = $33)
    do
      $50: Tile[float16,(128,64)] = tile_load_token_ordered(array=A, index=($7, k.1), shape=(128,64))
      $64: Tile[float16,(64,128)] = tile_load_token_ordered(array=B, index=(k.1, $10), shape=(64,128))
      $72: Tile[float32,(128,128)] = tile_mma(x=$50, y=$64, acc=accumulator.0)
      continue $72

    This is a nice IR (compact!), but from my perspective the most interesting part is that load function: tile_load_token_ordered.

    The “time” dimension of the Data × Footprint × Time box is the hardest one to manage. Time questions separate performant kernels from slow ones: When to prefetch, how to overlap loads, and so on. Since the advent of warp specialization, the Triton compiler has been exploring pipelining options through heuristics and autotuning, and kernel engineers have been going straight to the hardware with explicit barrier API extensions like TLX3 and Gluon4.

    TileIR goes a somewhat different route. It assumes an unordered memory model: the order your code is written in does not determine when data is actually available. Instead, each memory operation returns a token and you attach semantics to it: read-only, write-only, read-write and so on.

    By being explicit about memory dependencies you give the compiler freedom to manage the Time dimension. Where accesses don’t overlap the compiler can freely reorder them. Where they do, the token chain tells the compiler exactly what depends on what. The kernel expresses intent; the compiler maps that to the hardware.
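    A toy model of why explicit tokens help, with the loud caveat that this is an analogy in Python and not TileIR’s API or semantics: treat each memory op as a node, and the tokens it consumes as edges to earlier ops. Anything with no outstanding tokens is fair game to issue now, in any order, and any topological order of the graph is a legal schedule. The op names and the ops dict are invented for the example.

```python
from graphlib import TopologicalSorter

# Each op maps to the ops whose tokens it must wait on (hypothetical kernel step).
ops = {
    "load_A": [],
    "load_B": [],
    "mma": ["load_A", "load_B"],
    "store_C": ["mma"],
    "load_A_next": [],  # no token from store_C, so it can be hoisted as a prefetch
}


def ready_now(ops):
    """Ops with no unmet token dependencies: the scheduler may issue these in any order."""
    sorter = TopologicalSorter(ops)
    sorter.prepare()
    return set(sorter.get_ready())


def one_valid_order(ops):
    """One of the many schedules that respects every token chain."""
    return list(TopologicalSorter(ops).static_order())


first_wave = ready_now(ops)
order = one_valid_order(ops)
```

    In program order load_A_next comes last, but nothing in the token graph says it has to wait: it shows up in the first wave alongside the other loads, which is the freedom the compiler gets.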

    TileIR is (mostly) targeting Blackwell right now, and the experimental backend is still early. The open question is whether we can express this smoothly enough in the syntax of kernels to actually enable taking the same kernel across hardware, or whether we are just adding some syntactic sugar that has to be worked around when doing hardware-specific tuning.

    That said, the idea feels pretty right? The tile is the unit with which we can express what we actually mean about memory, ordering, and fusion. The CUDA programming model was always about bounded-linearity within a massively parallel framework, and this loosens the bounds that little bit more.

    1. Use of available registers and shared memory, for example ↩︎
    2. This is loop fusion, in compiler terms. There are other things you can do, but this is the big one. ↩︎
    3. Triton Language Extensions, from Meta. As a disclaimer, these are the folks I work with. ↩︎
    4. From OpenAI ↩︎
  • What is In-Distribution

    One of the persistent questions in model development is whether reasoning actually involves… reasoning. As in: are we seeing actual logical conclusions, or just better recall of knowledge and patterns from the training set? LLMs are trained on, roughly, the web, which makes answering that question tricky: almost everything shows up in some form. A model that appears to “reason” through a physics problem could just be pattern-matching an irritated Reddit reply it saw during training.

    On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models takes a look at this question methodically.

    To this end, we build a fully controlled framework that isolates the contributions of each training stage. Our design is based on three principles: (i) fully controllable synthetic reasoning tasks with explicit atomic operations and DAG-defined dependency structure; (ii) observable, parseable reasoning processes enabling process-level evaluation and reducing reward or evaluation hacking; and (iii) systematic manipulation of pre-/mid-/post-training distributions to attribute causal effects to each stage.

    The authors break the problem of reasoning and training data down along two dimensions.

    1) Breadth-wise: can the model generalize from one type of problem to another (structurally similar) one in a different domain?
    2) Depth-wise: can the model reason correctly for longer, and hence solve harder problems?

    Rather than train on the internet, they build synthetic math-puzzle reasoning tasks using a dependency-graph framework inspired by GSM-Infinite. By varying the depth of the reasoning chains required, and by generating structurally equivalent tasks across different domains, they try to tease apart those two aspects and investigate them separately.

    For the breadth side the model needs to generalize, to transfer learning across domains. The paper finds that the target domain has to be “in-distribution”: the model has to have some examples in the pretraining set. They test this by using pass@128: if you give the pre-trained model 128 attempts, does it get the answer right even once? If so, you can use reinforcement learning or SFT to help the model get reliably better.

    It’s a bit like having studied Spanish at some point and forgotten albóndigas, the word for meatballs. If, for dietary preference reasons, you came to use that word regularly it would likely lodge itself in your brain more easily and you’d go from a lowish chance of getting it right to a much higher one.

    The paper is saying you must have this baseline in there to amplify with RL. Daniel Han of Unsloth describes this by saying with RL “luck is all you need”. If the model never gets the answer right, there is nothing much to reinforce (and you are stuck with paella).

    Depth, on the other hand, does seem to be something we can kinda make up in post-training. Even if a model has only been pre-trained on problems up to a certain complexity, post-training on harder problems consistently enables it to solve them. The model is able to compose more complex patterns based on the simpler ones in its training set1. To continue our tortured analogy, this is more like being reminded of several Spanish words and, over time, learning to stick them together into actual sentences.

    Practically this means your pre-training data is a bet on what the model will ever be able to reason about, and post-training refines how well and how hard it can think within those domains.

    That approach also gives a useful tool for identifying whether something is in-distribution. If you want to know whether a model can learn a new capability through post-training, check pass@128 first. If it never gets the answer right in 128 attempts, you probably have a pre-training gap, not an RL problem.
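    A minimal sketch of that check might look like the following. `sample_fn` and `is_correct` are hypothetical stand-ins for "sample one answer from the pre-trained model" and "verify the answer"; neither comes from the paper.

```python
from typing import Callable, Dict, List, Tuple


def pass_at_k(sample_fn: Callable[[], str],
              is_correct: Callable[[str], bool],
              k: int = 128) -> bool:
    """True if at least one of k sampled answers is correct."""
    return any(is_correct(sample_fn()) for _ in range(k))


def split_by_learnability(
    problems: Dict[str, Tuple[Callable[[], str], Callable[[str], bool]]],
    k: int = 128,
) -> Tuple[List[str], List[str]]:
    """Split problems into ones RL could plausibly amplify and likely pre-training gaps.

    `problems` maps a name to a (sample_fn, is_correct) pair.
    """
    amplifiable, gaps = [], []
    for name, (sample_fn, is_correct) in problems.items():
        if pass_at_k(sample_fn, is_correct, k):
            amplifiable.append(name)   # some signal exists to reinforce
        else:
            gaps.append(name)          # never right in k tries
    return amplifiable, gaps
```

    Anything that lands in `gaps` is a candidate for more pre-training or mid-training data rather than more RL.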

    1. The paper also spends a while justifying curriculum training, giving the model problems just on the edge of its capabilities before introducing harder ones. Recent work from the FAIR Paris folks and others shows you can somewhat automate this by generating problems from the same model you are training! ↩︎
  • You can just do things, but you don’t have to

    Every big software engineering team right now is racing to out-do themselves on their adoption of agentic coding practices, and ship faster. There is something more insidious going on with many of the software engineers I talk to1 though. A lot of pressure to build “more! faster!” comes from themselves.

    This shows up all over: the “you only have 2 years to escape the permanent underclass” meme2, or the various breathless LinkedIn or Twitter posts of 996’ing startups, labs, or particularly obsessive interns.

    Things that used to require teams can now be done by a sufficiently keen solo engineer with a gang of Claudes, or Codexes, or a K2 agentic swarm. That is thrilling, and it opens up the door to projects that you wouldn’t normally have bothered building. But it also opens the door to thinking you need to build those things, and that’s not quite the same.

    One of the observations of most people that take an extended leave from a large corporation is that much of the work they were doing wasn’t all that important. Either no one did it while they were out, or how they left it was… fine. Yet, much of that work somehow regains urgency as they come back to the role.

    It’s very hard to tease apart how much of your output actually matters. Coordinating a large group of people inevitably takes overhead, and so many annoying aspects of work are genuinely important. But, much like Wanamaker’s famous quote about advertising, half of the work you do doesn’t matter, the trouble is you don’t know which half.

    Adding a helpful and harmless model to the mix can certainly accelerate the rate of output, but it doesn’t do much about determining which bucket the work goes into. In fact, I’d say that the problems you take on when given a Max subscription are mildly more likely to be things that haven’t been done because they are not worth doing. The combination of increased capacity and a pervasive sense of urgency is not a great recipe for quality decision making, or for a healthy relationship with your work.

    It can be helpful to take the outsider perspective, at work or with personal projects. Would you ask someone else to do whatever you are considering, even with the expectation they would leverage agents to help them?

    It’s often easier to see the value in something, or lack thereof, if you have to convince someone else of it. That can save you from some rabbit-holes filled with a sense of obligation to “extract value” from the time you already sunk into a misguided project.

    This doesn’t mean you should ignore all of the ideas you have: you really can just do things, and you sometimes should! Just be clear about whether you want to spend your time3 that way, regardless of what the agent is doing.

    Footnote:

    1. Including myself! ↩︎
    2. I appreciate Scott Alexander’s contribution on this topic: “You only have X years to escape permanent moon ownership” ↩︎
    3. I didn’t actually quote him but everything about this article feels like a poor software engineer’s Oliver Burkeman, so you should just read him. ↩︎
  • Do MoEs Think Different?

    When I was writing recently about MoEs I was focused mostly on the architectural reasons that we use them. One thing I hadn’t considered is that they might actually be better at learning as well.

    Meanwhile, from Deconstructing Pre-training: Knowledge Attribution Analysis in MoE and Dense Models:

    Our findings reveal that MoE architectures form a low entropy backbone of consistently reinforced neurons, which leads to an early consolidation of their importance profiles and, in turn, underpins their functional robustness. This resilience, stemming from more distributed knowledge storage, contrasts with the greater brittleness and knowledge concentration in the dense model. These phenomena collectively demonstrate that architectural sparsity is not merely a computational shortcut but also acts as a useful inductive bias that fosters stable and robust learning

    To land that somewhere between academic prose and GPT-speak1 the results of the paper are suggesting that MoEs learn more effectively, and store their core knowledge more robustly.

    They measure this with Log-Probability Increase (LPI), which lets you estimate how much each column in the output projection for a layer in the model contributes to the final score. It gives you a sense of how much smarter the model gets from that specific chunk of the weights2. They track this “neuron importance” measure over multiple checkpoints using the (very!) open models from AI2, OLMo-7B and OLMoE-1B-7B.

    In the MoE the set of important weights is both more stable and stabilizes earlier in training: the model develops a core of understanding and builds on that. This might mean MoE training is genuinely more effective than dense. The dense model is regularly thrashing its core understanding as updates come in, while the MoE protects it and lets the model focus more on nuance.

    Or! It might be entirely an artifact of model differences. As the authors note the two models are quite different: different training data sets, different lengths of training, and different depths (16 vs 32 layers), as well as, you know, being an MoE or not. Finally, the actual LPI version they use3, Gated-LPI, bakes in the MoE routing. It’s not totally clear whether we are seeing “neurons that matter”, or mostly seeing “routing patterns that matter”.

    I do think4 this is likely showing something interesting, even with some skepticism. The “smearing” of knowledge across weights is how I described what we are trying to avoid with MoEs, and it may be useful to have a more mechanistic understanding of how that actually happens. The authors observe that the stability curve rises, drops and consolidates. Even if this is just an artifact of routing, it’s quite possible there is a critical phase in the training where that routing locks in.

    If that idea is right, we might already be shaping that phase. The load-balancing tricks that made MoEs practical could be doing double duty as scaffolds for learning.

    1. Sparsity is not just a shortcut — it’s crucial to learning ↩︎
    2. For a given prompt. They actually use some fairly advanced evals for this, rather than the general basic benchmarks ↩︎
    3. And created, to make it plausible to do this work! ↩︎
    4. Do not draw any research conclusions based on this website ↩︎
  • Anyone got any Veras?

    In the heady world of AI progress, context lengths have seen somewhat more languid growth. After rapid progress up to the 100-300k token range, they’ve largely stayed there for frontier models. We now have a couple of 1m token models that appear economically viable1, with Gemini and Sonnet, but Opus 4.5 (for example) stuck with the 200k window of its predecessor.

    The fundamental challenge with long contexts is the interaction between tokens, particularly in the prefill (prompt processing) phase where you have to do this for a whole lot of tokens at once before you can generate anything.

    For each token attention calculates:

    1) The key: when to use this token
    2) The value: what information this token contributes
    3) A query: what each token is looking for

    Each2 token’s query is compared against prior tokens’ keys to get weighted scores; the resulting weights mix those tokens’ values.
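    As a rough sketch of that comparison-and-mix step, here is single-head causal attention in plain Python. It is a toy: in a real model the queries, keys and values are learned projections of each token's hidden state, there are many heads and layers, and everything runs as batched matrix multiplies. Here the three lists are simply passed in.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(queries, keys, values):
    """Single-head causal attention over lists of per-token vectors."""
    d = len(queries[0])
    outputs = []
    for i, q in enumerate(queries):
        # Token i only looks at tokens 0..i (causal masking).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys[: i + 1]]
        weights = softmax(scores)
        # Mix the visible tokens' values using those weights.
        out = [sum(w * v[j] for w, v in zip(weights, values[: i + 1]))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs
```

    Note the nested loop: every token is scored against every earlier token, which is why the cost of prefill grows so quickly with context length.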

    Then, in decoding you make this calculation repeatedly. The 500th token has a new Key, Value and Query, but the 1st token’s Key and Value are the same as they were on every earlier step.

    It turns out you can save yourself a lot of work by just keeping around the Keys and Values from the previous generation step and loading them in for the prior tokens. Then you just have to compute them for the newly added token. This happens for every layer in the model, so it’s a significant amount of computation saved.
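    Here is a toy version of that trick for a single layer and head. The `project` function is a made-up stand-in for the learned Key/Value/Query projections (it just scales the input), so treat this as an illustration of the bookkeeping rather than of any real implementation.

```python
import math


def project(x, scale):
    """Stand-in for a learned projection: just scales the vector."""
    return [scale * xi for xi in x]


def attend(q, keys, values):
    """Attention output for one query over the given keys and values."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[j] for e, v in zip(exps, values))
            for j in range(len(values[0]))]


class KVCache:
    """Keeps the Keys and Values of every token seen so far."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x):
        # Only the new token's Key and Value are computed; the rest are reused.
        self.keys.append(project(x, 0.5))
        self.values.append(project(x, 2.0))
        return attend(project(x, 1.0), self.keys, self.values)


def recompute(tokens):
    """No cache: rebuild every Key and Value from scratch for the latest token."""
    keys = [project(t, 0.5) for t in tokens]
    values = [project(t, 2.0) for t in tokens]
    return attend(project(tokens[-1], 1.0), keys, values)
```

    Calling `step` once per token gives the same output as `recompute` on the full prefix, but does one projection per step instead of one per token in the context.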

    Of course, you have to stick that cached copy somewhere. Because it’s used in each round of generation it needs to be rapidly available, to avoid adding a bunch of latency. In practice that means it has to be in the high bandwidth memory, which is a scarce resource. So the longer the context, the bigger the cache, and the more memory you need to hold it.
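    To put rough numbers on that, the cache holds one Key and one Value vector per token, per layer, per KV head. The configuration below (32 layers, 8 KV heads, head dimension 128, 2 bytes per value) is a hypothetical mid-sized model picked for illustration, not any specific frontier model.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache for one sequence."""
    # 2x for keys and values; one entry per layer, per KV head, per token.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


if __name__ == "__main__":
    for tokens in (200_000, 1_000_000):
        gib = kv_cache_bytes(tokens, layers=32, kv_heads=8, head_dim=128) / 2**30
        print(f"{tokens:>9,} tokens -> {gib:.1f} GiB per sequence")
```

    With those assumed numbers a 200k context needs roughly 24 GiB of cache for a single sequence and a 1M context roughly 122 GiB, before you count the model weights or any other concurrent users.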

    Larger context windows have been unlocked in large part by more memory on the card and in a somewhat smaller part by more rapid scale-out interconnect like NVLink3.

    Meanwhile, here’s Uncle Jensen at CES, via Stratechery’s excellent analysis of the announcements:

    this context memory, which started out fitting inside an HBM, is no longer large enough. Last year we created Grace Blackwell’s very fast memory, we call fast context memory. That’s the reason why we connected Grace directly to Hopper. That’s why we connected Grace directly to Blackwell, so that we can expand the context memory. But even that is not enough, and so the next solution, of course, is to go off onto the network, the north-south network, off to the storage of the company. But if you have a whole lot of AIs running at the same time, that network is no longer going to be fast enough. So the answer is very clearly to do it different, and so we created Bluefield-4 so that we could essentially have a very fast KV cache context memory store right in the rack.

    It’s quite possible this kind of in-rack memory will unlock significantly larger context windows. I do wonder what this will mean for actually using long-context models. Dealing with multiple-million tokens of context is still going to take a bit of time to process. For the kind of interactive use cases that have worked best with LLMs (Claude Code, Computer Use, Cowork etc.) I suspect latency will be a bit of a pain point.

    What is kind of interesting is that all the providers at this point have some form of prompt caching option. Most of the time with a KV cache you build it up as you go, but in some cases you are going to actually generate the exact same cache in multiple different sessions. A good example would be a long system prompt: you can generate the KV cache for that, stick it on slower memory4 and then load it in to HBM for a new session. This can save a bunch of compute, and is very practical for a lot of use cases.
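    A toy sketch of that idea, with the same caveat as the footnote: this is not how any provider implements it. `prefill` is a hypothetical stand-in for running the model over the prompt, and `slow_store` is just a dict standing in for whatever slower memory tier holds the saved cache.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

KV = List[Tuple[int, int]]  # toy stand-in for per-token (key, value) entries


class PrefixKVStore:
    """Toy prompt cache: maps a hash of the prompt prefix to its KV entries."""

    def __init__(self, prefill: Callable[[List[int]], KV]):
        self.prefill = prefill              # expensive: runs the model over the tokens
        self.slow_store: Dict[str, KV] = {}
        self.prefill_calls = 0

    @staticmethod
    def _key(tokens: List[int]) -> str:
        return hashlib.sha256(repr(tokens).encode()).hexdigest()

    def get(self, prefix: List[int]) -> KV:
        key = self._key(prefix)
        if key not in self.slow_store:
            # Cache miss: pay for prefill once and keep the result.
            self.prefill_calls += 1
            self.slow_store[key] = self.prefill(prefix)
        # Return a copy so each session can append its own tokens.
        return list(self.slow_store[key])
```

    The lookup is on the exact prefix because each token's Keys and Values depend on everything before it, so a cached prefix is only reusable when the new prompt starts with precisely the same tokens.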

    One interesting thing this might do is enable “whole codebase” type queries: the vast majority of assets (e.g. code) in a given work session won’t change, so you could cache the KV for everything, and have it in context for later use.

    I’m hopeful that as Blackwell, TPUv7 and MI450 come online we will see context lengths unstick and move up, and perhaps with Vera Rubin we will really get rid of “compacting” for some practical set of cases.

    1. So many asterisks should go here after this flagrant assertion ↩︎
    2. Technically in most cases this is between each token’s Query and the Keys of the tokens before it, thanks to causal masking ↩︎
    3. You have to do some work to distribute things of course, but if your model is multi-card anyway, then you can distribute the KV cache fairly easily. TPUs have chonky scale-out bandwidth, probably one of the reasons Google was able to offer 1M first. ↩︎
    4. For clarity, this might not actually be how it’s implemented at Throppy/Google/OAI, they might just keep it in HBM anyway. But it feels like you could do that? ↩︎