Blog

Qwen-Image
GPT-4o’s image generation was a remarkable event, beyond the brief Ghiblification of all social media. GPT-4o offered significantly more steerability than earlier image generation models, while offering image quality in the ballpark of the best diffusion models. Qwen-Image gives a similar level of fidelity and accuracy and is an open-weights model with a pretty decent technical report: QwenLM/Qwen-Image.

While I was fairly familiar with diffusion models, I wasn’t really familiar with the backbone of this model, the multimodal diffusion transformer (MMDiT). Rather than just look at it, I vibed up a repo with Claude Code that went step by step through the architectures, training on good old MNIST. ianbarber/diffusion-edu — which spat out this:

This ended up being a helpful way to go step by step through the evolution of diffusion models.

Loss/Target

Modern image generation really kicked off with GANs. GANs were a clever idea that exploited the fact that we are better at building classifiers than generators by using one to bootstrap the other. A generator would generate an image against a reference, the discriminator would be given the generated image and the reference and have to predict which was the real one, and both networks were scored on how well they did on their tasks. This was effective, but was challenging to train. The generator also had to start from somewhere and what it effectively started from was noise: the generator would start with fairly random output and the discriminator would learn to identify noise vs the real image.

The clever idea Jonathan Ho and co had with DDPM was to focus on that noise: what if instead of learning to generate images we learned to remove noise from images? In the snippet below we:
- Pick a timestep between 0 and 1000
- Generate some noise
- Add an amount of noise to the training image proportional to the timestep
- Get the model to predict the noise, given the time step
- Calculate the loss as the mean squared error between the known noise and the predicted noise
```
# Sample random timestep
t = torch.randint(0, 1000, (B,), device=device)

# Add noise to image
eps = torch.randn_like(x0)
alpha_t = self.alpha_schedule(t)  # must broadcast against x0, e.g. shape (B, 1, 1, 1)
xt = torch.sqrt(alpha_t) * x0 + torch.sqrt(1 - alpha_t) * eps

# Predict the noise we just added
eps_pred = self.model(xt, t, cond)

return F.mse_loss(eps_pred, eps)  
```
This pretty much worked! You needed to use quite a few timesteps (around 1000), but the model would learn to discriminate noise from data. Then, you can reverse the process to generate: given a noisy starting point, generate some noise, predict the noise at the first timestep, remove it, increment the timestep, then repeat, each time adding some noise and removing.
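
The `alpha_schedule` call in the training snippet above is not shown in the post. As a rough sketch, assuming the linear beta schedule from the DDPM paper (1e-4 to 0.02 over 1000 steps), the cumulative signal fraction could be computed like this. The function names `linear_alpha_bar` and `add_noise` are made up for illustration, and the example works on a single scalar "pixel" to stay dependency-free:

```python
import math

def linear_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction alpha_bar[t] for a linear beta schedule."""
    alpha_bar, running = [], 1.0
    for i in range(num_steps):
        # Per-step noise variance grows linearly from beta_start to beta_end.
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar

def add_noise(x0, eps, alpha_bar_t):
    """Forward process for one scalar value: blend signal and noise."""
    return math.sqrt(alpha_bar_t) * x0 + math.sqrt(1.0 - alpha_bar_t) * eps

alpha_bar = linear_alpha_bar()
```

Early timesteps keep almost all of the image (`alpha_bar[0]` is 0.9999), while by the last timestep almost nothing of the original signal is left.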

Song et al. followed this up with DDIM, identifying that one of the reasons you need so many steps is that you are injecting new noise at every step. If you fix the noise up front when sampling you have a deterministic process, and can generate in more like 50 steps than 1000:
```
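x = torch.randn(*x_shape)  # Start with pure noise
B = x_shape[0]
dt = 1.0 / steps  # assumed uniform step size; not defined in the original excerpt

for i in reversed(range(steps)):
  t = torch.full((B,), i / steps)
  if target == TargetMode.EPS:
    eps = model(x, t, cond)
    x = (x - eps * dt) / (1 - dt) ** 0.5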
```
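The snippet above is a simplified update. For comparison, here is a sketch of the deterministic DDIM step (the eta = 0 case) as it is usually written: estimate the clean sample from the predicted noise, then re-noise that estimate to the next, less noisy level with the same noise. It assumes a cumulative schedule `alpha_bar` and works on scalars; `ddim_step` is an illustrative name, not something from the repo:

```python
import math

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0) for a scalar sample."""
    # Estimate the clean sample implied by the predicted noise.
    x0_pred = (x_t - math.sqrt(1.0 - alpha_bar_t) * eps_pred) / math.sqrt(alpha_bar_t)
    # Move to the previous noise level, reusing the same predicted noise.
    return math.sqrt(alpha_bar_prev) * x0_pred + math.sqrt(1.0 - alpha_bar_prev) * eps_pred

# With a perfect noise prediction, one step to alpha_bar_prev = 1 recovers x0.
x0, eps = 0.5, -1.2
x_t = math.sqrt(0.3) * x0 + math.sqrt(0.7) * eps
recovered = ddim_step(x_t, eps, 0.3, 1.0)
```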
The next step, in 2021, was Classifier-Free Guidance from Ho and Salimans. The clever idea was to pass a conditioning variable through to the model: for example in our MNIST example it could be the digit we are learning from. However, during training we would sometimes zero it out. This means the model learns to generate conditionally (for the specific digit) and unconditionally (just in whichever direction looks the best).
```
if cond is not None and self.cfg_dropout_prob > 0:
  mask = torch.rand(B, 1, 1) < self.cfg_dropout_prob

  cond = cond * ~mask  # Zero out conditioning randomly

return F.mse_loss(self.model(xt, t, cond), target)
```
This gets useful at generation time. When sampling, we can sample both conditionally and unconditionally, and diff out the unconditioned part:
```
# Run model twice: with and without conditioning
cond_pred = model(x, t, cond)
uncond_pred = model(x, t, None)

# Amplify the difference
return uncond_pred + cfg_scale * (cond_pred - uncond_pred)
```
If you imagine the sampling process as denoising, this is saying there is the “best” direction given the condition, and the “best” direction overall. By reducing the influence of the overall best direction, we get clearer steerability, and effectively the model serves as its own iterative classifier.
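
To make the role of `cfg_scale` concrete, here is the same combination applied to plain lists of numbers. This is only the arithmetic from the sampling snippet above, with `cfg_combine` as an illustrative name: a scale of 0 ignores the condition, 1 reproduces the conditional prediction, and anything larger extrapolates past it.

```python
def cfg_combine(uncond_pred, cond_pred, cfg_scale):
    """Classifier-free guidance: push away from the unconditional prediction."""
    return [u + cfg_scale * (c - u) for u, c in zip(uncond_pred, cond_pred)]

uncond = [1.0, 0.0]
cond = [2.0, -1.0]
guided = cfg_combine(uncond, cond, 3.0)  # extrapolates beyond cond
```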

Also in 2021, Song et al. published Score-Based Generative Modeling through Stochastic Differential Equations. They framed the diffusion problem as a Stochastic Differential Equation (SDE), effectively a regular differential equation dx = f(x, t)dt with an additional noise term: dx = f(x, t)dt + g(t)dw¹. The coefficient g(t) on that latter term controls how much random noise is involved.

The contribution from the paper is that they worked out how to reframe this without that dw noise, i.e. they turned it into an “Ordinary” Differential Equation (ODE) without the random component. Then the model can be viewed as a velocity field that ends up producing the same distribution of samples as the random noise version, but is deterministic.
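
For reference, here are the two forms side by side, using the same f and g as above. This is the standard statement of the result rather than anything specific to the repo; the score term, the gradient of log p_t(x), is the quantity the network ends up approximating:

```latex
\begin{aligned}
\text{SDE:}\quad & dx = f(x, t)\,dt + g(t)\,dw \\
\text{ODE:}\quad & dx = \left[ f(x, t) - \tfrac{1}{2}\, g(t)^2\, \nabla_x \log p_t(x) \right] dt
\end{aligned}
```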

Salimans & Ho were not done, and proposed another improvement to the loss with v-parameterization, introduced in their Progressive Distillation paper and later used in Imagen Video. One of the challenges with predicting the noise (eps above) is that when you get pretty close to a finished image there isn’t much noise, so the prediction isn’t particularly good. Similarly, when you are starting with pure noise it’s predicting almost everything, so also doesn’t give much information. Predicting the noise implicitly involves estimating both the clean sample and the noise. Some reordering lets you predict a single value, the velocity v, which combines the clean sample (x0) and the noise (eps), weighted by the schedule coefficients for the current timestep (alpha_t and sigma_t). By having the model predict that, we can balance between predicting the image and the noise, giving better results at the extremes.
```
v_target = alpha_t * eps - sigma_t * x0
v_pred = self.model(xt, t, cond)

return F.mse_loss(v_pred, v_target)
```
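A small numeric check of why v is a convenient target, assuming a variance-preserving schedule where alpha² + sigma² = 1 (the function names are illustrative). Given the noised sample and v, both the clean sample and the noise fall out by simple rotation, and at the two extremes v reduces to either the noise or the negated image:

```python
import math

def v_target(x0, eps, alpha, sigma):
    """Velocity target, matching v_target in the snippet above."""
    return alpha * eps - sigma * x0

def recover_from_v(x_t, v, alpha, sigma):
    """Invert the parameterization (valid when alpha**2 + sigma**2 == 1)."""
    x0 = alpha * x_t - sigma * v
    eps = sigma * x_t + alpha * v
    return x0, eps

phi = 0.3  # angle along the schedule: alpha = cos(phi), sigma = sin(phi)
alpha, sigma = math.cos(phi), math.sin(phi)
x0, eps = 0.8, -0.5
x_t = alpha * x0 + sigma * eps
v = v_target(x0, eps, alpha, sigma)
x0_rec, eps_rec = recover_from_v(x_t, v, alpha, sigma)
```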
Finally (on the loss) we get to flow matching from folks at Meta FAIR (Flow Matching) and UT Austin (Rectified Flow). Rather than making the target a schedule-weighted blend of data and noise, why don’t we just predict the straight path to the data? Compare the v_target below to the one above:
```
t = torch.rand(B, 1, 1, 1)
z = torch.randn_like(x0)

# Straight line: xt = (1-t)*x0 + t*z
xt = (1 - t) * x0 + t * z

# Learn the velocity field pointing from noise to data
v_target = x0 - z  # The straight path direction
v_pred = self.model(xt, t.flatten(), cond)  # (B,) timesteps, even when B == 1

return F.mse_loss(v_pred, v_target)
```
Flow matching models often converge faster during training and can generate good samples with fewer steps. They also tend to have more consistent quality across different sampling step counts.
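
Sampling from a flow matching model is then just integrating the velocity field. A minimal Euler sketch, using the same convention as the training snippet above (t = 1 is pure noise, t = 0 is data, and the velocity points from noise to data); `euler_sample` and the oracle velocity are stand-ins for illustration, not code from the repo:

```python
def euler_sample(velocity_fn, z, steps=8):
    """Integrate from noise (t = 1) to data (t = 0) with fixed-size Euler steps."""
    x, dt = z, 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        # xt = (1 - t) * x0 + t * z, so lowering t by dt moves x by dt * (x0 - z).
        x = x + dt * velocity_fn(x, t)
    return x

# A perfectly straight path: the velocity is constant, so even one step lands on the target.
target, z = 0.7, -1.3
oracle = lambda x, t: target - z
one_step = euler_sample(oracle, z, steps=1)
```

When the learned paths really are close to straight, the step count stops mattering much, which is one intuition for why these models tolerate few sampling steps.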

Architecture

All of that evolution was about the loss function and sampling, and we haven’t really discussed the model architecture itself. The original diffusion models used an approach called U-Nets: a series of convolutions that compressed the (latent) visual information into fewer dimensions, then expanded it back up (giving a sort of U shape). But post-ChatGPT the Transformer was ascendant, so in 2023 Peebles and Xie proposed swapping out the U-Net for a stack of transformer blocks in the Diffusion Transformers (DiT) paper.
```
class DiTTiny(nn.Module):
    def __init__(self, embed_dim=256, depth=6):
        super().__init__()
        # Patchify the image (like ViT)
        self.patch_embed = PatchEmbed(patch_size=2)

        # Stack of transformer blocks
        self.blocks = nn.ModuleList([
            TransformerBlock(embed_dim) for _ in range(depth)
        ])
        # pos_embed, time_embed and unpatchify are elided from this excerpt

    def forward(self, x, t, cond=None):
        # Convert image to patches
        x = self.patch_embed(x)  # (B, num_patches, embed_dim)

        # Add positional encoding
        x = x + self.pos_embed

        # Embed the timestep (time_embed is an assumed helper, not shown here)
        t_emb = self.time_embed(t)

        # Transform through attention layers
        for block in self.blocks:
            x = block(x, t_emb)

        # Reshape back to image
        return self.unpatchify(x)
```
This looks like a regular transformer but with patches (segments of the image) rather than text tokens, as in ViT understanding models. The transformer block will also look pretty familiar:
```
class TransformerBlock(nn.Module):
  def __init__(self, dim, heads=8, mlp_ratio=4.0):
    super().__init__()
    self.ln1 = nn.LayerNorm(dim)
    self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
    self.ln2 = nn.LayerNorm(dim)
    hidden = int(dim * mlp_ratio)
    self.mlp = nn.Sequential(
      nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
    )

  def forward(self, x, t_emb=None):
    # t_emb is accepted to match the DiTTiny call above. Adding it to every
    # token is an assumed simplification; the DiT paper uses adaLN instead.
    if t_emb is not None:
      x = x + t_emb.unsqueeze(1)
    h = self.ln1(x)
    x = x + self.attn(h, h, h, need_weights=False)[0]
    x = x + self.mlp(self.ln2(x))

    return x
```
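The `t_emb` passed to each block in `DiTTiny` is not constructed anywhere in the excerpts. A common choice, and only an assumption about what the repo does, is a sinusoidal embedding of the timestep (usually followed by a small MLP, omitted here). A dependency-free sketch with an illustrative function name:

```python
import math

def timestep_embedding(t, dim, max_period=10000.0):
    """Sinusoidal embedding of a scalar timestep into `dim` values (dim even)."""
    half = dim // 2
    # Geometrically spaced frequencies, from 1 down towards 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.cos(t * f) for f in freqs] + [math.sin(t * f) for f in freqs]

emb_start = timestep_embedding(0.0, 8)
emb_mid = timestep_embedding(0.5, 8)
```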
They got good results and more importantly it was easier to scale up to more compute and larger inputs. For what it’s worth, I found DiTs a bit tricky for training on small data sets (like the MNIST example), but didn’t spend much time on it, since:

MMDiTs emerged in 2024, and were used for Stable Diffusion 3 and Flux, largely setting the standard in terms of image quality. The idea is to process images and text in parallel with the ability to attend across each other, reminiscent of cross-encoder models.
```
class MMDiTTiny(nn.Module):
    def __init__(self, img_dim=256, txt_dim=256, *, patch_dim, depth):
        super().__init__()
        # patch_dim and depth were undefined in the excerpt, so they are
        # taken as required keyword arguments here

        # Separate encoders for each modality
        self.img_encoder = nn.Linear(patch_dim, img_dim)
        self.txt_encoder = nn.Linear(txt_dim, txt_dim)

        # Joint transformer blocks
        self.blocks = nn.ModuleList([
            CrossTransformerBlock(img_dim, txt_dim) for _ in range(depth)
        ])

    def forward(self, img, t, txt=None):
        # Process both modalities
        img_tokens = self.img_encoder(patchify(img))
        txt_tokens = self.txt_encoder(txt) if txt is not None else None

        # Bidirectional attention between modalities
        for block in self.blocks:
            img_tokens, txt_tokens = block(img_tokens, txt_tokens, t)

        return unpatchify(img_tokens)
```
MMDiT models demonstrate great prompt adherence and can handle complex requests. The bidirectional flow means the text representation is refined alongside the image tokens in every block, rather than staying a fixed conditioning signal.
```
class CrossTransformerBlock(nn.Module):
    """Cross-attention: query = image tokens, key/value = text tokens."""

    def __init__(self, dim_img, dim_txt, heads=8, mlp_ratio=4.0):
        super().__init__()
        self.q_proj = nn.Linear(dim_img, dim_img)
        self.k_proj = nn.Linear(dim_txt, dim_img)
        self.v_proj = nn.Linear(dim_txt, dim_img)

        self.attn = nn.MultiheadAttention(dim_img, heads, batch_first=True)

        self.ln_q = nn.LayerNorm(dim_img)
        self.ln = nn.LayerNorm(dim_img)
        self.mlp = nn.Sequential(
            nn.Linear(dim_img, int(dim_img*mlp_ratio)), nn.GELU(), nn.Linear(int(dim_img*mlp_ratio), dim_img)
        )

    def forward(self, x_img, x_txt):
        q = self.q_proj(self.ln_q(x_img))
        k = self.k_proj(x_txt)
        v = self.v_proj(x_txt)

        x = x_img + self.attn(q, k, v, need_weights=False)[0]
        x = x + self.mlp(self.ln(x))

        return x
```
Here, in the cross attention block the image is used for the Query part and the text for the K and V parts of the attention. The results are combined with the image input.
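
The block above is one-directional: image tokens read from text, but not the other way round. For comparison, here is a toy sketch of joint attention in the style of the SD3 MMDiT, where both token sets are concatenated and attend over each other. To keep it dependency-free it uses plain lists and skips the per-modality projections and MLPs that a real block has, so treat it as an illustration of the information flow only:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Plain scaled dot-product attention over lists of vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out

def joint_attention(img_tokens, txt_tokens):
    """Attend over the concatenated sequence, then split back per modality."""
    tokens = img_tokens + txt_tokens
    mixed = attention(tokens, tokens, tokens)
    return mixed[:len(img_tokens)], mixed[len(img_tokens):]

# The same text token ends up different depending on the image it sits next to.
_, txt_a = joint_attention([[5.0, 0.0]], [[0.0, 1.0]])
_, txt_b = joint_attention([[0.0, 5.0]], [[0.0, 1.0]])
```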

Putting this all together, you can see the evolution of the common diffusion baselines across both scale and steerability:
1. DDPM: Clean but slow. The baseline everything else improves on.
2. SD1-style (UNet + Epsilon + CFG): The first practical system. Good quality, reasonable speed, follows prompts well with CFG.
3. SD2-style (UNet + V-param + CFG): Slightly better contrast and stability, especially at high resolutions.
4. SD3-style (MMDiT + Flow): The current state-of-the-art. Fastest training, best prompt adherence, most efficient sampling.

Back to Qwen

The Qwen-Image model is a good, practical example of scaling this up. It uses an existing multimodal model² to encode text and image inputs, a pretrained VAE³ for translating between pixel and latent space, and then as its backbone an MMDiT. The use of strong (understanding) models for encoding really enhances the steerability of the results in the MMDiT.

In the MMDiT sketch above we just concatenate image and text together. In real systems we first add the positional embeddings for the image tokens, then add on text tokens. This works, but makes it difficult to adapt to different image resolutions.

Seedream introduced Scaling RoPE⁴, which instead centers the image positional encoding on the middle of the image, treats the text tokens as 2D shapes [1,L], then applies 2D RoPE to both text and image tokens. This worked better, but had some problems where the positions were confusable between text and image latents, meaning the model couldn’t properly differentiate them in some cases. The Qwen team updates this by implementing positional encoding across both dimensions of the text tokens, and concatenating the text along the diagonal of the image:

This design allows MSRoPE to leverage resolution scaling advantages on the image side while maintaining functional equivalence to 1D-RoPE on the text side, thereby obviating the need to determine the optimal positional encoding for text.
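
To make the "text along the diagonal" picture concrete, here is a sketch of how the position ids might be laid out. This is my reading of the description, and the exact index conventions (where the centering falls, where the text starts) are assumptions rather than the paper's specification. Image patches get (row, col) ids centered on the middle of the grid, and each text token gets the same id on both axes, starting just past the image:

```python
def msrope_position_ids(height, width, text_len):
    """Hypothetical MSRoPE-style ids: centered image grid, text on the diagonal."""
    rows = range(-(height - height // 2), height // 2)
    cols = range(-(width - width // 2), width // 2)
    image_ids = [(r, c) for r in rows for c in cols]
    # Text starts beyond the largest image index so the two never collide.
    start = max(height // 2, width // 2)
    text_ids = [(start + k, start + k) for k in range(text_len)]
    return image_ids, text_ids

image_ids, text_ids = msrope_position_ids(height=4, width=6, text_len=3)
```

Because each text token has identical row and column ids, the 2D encoding behaves like a 1D one along the text, which is the equivalence the quote above describes.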

The resolution independence is important for the training recipe. The model is progressively trained with images starting at 256×256 and increasing in steps up to 1328x, in a variety of aspect ratios. They follow it up with post-training consisting of SFT on organized, high-quality image-text pairs and DPO against preference pairs judged by human raters⁵. Finally, they do a GRPO stage with a “reward model”, though it isn’t clear if that’s based on the aforementioned preference data or is some other secret sauce.

While we don’t know how GPT-image is trained, this recipe certainly gave some comparable results. I was surprised to learn that the combination of a strong text and image encoding model plus MMDiT⁶ gives this level of steerability and fidelity. As usual, it’s exciting to have open models and papers to bring these concepts together!
1. It’s w because the noise is a Wiener process, also known as standard Brownian motion. I am heavily conditioned to think of this as the motion in a cup of tea thanks to HHGTTG ↩︎
2. Qwen 2.5-VL ↩︎
3. Interestingly, a video one from Wan-2.1 ↩︎
4. Roughly the same idea was already around as column-wise position encoding, as I understand it. ↩︎
5. The same prompt with two different seeds, and, if present, a reference image ↩︎
6. And a lot of very carefully curated and programmatically generated data, to be fair ↩︎
September 29, 2025

Modelling
Automation & Managerial Control
There’s a chart making the rounds that caused Tim Lee over at Understanding AI to rewrite his recent (excellent!) article about the impact of AI on jobs. Stanford’s Erik Brynjolfsson and colleagues found¹ that young workers in AI-exposed jobs² have seen their employment drop by 13% since ChatGPT arrived. Meanwhile, their older colleagues in the same fields are doing just fine.

[…] the youngest workers saw dramatic job losses—but only if they worked in occupations (like accountants or computer programmers) that were highly exposed to AI. Young workers in less exposed occupations (like nurses or construction workers) saw normal employment growth over the same period.

From a tech industry focus, it’s a little hard to disentangle the impact of reduced hiring after layoffs³ from the growth of AI, but likely both had an impact. AI coding agents are making it easier to complete the kind of introductory tasks that might have been left for junior engineers.

New grads don’t just do simple tasks though, they grow and develop tacit knowledge of their company and industry, which raises the question of whether this is permanent disruption or temporary dislocation as the skills needed shift. As Tim calls out:

It’s important not to read too much into this research. Workers between the ages of 22 and 25 are a small slice of the job market, and their employment has always been more volatile than for older workers. When I graduated with a computer science degree in 2002, the economy was just emerging from the recession that followed the dot-com bubble. It was a hard time for a young adult to get their first programming job, but most of my peers eventually found work in the field.

To give an analogy, there was a time when becoming a junior programmer meant learning how to write fast code as cycles were too important to waste. Now, writing particularly efficient code is largely the preserve of specialist, more senior people: some folks opt in to that route early because of their personal interests, but in general raw performance of code is not the blocking factor to building something valuable.

My sense is we are seeing the same thing in terms of general “program composition”: senior folks with experience on large, collaborative projects can benefit from LLM automation as they understand how to put in the right project guardrails and how to translate needs into technical direction. Junior people are still mostly trained how to write working code, and that need has become less pressing as LLMs have proved moderately competent at it.

Rodney Brooks, the robotics legend, made a point back in 2018 that stuck with me: it’s not automation that disrupts workers, it’s digitalization. In his article, Brooks wrote:

Digitalization is replacing old methods of sharing information or the flow of control within a processes, with computer code, perhaps thousands of different programs running on hundreds or thousands of computers, that make that flow of information or control process amenable to new variations and rapid redefinition by loading new versions of code into the network of computers.

An example that Brooks uses is bridge toll takers. This directly happened on the Bay Bridge between San Francisco and Oakland, which used to employ toll takers in booths. Then FasTrak was added, allowing drivers to pass through without interacting with anyone, while still offering cash tolls for those without. Now, between that and direct mail to people via cameras watching license plates, the tollbooths are empty.

LLMs also digitalize. Task descriptions and project documentation, for example, have been stored in human language: digital, but not particularly accessible to automation. Much of the work of managing a large bug tracking system has been in adding metadata that is accessible to automation. LLMs digitalize language, imperfectly to be sure, but enough to expose new swathes of work to automation.

High Road/Low Road

How will companies respond? Thomas Kochan at MIT has been mapping this kind of choice for years, and describes the separation between what he called the high road and low road.

The language that was used to differentiate these two approaches quickly evolved to a comparison of “high road” and “low-road” business strategies and “high-performance work systems,” which viewed labor as an asset, versus “command and control” systems, which viewed labor as a cost like any other factor of production. A comparison of the business strategies of two household names, Walmart and Costco, illustrates the differences between low-road and high-road business strategies. Walmart has been extremely successful (when judged solely on the grounds of finances and shareholder value) by pursuing a business strategy best captured by its marketing tag line: “Every day low prices.” To achieve this strategy, it places top priority on minimizing and tightly controlling labor costs, discouraging long-term tenure of its “associates,” investing little in training and development, and avoiding unions at all costs. Costco’s business strategy places a higher value on product quality and customer service, and to achieve these objectives it pays higher wages, invests more in training its work force to understand and serve customer needs, and has longer tenure patterns (and thus lower turnover costs). As a result, Costco’s employees are more productive, stay with the firm longer, and have more discretion to use their time and knowledge to solve customer problems.

Tech companies have, for the most part, been high-road employers. Employees have been an asset, and in some cases the key asset of the company. The low road though is not simply driven by cost cutting, it’s about control. Having a more fungible, replaceable workforce gives executives more options. Having more specialized, skilled workers offers the options of more flexibility in how work is done, but shifts control to the workers and away from management.

We can see this play out in some of the post-pandemic cultural changes. There is a concept in work called deskilling, where work is atomized to improve efficiency: take something which was a skill and divide it up until the individual components become unskilled. Classic examples are in factory work, where a skilled person is replaced with an operator of a machine, or more often a series of operators of a series of machines⁴. This trades a higher up-front cost in terms of capital and procedure development for a lower labor cost, transferring both money and power from workers to managers.

A recent article extended this concept to virtues, with the idea of “moral deskilling”. A virtue is a positive behavior, such as building responsibly or with high quality. Virtues tend to be individual qualities, things we recognize and reward in others: much of culture in a company is about inculcating virtues. That is inherently messy and the idea of systematizing virtue is appealing: move from a fuzzy, personal conception to a verifiable checklist or a rule that can be followed. This worked in a lot of cases, but it also enabled a form of deskilling:

Systematising virtue handed control to managers. Who, endlessly mistrusting these expert folk who were always trying to do things the expensive way, converted that mistrust into endless, endless paper work.

It was endless because it broke every little aspect of what had been virtue into tiny components. Fearful of losing control of any scrap of virtue, managers needed to relentless check on every little task.

If we want to see this play out in real-time we can look at the return-to-office mess in tech. A vibrant, collaborative office culture is a good thing, and it requires a compact. Employees would deal with the misery of a commute⁵, but in exchange they would participate in an environment where they could learn and teach, build camaraderie and so on.

When the idea of a return to office happened post-pandemic, people had found pleasure and benefit in not doing the commute. When they returned, they found the offices less vibrant, the workforce more distributed, and cost-driven reductions in space making the experience harder through shortages of meeting rooms or desks.

Compounded by a series of layoffs and a change in the prior relationship between company and employee, the in-office deal felt worse. Frustrated with the lack of the old compact, management exerted control through systems. They set required days and logged attendance through badge ins. Workers responded by treating the atomized requirements as mere requirements, not aspects of a culture: even a small percentage of folks coffee badging or trying to work from more convenient offices were visible in the empty desks, exacerbating tensions for workers “doing the right thing”.

Rather than analyze the problem and step back, management in many cases doubled down on systematizing: validating time at desk, logging badge out times or adding similar extra controls. This continued to take what had been a morally complex set of trade-offs and reduce it to a checklist. For many newer staff, that was the in-office experience.

This is the essence of the low road: prioritizing the systematized and legible over the messy, and complex, but more interesting, world of dealing with real people; prioritizing power and control over exploring new outcomes.

One way to view what’s happening is through the lens of debt, which is one of the angles in a recent position paper that frames the future of work as an AI Safety risk. Every time a company chooses to replace junior workers with LLMs rather than training them, they’re borrowing against the future. Matt Garman of AWS was pretty clear on his position:

“I was at a group, a leadership group and people were telling me they’re like we think that with AI we can replace all of our junior people in our company. I was like that’s the like one the dumbest thing I’ve ever heard […] They’re probably the least expensive employees you have. They’re the most leaned into your AI tools and like how’s that going to work when you go like 10 years in the future and you have no one that has built up or learned anything.”

But understanding something and acting on it are different things. Both the low road and high road can lead to a lot of success in business, but I do hope we can navigate this transition towards a place where the craft can be retained in software development. The question is whether enough companies will choose the messy, complex work of developing people over the appealing simplicity of trying to replace them.
1. Canaries in the Coal Mine? Six Facts about the Recent Employment Effects of Artificial Intelligence — Stanford Digital Economy Lab ↩︎
2. Like programming and accountancy, knowledge work fields that have a large amount of machine interaction ↩︎
3. As well as pandemic-driven overhiring and the end of zero interest rates ↩︎
4. Or now robots in entirely lights-out factories for sufficiently high-scale production ↩︎
5. Particularly in the SF bay area! ↩︎
September 3, 2025

Economics
A Primer on Post-Training

A Primer on LLM Post-Training – PyTorch

Very excited to see this publicly available. David moved to the PyTorch team at the start of the year, having worked on Llama, and wrote up this excellent guide for post-training internally. This is a cleaned up version of the same doc, and provides a fantastic introduction to the world of post-training for modern LLMs.

It also includes one of my favorite perverse incentive examples:

Note: this happens with humans too! We just call these Perverse Incentives, but they are literally the same thing. The British government, concerned about the number of venomous cobras in Delhi, offered a bounty for every dead cobra. Initially, this was a successful strategy; large numbers of snakes were killed for the reward. Eventually, however, people began to breed cobras for income.

The real kicker in that one came when the government realized what was happening and canceled the bounty. The folks who had been breeding cobras didn’t want to look after them any more, so just released them, leading to a lot more cobras than there had been before!

September 2, 2025

links-and-recs
Layouts
You could have invented CuTe hierarchical layout (but maybe not the rest of it?) : ezyang’s blog

Ed posted the best intro to CuTe layouts I have seen, by showing how to extrapolate them from PyTorch striding¹.

Well, it turns out, this is exactly how CuTe layouts work! In CuTe, sizes/strides are hierarchical: a size is actually a tree of ints, where the hierarchy denotes internal structure of a dimension that you can address linearly (in fact, everything by default can be addressed in a 1-D linear way, even if its an N-D object.)
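
A small sketch of the idea in the quote, with made-up helper names. A flat PyTorch-style layout maps a coordinate to a memory offset by taking the inner product with the strides; a hierarchical layout does the same thing over a tree of sizes and strides. The 1-D addressing below assumes the leftmost mode varies fastest, which is my understanding of CuTe's convention:

```python
def flatten(tree):
    """Flatten a nested tuple/list of ints into a flat list of ints."""
    if isinstance(tree, int):
        return [tree]
    return [leaf for sub in tree for leaf in flatten(sub)]

def offset(coord, stride):
    """Memory offset = inner product of the flattened coordinate and stride."""
    return sum(c * s for c, s in zip(flatten(coord), flatten(stride)))

def unravel(index, shape):
    """Turn a 1-D index into a flat coordinate, leftmost mode fastest."""
    coord = []
    for size in flatten(shape):
        coord.append(index % size)
        index //= size
    return coord

# One dimension of size 2, and one dimension of size 4 split as (2, 2).
shape = (2, (2, 2))
stride = (4, (1, 2))
offsets = [offset(unravel(i, shape), stride) for i in range(8)]
```

Here `offsets` comes out as `[0, 4, 1, 5, 2, 6, 3, 7]`: every element is reachable with a single linear index even though the second dimension has internal structure that a single stride could not express.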

Relatedly, Simon Veitner put together a quite visual explanation of layouts: https://veitner.bearblog.dev/intuition-behind-hierarchical-layouts/. The graphics are helpful once you have the baseline intuition from Ed’s post!
1. If you’re not familiar with striding, Ed’s PyTorch Internals talk/post remains the best intro! ↩︎
August 26, 2025

links-and-recs
The TPU book, on GPUs
How to Think About GPUs | How To Scale Your Model

The JAX “How To Scale Your Model” book is one of my favorite references for folks trying to get their head round pretraining¹. It breaks down the performance characteristics of model training (often using Llama 3 as an example) in an incredibly clear way. The only slight limitation is that it is primarily focused on scaling LLMs on TPUs: interesting, but probably not your main platform target (unless you work at DeepMind). They just released a new chapter covering GPUs, and it’s also a great summary².

There are also plenty of mildly snarky comments about design choices to leaven the reading:

Takeaway: in theory, NVIDIA SHARP (available on most NVIDIA switches) should reduce the cost of an AllReduce on B bytes from about 2 * B / W to B / W. However, in practice we only see a roughly 30% improvement in bandwidth. Since pure AllReduces are fairly rare in LLMs, this is not especially useful.
1. Though they include a chapter on inference too! ↩︎
2. Though if you haven’t read the rest of the book it moves pretty fast – definitely best to read through the whole thing and treat this as the appendix it is intended to be! ↩︎
August 19, 2025

links-and-recs
Extending Arcee’s FM context length

Extending AFM-4.5B to 64k Context Length

Via Nathan Lambert, an extremely fun write-up of the journey to a 64k context length for Arcee’s 4.5B foundation model. There are a lot of good takeaways, but this one particularly resonated with me:

Experimentation is Key: As in everything I write, I am unable to stress enough the importance of trying dumb things. If you try enough dumb things, eventually one of them will turn into a smart thing. Embrace the chaos.

August 13, 2025

links-and-recs
Rubrics
Pre-training is about making AI correct, post-training is about making AI helpful¹. That helpfulness is (primarily) shaped by reinforcement learning. RL for LLMs really took off with RLHF (RL from Human Feedback), which trained based on the score from a reward model.

The reward model was designed to score responses based on how well they met certain preferences, and the preferences were inferred from a set of human ratings: the graders were told what to look for in pairs of responses, and the reward model was trained to predict what they would pick. This worked, but was gated on how much signal you could get into the reward model and hence how many humans you had to generate preference data.
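
The post doesn't spell out the training objective, but reward models of this kind are commonly fit with a Bradley-Terry style pairwise loss: the model should score the response the grader picked above the one they rejected. A minimal sketch, assuming that objective and using an illustrative function name:

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Negative log-likelihood that the chosen response beats the rejected one."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))

undecided = preference_loss(0.0, 0.0)   # model has no preference yet
agrees = preference_loss(2.0, 0.0)      # model agrees with the grader
disagrees = preference_loss(0.0, 2.0)   # model prefers the rejected response
```

The loss is smallest when the model agrees with the human pick and grows as it disagrees, so every labeled pair contributes one small piece of signal.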

RLAIF (RL from AI Feedback) naturally extended this to using an LLM to make the preference picks rather than humans². Folks also started to use LLMs in an LLM-as-Judge pattern for evaluation after training: give the model a list of criteria, and ask it to rate how well the responses meet them.

The next notable step was RLVR (RL with Verifiable Rewards), which uses ground-truth data to provide reward scores instead of a model. For example, a math problem might have a defined numeric answer, or a generated proof could be verified by a dedicated theorem prover program. This turned out to work very well for code and math and led to the o-series of OpenAI models³ and many open reasoners, particularly DeepSeek R1.

It’s a pretty natural idea to take a verifiable reward pipeline and plug in AI scoring directly: rather than have a model generate preference pairs and train a separate reward model, give the model criteria and ask it how well the response satisfies them. This means instead of letting a model work out what “good code” looks like from pairs of different (but similar!) solutions to a problem, you have a model working through a checklist, asking things like “Does it have types? Does it have comments? Would your coworkers hate you if you landed this?”

These checklists are referred to as rubrics, and Snorkel have started an interesting-looking blog series introducing rubrics, which offers a definition:

A rubric is a structured guide that spells out what “good” looks like for each response from an AI system.

A rubric consists of:
- A list of criteria: Does the code compile? Does it have comments?
- How the model performed on each criterion: “Compiles” could be yes/no. It could also be more nuanced: yes/yes with warnings/no.
- Scoring rules that turn performance into numbers: Clean = 0. Warnings = 1. No = 2.
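As a rough sketch, the three parts of a rubric map onto a small data structure. The `Criterion` class, the `score_response` helper and the scores for the comments criterion are illustrative assumptions; only the compile scoring (clean = 0, warnings = 1, no = 2) comes from the definition above.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Criterion:
    question: str            # what the grader checks
    scores: Dict[str, int]   # allowed outcome -> points (lower is better here)

def score_response(rubric: List[Criterion], outcomes: Dict[str, str]) -> int:
    """Turn the recorded outcome for each criterion into a total score."""
    return sum(c.scores[outcomes[c.question]] for c in rubric)

rubric = [
    Criterion("Does the code compile?", {"clean": 0, "warnings": 1, "no": 2}),
    Criterion("Does it have comments?", {"yes": 0, "no": 1}),  # assumed scores
]
# Outcomes as a human or LLM grader might record them for one response.
outcomes = {"Does the code compile?": "warnings", "Does it have comments?": "yes"}
total = score_response(rubric, outcomes)
```

Because the outcomes are just labels, the same rubric can be filled in by a person or by a judge model.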
In Nathan Lambert’s recent interview with Ross Taylor, Taylor calls rubrics out as an underappreciated research opportunity, particularly for agentic training:

Rubrics are underhyped on social media – they were driving force behind projects like DeepResearch – and GenRMs are interesting but perhaps slightly overhyped.

This caught my eye, as Moonshot leveraged rubric based rewards heavily in Kimi K2, notably using the model they were training as the judge of itself:

The framework operates using a Self-Critique Rubric Reward mechanism, where the model evaluates its own outputs to generate preference signals. To bootstrap K2 as a competent judge, we curated a mixture of open-source and in-house preference datasets and initialize its critic capability in the SFT stage.

One of the core values of rubrics is that they work for both LLMs and humans. You can iterate on rubrics with people, scale them with LLMs, and spot-check LLM results with human raters to ensure reliability.

The paper [2507.17746] Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains formalizes them as a full peer to Verifiable Rewards. The paper sets up rubrics so each criterion is a simple pass/fail and each has a predefined importance weight. They normalize everything so the system can’t get gamed by just adding more criteria⁴, and then plug the resulting score into the RL loop⁵.
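My reading of that setup, as a minimal sketch rather than the paper's code: each criterion contributes its weight if it passes, and the sum is divided by the total weight. The function name and the numeric weights are assumptions for illustration.

```python
from typing import Sequence

def rubric_reward(weights: Sequence[float], passed: Sequence[bool]) -> float:
    """Weighted fraction of rubric criteria that passed, always in [0, 1]."""
    if len(weights) != len(passed):
        raise ValueError("need one pass/fail result per weight")
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0
    earned = sum(w for w, ok in zip(weights, passed) if ok)
    # Dividing by the total weight keeps prompts with long rubrics
    # from earning more reward than prompts with short ones.
    return earned / total_weight
```

With weights `[3, 1]` and only the first criterion passing, the reward is 0.75, and it stays bounded by 1.0 however many criteria are added.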

Of course, you actually have to write the rubrics, which leads to a specificity versus generality tradeoff: take more time to write more rubrics or rely on fewer, more general ones. The RaR paper makes it clear that more is better:

predefined generic rubrics substantially underperform compared to prompt-specific ones, underscoring the importance of contextualization. Rubrics that include a broader range of criteria—both positive and negative—consistently outperform those limited to essential checks, suggesting that richer evaluation signals lead to better learning.

As you might have guessed, the solution was more LLM: use a model to generate prompt-specific rubrics:

For each domain, the prompt (included in Appendix H) instructs the LLM to generate 7–20 rubric items based on the complexity of the input question. Each item is assigned a categorical weight (e.g., Essential Criteria, Important Criteria) to determine its importance to a correct answer. The rubrics are designed to be fully self-contained which means that non-expert readers should be able to evaluate response quality using only the rubric.

This particularly benefited from having a reference answer attached to the prompt. The models do a much better job of coming up with a good rubric if provided with a (human generated) “good” answer to judge against rather than just the question/prompt. This really opens the door to 1:1 rubrics: given questions and reference answers, you can generate a scoring checklist for each one and mix it with verifiable rewards during post-training.

The field continues to be turtles all the way down: using LLMs to write rubrics to have LLM judges evaluate LLM training outputs. At some point, someone’s going to suggest we use rubrics to evaluate how good our rubrics are, and honestly, I’m surprised that paper doesn’t already exist⁶.
1. Correct in predicting the next token, and helpful, honest and harmless, specifically. ↩︎
2. With humans still looped in to validate that the ratings were reasonable. The human graders went from generating ratings to rating the raters. ↩︎
3. This is the part where everyone pretends they know exactly how O1 works, but actually we’re all just pattern-matching from breadcrumbs ↩︎
4. Else we’d risk giving more focus to problems with more rubrics, and end up with something unthinkable like a coding model that liberally sprinkles emojis everywhere ↩︎
5. In practice, they also tried a single LLM judge that took in all criteria and weights and generated a scalar reward, which seemed to work fine. ↩︎
6. It probably does, I’m just scared to look ↩︎
August 12, 2025

Modelling, note to self
Constraints & Orchestrators
I recently read a few posts that helped connect the dots on why Python is a) so successful as the lingua franca of ML b) also seems likely to be successful in the future¹.

ML code reads like one program, but runs many: CUDA kernels, vectorized CPU loops, graph compilers and a bunch of glue code moving data around and tying things together. Python has continually improved at balancing two somewhat competing challenges: constraining the hot path so compilers can optimize it and structuring an orchestration path so humans can reason about it.

Hot Path

constrained languages are easier to optimize by Jynn Nelson touches on this:

we should not be asking “what language can i use everywhere for every purpose”; we should build meta-languages that allow you to easily use the right tool for the job. this is already true for regular expressions and query languages; let’s go further. i want inline futhark; inline CSS selectors; inline datalog; ffi between python and C that’s trivially easy. the easier we make it to interop, the easier it becomes to pick the right tool for the job.

Compilers are generally going to perform better if you have regular shapes, minimal side effects, predictable memory access and so on, but you want languages to be expressive and flexible, particularly when “research” is a big part of the work. In practice, that’s precisely what happens with ML: torch.compile lowers PyTorch graphs to an IR and (often) emits Triton kernels. Being able to hand off inner loops to specialized languages allows compilers and runtimes to optimize and target the use cases they are best at.

While this is (somewhat) clear for GPUs or other accelerators with distinctive programming models, I think it’s also largely true for getting the best out of modern CPUs. Daniel Lemire’s SEA 2025 talk covers nearly a decade of performance work and sums it up: modern CPUs do nearly as many instructions per cycle as you can feed them. To really maximize performance you need to batch work, reduce instruction counts and vectorize. We can do some of that in the general Python² runtime but dynamic dispatch, aliasing and side effects all make the job a lot harder. We can add speculative guards, which can be hard to reason about, or give up and lose performance. By having DSLs³ that add additional constraints we can give ourselves the ability to get much, much higher performance without sacrificing the overall flow of our program.

Orchestration Path

Python is unusually good as an orchestrator. From a readability perspective the language is baseline very readable and as long as libraries and DSLs stay Pythonic they tend to inherit that intelligibility. The challenge with orchestration is coordinating work in such a way that your most precious resources are well utilized. The investments in Free-Threaded Python make it a lot cheaper to do concurrency, but they don’t magically fix the challenge of coordination.

asyncio: a library with too many sharp corners covers some of the many failure modes the community has encountered with asyncio, and makes a case for Trio or AnyIO style structured concurrency that allows for manageable failure modes.

asyncio is not a good library. It is constantly full of sharp edges everywhere with implementation details leaking and poorly designed APIs forcing end users into odd code patterns to avoid fundamental flaws in the interfaces.

This is very much a readability version of the constraints concern on the hot path. Threads are a bad app abstraction over shared mutable state, reasoning about races and cancellation is hard, and primitives are always leaky. But threads are a perfectly fine implementation detail behind a more constrained API, like task groups, or actors, or so on.
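A small illustration of "threads as an implementation detail", using the standard library's `ThreadPoolExecutor` as a stand-in for a constrained API. It is not a Trio or AnyIO task group, which offer richer cancellation semantics; the `fetch_length` worker is a hypothetical placeholder for real I/O.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence

def fetch_length(text: str) -> int:
    """Placeholder for a blocking call such as a network or disk read."""
    return len(text)

def run_all(items: Sequence[str]) -> List[int]:
    # The with-block is the whole lifetime of the threads: it only exits
    # once every worker has finished, so nothing leaks past this function.
    with ThreadPoolExecutor(max_workers=4) as pool:
        # map returns results in input order and re-raises worker exceptions.
        return list(pool.map(fetch_length, items))
```

The caller never touches a thread, a lock or shared state; it hands over inputs and gets outputs back in order.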

One area that I do think needs sustained improvement is how we debug and trace across this kind of setup: it’s been challenging even in a controlled environment to really understand how all the pieces interact in a reasonably scaled ML workload, and I imagine that problem will only get worse. But I also expect that the flexibility and breadth of Python will end up a boon there as well.
1. Beyond just sheer momentum, of course. ↩︎
2. Or any language! Certainly for some optimizations having a JIT for Python would (and does) make life easier. ↩︎
3. Whether that is an embedded JIT like Triton or a library+execution engine like Polars. ↩︎
August 6, 2025

ML Infrastructure
Overthinking Everything
Yann was taking laps on Threads a few weeks back over a recent paper, one of several lately that have explored how autoregressive models do as the amount of information they are dealing with gets longer. His general complaint is that each token they generate can either push them towards the right answer or further away from it, and that the models are inherently bad at recovering if they end up too far outside the correct trajectory.

This “more might be worse” idea shows up anywhere folks are leveraging large context windows, and one of those¹ is in agentic tasks. This post summarizes some research trying to measure the fall-off in chances of succeeding as task length² increases.

Is there a Half-Life for the Success Rates of AI Agents? — Toby Ord

It provides indirect evidence that what really is going on under the hood is that tasks are made up of many sequential subtasks and the chance of succeeding at the whole requires succeeding at every individual component. Moreover, this suggests that the current AI agents are not very good at recovering from earlier mistakes.

The framing they use is a constant hazard rate: each subtask is another roll of the dice, and if you roll a failure you don’t have much chance of recovering. So more (or longer) is pretty much always worse.
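Under a constant hazard rate the success probability decays exponentially with task length, which is where the "half-life" in the title comes from. A minimal sketch, with illustrative numbers that are not taken from the post:

```python
def success_probability(task_minutes: float, half_life_minutes: float) -> float:
    """Chance of finishing a task when failure risk per minute is constant.

    The half-life is the task length at which success drops to 50%.
    """
    if half_life_minutes <= 0:
        raise ValueError("half-life must be positive")
    return 0.5 ** (task_minutes / half_life_minutes)

# With an assumed 60 minute half-life, doubling the task length
# squares the success probability rather than halving it.
one_hour = success_probability(60, 60)
two_hours = success_probability(120, 60)
```

Every extra stretch of work multiplies in another chance to fail, so a long task is punished far more than linearly.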

One interesting aspect is that they also investigate the human failure rate, which increases over time, but much more slowly:

This could indicate a different scaling behaviour of success rate with time horizon for humans compared to AI agents, which would be well worth investigating and may suggest important underlying mechanisms (e.g. that the humans were better at correcting earlier failed subtasks). If human performance scales differently with task length than AI agent performance, that would be an important result, suggesting that there is a notable inefficiency in the current AI paradigm.

They’re testing with multiple runs, so these aren’t just models hitting problems they can’t do: it’s models hitting problems they can’t do given the specific tokens they have already generated.

Agentic use cases aren’t the only situation where a model is generating responses that add to its context window. There were a lot of early observations after the release of O1 last year that thinking for longer on easy problems does not add value. This recent paper proposes not only that but suggests that there is an inverse scaling law: more time thinking makes the model worse.

[2507.14417] Inverse Scaling in Test-Time Compute

More specifically, they devised some stress tests: things like counting problems in the presence of distracting information, performing a regression where there is easy-to-understand but spurious data, and so on. The performance drops as the trace length increases. Different models are more susceptible to some failure modes than others, but performance consistently drops:

Our experiments reveal distinct failure modes across model families—Claude models are particularly vulnerable to distraction from irrelevant information, while OpenAI o-series models show greater resistance but overfit to familiar problem framings. Extended reasoning amplifies different weaknesses: models overthink simple problems, shift attention to spurious correlations, and lose focus during Deduction tasks with constraint tracking.

In contrast, Chroma’s recent Technical Report investigates how models do on single prompts, but with increasingly long contexts.

Context Rot: How Increasing Input Tokens Impacts LLM Performance | Chroma Research

Unlike in the agentic case, here all of the context is passed in at once, so the model isn’t poisoning its own context window through bad choices. It is still dealing with a large amount of content where it needs to choose which parts to attend to. Traditionally the test of long context has been needle-in-a-haystack evaluations: a relevant fact is hidden at different points in a long prompt and the test evaluates whether the model can effectively pull it out.
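For reference, a needle-in-a-haystack prompt is simple to construct. This is a generic sketch of the traditional test described above, not Chroma's harness; the function name and prompt layout are assumptions.

```python
from typing import List

def build_haystack_prompt(filler: List[str], needle: str,
                          depth: float, question: str) -> str:
    """Hide `needle` at a relative depth (0.0 = start, 1.0 = end) of the filler."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    position = round(depth * len(filler))
    sentences = filler[:position] + [needle] + filler[position:]
    return " ".join(sentences) + "\n\nQuestion: " + question

prompt = build_haystack_prompt(
    ["A.", "B.", "C.", "D."], "The code is 7.", 0.5, "What is the code?"
)
```

Sweeping `depth` and the amount of filler gives the familiar grid of retrieval accuracy by position and context length.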

The Chroma folks make the test a lot more nuanced — adding distractors³ and irrelevant content in both the broader context and the question. They find that performance consistently degrades as context increases.

More broadly, our findings point to the importance of context engineering: the careful construction and management of a model’s context window. Where and how information is presented in a model’s context strongly influences task performance

All of these papers rhyme with LeCun’s gripe about autoregressive transformers, which is (roughly!) that they (also) have a constant hazard rate on generating the “right” token.

This is a very active area of research though. Process-based rewards in RL training make updates on each step vs only at the end. Multi-token prediction reduces the effective generation length or number of chances of misprediction. Summarizing context effectively compresses existing tokens, also reducing error rate.

Similarly, if you have good verifiers⁴ you can use beam or tree searches to explore multiple reasoning paths during generation, which can reduce the error rate, at the cost of more compute.
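A toy version of verifier-guided beam search, to show where the extra compute goes. Everything here is a stand-in: `expand` plays the model proposing next steps, `verifier_score` plays the verifier, and the task (pick four digits that sum to 10) is invented for illustration.

```python
from typing import Callable, List, Tuple

State = Tuple[int, ...]

def beam_search(expand: Callable[[State], List[State]],
                verifier_score: Callable[[State], float],
                start: State, beam_width: int, steps: int) -> State:
    """Keep only the `beam_width` best partial paths after each step."""
    beam = [start]
    for _ in range(steps):
        candidates = [child for state in beam for child in expand(state)]
        if not candidates:
            break
        # The verifier ranks partial paths so weak ones are pruned early.
        candidates.sort(key=verifier_score, reverse=True)
        beam = candidates[:beam_width]
    return beam[0]

def expand(state: State) -> List[State]:
    return [state + (digit,) for digit in (1, 2, 3)]

def verifier_score(state: State) -> float:
    return -abs(10 - sum(state))  # closer to the target sum is better

best = beam_search(expand, verifier_score, start=(), beam_width=2, steps=4)
```

Each step scores `beam_width` times as many candidates as greedy decoding would, which is the compute cost mentioned above.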

The closest (LLMish) techniques to LeCun’s vision are things like the recent Hierarchical Reasoning Model that has a layer of persisting hidden state, but it’s still pretty experimental!

As agentic and reasoning traces get longer, I’m sure we’ll see more entries documenting failure modes, and proposals for techniques to scale around them.
1. And the one being referenced in the post! ↩︎
2. In time — they characterize tasks based on how long it takes humans to do them, which is a good control factor ↩︎
3. As in additional content related to the question, but that doesn’t give the answer. ↩︎
4. Similar to process-based rewards this is somewhat pushing the problem to the ability to judge how well you are doing during the generation ↩︎
August 4, 2025

Modelling
The Tools Are Made Up
It has been hard to keep up with the flurry of strong agentic open-source models coming out of Chinese labs recently, including Moonshot’s Kimi K2, Z.ai’s GLM 4.5, and Qwen3-Coder¹.

Each of them has a mix of clever pre-training recipes and verifiable-rewards post-training. Notably, Kimi and GLM both use the Muon optimizer, which seems to be gaining ground among the OSS labs at least. GLM’s description of the recipe is as follows:

Our base model undergoes several training stages. During pre-training, the model is first trained on 15T tokens of a general pre-training corpus, followed by 7T tokens of a code & reasoning corpus. After pre-training, we introduce additional stages to further enhance the model’s performance on key downstream domains. Unlike the earlier pre-training stage on large-scale universal documents, these stages leverage medium-sized domain-specific datasets, including instruction data.

The additional stages, which they refer to as mid-training, extend the context window and help grow capabilities in specific domains. They then move to post-training, with SFT over reasoning and agentic traces followed by RL with Verified Rewards².

The Kimi-K2 technical report goes into more details about how to actually train for tool use. Unlike the others, Kimi is not a reasoning model so doesn’t use much in the way of extended thinking. The fact that this wasn’t required to get to strong levels of tool use/agentic capability feels pretty notable to me: most of the recent³ agentic models have been built on a reasoning foundation.

What I really found interesting from the Kimi report was the level of synthetic data that the team used. This starts in pretraining: to extend high quality data sources they rewrite them with another LLM, giving the same facts with new phrasing, instead of looping over the same “good” data for multiple epochs.

Their approach to tool training takes this kind of idea even further:

We construct a comprehensive tool repository through two complementary approaches. First, we directly fetch over 3,000 real MCP (Model Context Protocol) tools from GitHub repositories, leveraging existing high-quality tool specifications. Second, we systematically evolve 82 synthetic tools through a hierarchical domain generation process. We begin with key categories (e.g., financial trading, software applications, robot control), then evolve multiple specific application domains within each category. Specialized tools are then synthesized for each domain, with clear interfaces, descriptions, and operational semantics. This evolution process produces over 20,000 synthetic tools.

They analyze a set of real tools, generate some novel (but derivative) ones, then domain-specialize them for a lot of use cases.

Once they have this tool zoo, the actual training loop involves:
1. Randomly sample a subset of tools and give it to a new agent with a fresh system prompt. Generate tool-appropriate tasks with explicit success rubrics.
2. Run an LLM-driven user simulator to drive the agent, while running the tools in a sandbox that keeps state.
3. Filter trajectories using another LLM as judge to keep only successful ones for SFT
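The three steps above can be sketched as a rejection-sampling loop. This is only a skeleton under stated assumptions: `make_task`, `run_episode` and `judge` are hypothetical callables standing in for the task generator, the simulated user plus sandbox, and the LLM judge, none of which are specified at this level in the report.

```python
import random
from typing import Any, Callable, List, Sequence

def collect_sft_trajectories(
    tool_repo: Sequence[str],
    make_task: Callable[[List[str]], Any],
    run_episode: Callable[[List[str], Any], Any],
    judge: Callable[[Any, Any], bool],
    n_episodes: int,
    tools_per_agent: int,
    seed: int = 0,
) -> List[Any]:
    """Generate agent trajectories and keep only the ones the judge accepts."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n_episodes):
        tools = rng.sample(list(tool_repo), tools_per_agent)  # step 1: tool subset
        task = make_task(tools)                    # step 1: task plus success rubric
        trajectory = run_episode(tools, task)      # step 2: simulated user, sandboxed tools
        if judge(task, trajectory):                # step 3: judge filters for SFT
            kept.append(trajectory)
    return kept
```

The structure makes the cost profile clear: every kept SFT example pays for a task generation, a full simulated episode and a judging call, plus all the episodes that were thrown away.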
They’re using models at every stage to generate data and evaluate options. When it comes to the actual RL training, they lean on verifiable rewards wherever possible. Both they and the Qwen folks talk about their simulator setup for code⁴: thousands of sandbox environments.

For software engineering tasks, we collect a vast amount of pull requests and issues from GitHub to build software development environment that consists of user prompts/issues and executable unit tests. This environment was built on a robust sandbox infrastructure, powered by Kubernetes for scalability and security. It supports over 10,000 concurrent sandbox instances with stable performance, making it ideal for both competitive coding and software engineering tasks
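The "executable unit tests" are what make this a verifiable reward. A deliberately naive sketch of the idea follows; it runs code in-process with `exec`, which is exactly what the sandboxes exist to avoid, so treat it as an illustration of the scoring rule only. The function name is hypothetical.

```python
def unit_test_reward(candidate_source: str, test_source: str) -> float:
    """Return 1.0 if the candidate code runs and its tests pass, else 0.0.

    WARNING: exec on untrusted model output is unsafe outside a sandbox.
    """
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)  # define the model's functions
        exec(test_source, namespace)       # run assertions against them
    except Exception:
        # Syntax errors, crashes and failed assertions all score zero.
        return 0.0
    return 1.0
```

In a production setup each call would presumably be dispatched to an isolated sandbox instance, which is where the operational intensity comes from.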

The combination of very sophisticated synthetic data and operationally intense sandboxes seems like table stakes for the current agentic game, and one which a lot of labs have figured out. That feels very promising for growth in the capabilities of these models over time, particularly as we work out how best to distill them down to smaller sizes for inference.
1. Which seems a very solid model, but they haven’t released a lot of extra details about how they got there. One interesting component of the release though was that they forked Gemini CLI to make a qwen-code tool that works with any OpenAI compatible API, and I had some success locally plugging it into the smaller Qwen3 (non-coder) releases in case you were looking for some offline agentic capabilities! ↩︎
2. Then GLM is distilled between the RL and base version of the model, which apparently helps generalize. This seems like a fun and relatively simple way of smoothing out the learning. ↩︎
3. Though Claude 3.5 wasn’t, and that is really the trend-setter here I guess! ↩︎
4. And other tasks that allow fully verifiable rewards. They use other models to score softer domains like creative writing. ↩︎
July 30, 2025

Modelling
PyTorch Conference 2025
The schedule is up for the 2025 edition of the PyTorch conference, which is now at the Moscone West in San Francisco.

https://events.linuxfoundation.org/pytorch-conference/program/schedule/

There are a lot of great sessions, but I’ll highlight some I personally find particularly interesting:

Post-Training: Clearly a big theme this year, with some interesting talks from multiple groups:
- PyTorch Conference 2025: Verl: A Flexible and Efficient RL Framew… – The Bytedance seed team are doing some great work, and Verl seems like a strong post-training option
- PyTorch Conference 2025: Post-training at Scale in Native PyTorch… – Evan introduces the new torch package to replace Torchtune, designed for (scaled) RL post-training
- PyTorch Conference 2025: Maximizing Luck in Reinforcement Learnin… – Daniel from Unsloth always delivers a great talk
- PyTorch Conference 2025: An Open Source Post-Training Stack: Kube… – Anyscale’s ideas on a post-training stack
- PyTorch Conference 2025: Lifecycle of a Parameter – Philip Bontra… – Philip works on Torchtune and the new solution mentioned above, this is a “life of a parameter” lightning talk through various transformations.
General Training
- PyTorch Conference 2025: Monarch: A Distributed Execution Engine… – I’m pretty convinced single controller is the direction we all will go in, and Monarch is a great place to start.
- PyTorch Conference 2025: Efficient MoE Pre-training at Scale on A… – A walk through of training on AMD at a decent level of scale
- PyTorch Conference 2025: MX Training To Inference on B200 Using T… – MX number formats allow packing even more parameters into a few bits. Blackwell has support, so it’s interesting to see how well it performs.
- PyTorch Conference 2025: PyTorch APIs for High Performance MoE Tr… – Some of the best names in PyTorch performance tackling MoE training.
Kernel development
- PyTorch Conference 2025: The Future Is Tiled: Using CuTile and Ti… – CUDA is thread-based, but a lot of the successful kernel development languages (like Triton!) are block or tile based. TileIR is a low level representation of that kind of tile based programming which Nvidia teased back in March.
- PyTorch Conference 2025: PyTorch Symmetric Memory: A New Programm… – Symmetric memory allows kicking off comms without going back to the host, so you can fuse more stuff!
- PyTorch Conference 2025: Mojo + PyTorch: A Simpler, Faster Path T… – Mojo kind of fell off my radar, but using it as a kernel language is a super interesting idea!
- PyTorch Conference 2025: Helion: A High-level DSL for Kernel Auth… – My general take on kernel development is that we are expanding the stack both up and down from Triton to try and give the best range of development vs performance tradeoffs. Helion is a super interesting approach at going up: bridging a gap between PyTorch and Triton.
Compilers
- PyTorch Conference 2025: Lightning Talk: Dynamic Shape Recompilat… – Bob Ren fills in for Ed Yang’s classics with a talk about recompilations and dynamic shapes.
- PyTorch Conference 2025: Thunder: Distribute and Optimize Your Py… – A talk on Lightning AI’s PyTorch compiler
Inference
- PyTorch Conference 2025: vLLM: Easy, Fast, and Cheap LLM Serving… – vLLM is the most important project in inference right now, so this should be interesting!
- PyTorch Conference 2025: Everything Everywhere All at Once: Hardw… – Shockingly the Google talk includes TPUs, but I think this trend (heterogeneous serving) is pretty broadly applicable as we try and optimize our wattage!
- PyTorch Conference 2025: ExecuTorch 1.0: General Availability Sta… – Edge is inference too, and I’m glad to see ExecuTorch make it to GA
I’m looking forward to October!
July 25, 2025

ML Infrastructure
RL in the second half
The Second Half – Shunyu Yao – 姚顺雨

Extremely interesting post by Shunyu Yao of ReAct and Tree of Thought fame about where we got to with AI and where we are going. Read it for the spot-on description of the weirdness of reasoning as an RL concept, but my main takeaway was the refinement to the idea that the most important thing is having models that “do the benchmarks”.
To recap the game of the first half:
- We develop novel training methods or models that hillclimb benchmarks.
- We create harder benchmarks and continue the loop.
This game is being ruined because:

Even if we create harder benchmarks, pretty soon (and increasingly soon) they get solved by the recipe. […]

The recipe has essentially standardized and industried benchmark hillclimbing without requiring much more new ideas. As the recipe scales and generalizes well, your novel method for a particular task might improve it by 5%, while the next o-series model improve it by 30% without explicitly targeting it.
The post makes the point that the gap is benchmarks that more closely map to real-world problems.
when the intelligence is low, improving intelligence generally improves utility. But now, the general recipe is guaranteed to work under these assumptions. So the way to play the new game of the second half is
- We develop novel evaluation setups or tasks for real-world utility.
- We solve them with the recipe or augment the recipe with novel components. Continue the loop.
Shunyu works on computer use at OpenAI, so this is well within his wheelhouse, and I think it’s a compelling claim. Many folks¹ have talked about the capability overhang in LLMs: there is a large baseline ability to do things in the models, but eliciting that ability can be challenging. I tend to think of that similarly to how there are many words which we can comfortably understand, but are very unlikely to use ourselves in conversation². RL is our most powerful tool for eliciting capabilities, but it’s a somewhat blunt instrument. Having the right evals, eval environment and tasks helps train the agent to interact in a way which generalizes.

I wonder if, as we progress through this second phase, we will find signs of a similar “universal geometry” as has been suggested for pretraining in this elicitation: perhaps there is eventually a universal navigation³ towards where to generate in that space for different tasks. Maybe that’s what we’ll call AGI!
1. Jack Clark particularly ↩︎
2. Receptive vocabulary vs. productive vocabulary. ↩︎
3. A universal geometry of a vector field? ↩︎
July 22, 2025

links-and-recs
Quack CuteDSL Kernels

Dao-AILab/quack: A Quirky Assortment of CuTe Kernels

Tri Dao & co have a fun repo up called Quack: A Quirky Assortment of CuTe Kernels, all leveraging the CuTe-DSL. These are Hopper and Blackwell oriented kernels for a variety of common needs like softmax, layernorm and RMSNorm.

On top of that, they wrote a post on how to get speed of light (memory bound) kernels in CuTe-DSL. It goes through how to implement a reduction op across multiple tiers of memory using TensorSSA for thread level reductions, warp reduction with shuffle_sync_bfly and block reduction with shared memory. Even if you’re not writing CuTe, this is about as good an introduction to architecting memory bound ops as I have seen!
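The warp-level step is the least intuitive of the three tiers, so here is a pure-Python simulation of a butterfly reduction. It is not CuTe-DSL code; it only assumes the usual butterfly pattern, in which each lane exchanges with the lane whose index differs by an XOR with the current offset.

```python
from typing import Callable, List, Sequence

def butterfly_warp_reduce(
    lane_values: Sequence[float],
    op: Callable[[float, float], float] = lambda a, b: a + b,
) -> List[float]:
    """Simulate an XOR-shuffle all-reduce across the lanes of one warp."""
    width = len(lane_values)
    if width == 0 or width & (width - 1):
        raise ValueError("lane count must be a power of two")
    values = list(lane_values)
    offset = width // 2
    while offset >= 1:
        # All lanes read their partner's value from the previous round at once.
        values = [op(values[lane], values[lane ^ offset]) for lane in range(width)]
        offset //= 2
    return values

# 32 lanes need log2(32) = 5 rounds, after which every lane holds the total.
reduced = butterfly_warp_reduce(list(range(32)))
```

Because every lane ends up with the full result, no broadcast is needed before the block-level step takes over in shared memory.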

They also cover clustered reduction, leveraging multiple SMs:

In cluster reduction, we first send the current warp’s reduced value to all the peer thread block’s reduction buffer in peer’s SMEM. Such sending is conducted via a dedicated SM-to-SM fabric (as DSMEM). Then each warp fetches all warp’s values from their local reduction buffer, and reduces these values.

This does seem to help the kernels scale well to larger sizes:

We believe our outstanding performance at >= 65k input is due to our successful utilization of cluster reduction in H100. When the size of inputs are ultra long and depleting the SM’s registers and shared memory, without cluster reduction, we would have to switch to an online algorithm (like online softmax) otherwise we may get a massive register spilling that leads to significant throughput degradation.
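For context, the online softmax they mention as the fallback keeps a running max and a rescaled running sum, so the normalizer is built in one streaming pass instead of holding the whole row in registers. A plain Python sketch, assuming finite inputs:

```python
import math
from typing import List, Sequence

def online_softmax(xs: Sequence[float]) -> List[float]:
    """Softmax whose max and normalizer are computed in a single pass."""
    running_max = float("-inf")
    running_sum = 0.0
    for x in xs:
        new_max = max(running_max, x)
        # Rescale the sum accumulated so far to the new max, then add this term.
        running_sum = running_sum * math.exp(running_max - new_max) + math.exp(x - new_max)
        running_max = new_max
    # Second pass only normalizes; it needs no further reductions.
    return [math.exp(x - running_max) / running_sum for x in xs]
```

The rescaling step is the extra work the quote is alluding to: it trades additional arithmetic for not having to keep the full input resident.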

I also really appreciate this note of reality in their conclusion:

Hitting “speed-of-light” model memory throughput confirms that a carefully hand-crafted CuTe kernel can squeeze every byte across all memory hierarchies in the hardware. But that efficiency comes at the price of per-operator and even per input-shape tuning, which imposes a natural tradeoff between efficiency and development efforts

July 18, 2025

links-and-recs
Reinforcement Learning Continues To Be The Frontier
Back in 2021, OpenAI nixed its robotics team, leading to comments on Hacker News like “Reinforcement learning itself is a dead-end on a road to AI”. Now, in 2025 we are surrounded by RL post-trained reasoning models and Mary Meeker is using the word “unprecedented” a lot. This kind of skepticism/hype overlap is very common right now, as Helen Toner breaks down in her excellent recent post/talk on unresolved questions in AI:

Last year, we had coverage from the Wall Street Journal—really good reporting—about real challenges inside OpenAI with scaling up their pre-trained models and how difficult that was and how they weren’t happy with the results, and then on the literal same day we had the release of o3, the next generation of their reasoning model, and François Chollet—who’s famously skeptical—saying that it was a significant breakthrough on his ARC-AGI benchmark. So these very contradictory takes, both of which had some truth to them.

The framing used in that post is really useful: it’s less about “are we making progress?” and more “are we on the right branch of the tech tree?”

A lot of people thought RL was the wrong branch: after notable successes from DeepMind and OpenAI, RL had become a bit of a backwater, with some resurgence (in a limited form) from Reinforcement Learning with Human Feedback (RLHF) for preference tuning LLMs.

The reason people keep coming back to reinforcement learning is the ability to discover new things. Supervised learning is somewhat inherently bound by the dataset. A reinforcement process can continue to explore and find new strategies, like the famous examples of AlphaGo choosing moves humans wouldn’t have. Tim Lee has an excellent non-technical introduction to the evolution of RL that mentions this: Reinforcement Learning Explained

In short, imitation learning can rapidly teach a model to mimic the behaviors in its training data, but the model will easily get confused in unfamiliar environments. A model trained with reinforcement learning has a better chance of learning general principles that will be relevant in new and unfamiliar situations

In this direction, a recent paper, [2507.00432] Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning, suggests¹ that reasoning generalizes better from RL-driven learning than supervised fine-tuning.

RL-tuned models achieve significant gains on math reasoning while preserving positive transfer to other reasoning tasks and non-reasoning tasks, whereas SFT often incurs negative transfer on non-reasoning benchmarks. Second, PCA analysis of latent space confirms that RL induces minimal drift from backbone representations thus maintaining feature stability, while SFT produces larger latent shifts, especially in non-reasoning domains. Third, token-distribution analysis shows that RL selectively adjusts only a handful of task-relevant tokens, whereas SFT perturbs many irrelevant tokens, indicating RL’s more targeted optimization.

RLHF is implemented by first training a reward model based on human preference feedback: you give people two versions of an answer, they tell you which one they prefer, you then train a model to predict those ratings. That reward model becomes the scoring function during post-training.

Designing good reward functions has been somewhat of a dark art in RL. The agent optimizes what you ask for, which is not always what you really want². This “reward hacking” phenomenon makes RL agents somewhat brittle, prone to exploiting loopholes in environments in ways no one anticipated.

The recent reasoning models did so well because their rewards were verifiable: scores based on some ground-truth validation, often just a yes/no answer. Does the code compile? Does it pass a unit test? Can a math proof be verified by a formal logic reasoner? Or simply, is the answer correct? Nathan Lambert did a breakdown on where RL goes next:

The optimistic case for scaling current reinforcement learning with verifiable rewards (RLVR) techniques to next-generation language models, and maybe AGI or ASI depending on your religion, rests entirely on RL being able to learn on ever harder tasks. Where current methods are generating 10K-100K tokens per answer for math or code problems during training, the sort of problems people discuss applying next generation RL training to would be 1M-100M tokens per answer.
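To make "verifiable" concrete, here is a minimal sketch of two yes/no reward functions of the kind described above: an exact-match answer check and a unit-test check on generated code. The function names and test cases are hypothetical, and a real training setup would run candidate code in a sandbox rather than calling `exec` directly.

```python
def exact_match_reward(model_answer: str, ground_truth: str) -> float:
    """1.0 if the final answer matches the reference after trimming whitespace."""
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0

def unit_test_reward(candidate_source: str, func_name: str, cases) -> float:
    """1.0 if the code defines func_name and passes every (args, expected) case."""
    namespace = {}
    try:
        # Assumption for illustration only: production systems sandbox this step.
        exec(candidate_source, namespace)
        func = namespace[func_name]
        passed = all(func(*args) == expected for args, expected in cases)
        return 1.0 if passed else 0.0
    except Exception:
        # Syntax errors, missing functions and runtime crashes all score zero.
        return 0.0

cases = [((2, 3), 5), ((-1, 1), 0)]
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
broken = "def add(a, b) return a + b"

for name, source in [("good", good), ("bad", bad), ("broken", broken)]:
    print(name, unit_test_reward(source, "add", cases))
```

There is no learned judge to fool here, which is much of the appeal. The model can still game weak test cases, but not the scorer's taste.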

Lambert makes the point that even the very long-range tasks we have now (coding agents, deep research) are based around learning to be better at tasks individually, then stringing those together:

How to read this training method, which is likely similar for agents like Claude Code or Codex, is that current RL methods are helping the models get more robust at individual tasks that make up a longer trajectory rather than being trained on the end result of the trajectory itself. The final long-horizon behavior is put together with prompting and letting the model run longer, not sparse credit assignment. In the case of Deep Research the final measure of performance would actually look far closer to human preferences than verifiable rewards, and a large portion of that applies for Claude Code as well, where multiple solutions could solve a problem and it falls to human taste to say which is the best.

This problem of having to learn to act over a long time-horizon is a recurring one in RL. The best algorithms we have for reinforcement learning are online: the model learns “live” while interacting with the environment. But sometimes it’s a lot easier to collect data than it is to run an experiment: for example, it’s much safer to get a large amount of sensor input from driving cars around than it is to have a model driving a real car around and making mistakes. This is off-policy or offline RL, and it offers the promise of learning from much larger data sets.

Seohong Park recently wrote a great post breaking down how offline RL fails to scale up: Q-Learning Is Not Yet Scalable³. In the experiment there the team at Berkeley generate 1000x more data to try and scale offline RL, and still see the process breaking down:

Q-learning struggles to scale because the prediction targets are biased, and these biases accumulate over the horizon. The presence of bias accumulation is a fundamental limitation that is unique to Q-learning (TD learning). For example, there are no biases in prediction targets in other scalable objectives (e.g., next-token prediction, denoising diffusion, contrastive learning, etc.) or at least these biases do not accumulate over the horizon (e.g., BYOL, DINO, etc.).
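A toy illustration of that accumulation, which is my own sketch and not the experiment from the post: imagine a world where every reward is zero, so the true value of every state is exactly 0. If each action's Q estimate carries some zero-mean noise, the max in the Q-learning target picks whichever action got lucky, and that optimism is then bootstrapped into the next target. The action count, noise level and horizons below are arbitrary assumptions.

```python
import random

def bootstrapped_value_estimate(horizon, n_actions=4, noise_std=1.0,
                                gamma=1.0, rng=random):
    """Back a value estimate up through `horizon` TD steps with zero rewards.

    The true value is 0 everywhere, so whatever comes back is bias plus noise.
    """
    value = 0.0  # terminal state
    for _ in range(horizon):
        # Each action's estimate: bootstrapped target plus zero-mean error
        q_estimates = [gamma * value + rng.gauss(0.0, noise_std)
                       for _ in range(n_actions)]
        # The max operator favours whichever action drew the luckiest noise
        value = max(q_estimates)
    return value

def mean_bias(horizon, trials=2000, seed=0):
    """Average estimate over many trials; the true answer is 0."""
    rng = random.Random(seed)
    total = sum(bootstrapped_value_estimate(horizon, rng=rng)
                for _ in range(trials))
    return total / trials

biases = {h: mean_bias(h) for h in (1, 10, 100)}
for horizon, bias in biases.items():
    print(f"horizon={horizon:>3}  mean overestimate={bias:.2f}")
```

With `gamma=1.0` each backup adds another expected-maximum-of-noise term, so the overestimate grows roughly in proportion to the horizon. A next-token prediction target has no equivalent feedback loop, because its targets come from data rather than from the model's own estimates.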

Noted LLM-branch skeptic (and technically a very distant colleague) Yann LeCun has spoken a lot about a version of this kind of planning and world modelling problem, which he sees as inherent to the autoregressive nature of LLMs: the accumulation of errors over long time horizons.

One of his architectural bets is JEPA, and the recently released V-JEPA 2 paper is beginning to show how this could work. V-JEPA 2 is a self-supervised video world model trained on a million hours of YouTube video. The model learns by masking out parts of video frames and predicting them in latent (embedding) space rather than pixel space. After the pre-training, they freeze the encoder, generate tokens with it for a video and prepend those to a query for a pretrained LLM⁴. They fine-tune that LLM on video question answering data, and were able to get state-of-the-art question answering with that setup, despite the JEPA part of it being totally task agnostic.

Going a step further, they took the encoder and hooked it up to a small robot control model⁵. They trained it on some robot operator data for pick-and-place tasks. It learned to do a remarkably good job, without any reinforcement learning at all!

This is interesting because robotics has traditionally been an area where we have seen a lot of exploration (with success and disappointment!) with long-range RL. Andrew Stokols’ excellent post on ChinaTalk makes a good case that while the west has focused on AI in a brain-in-a-jar type way, there has been a concerted push in Beijing for Embodied AI (with Chinese Characteristics). China has a very strong base in manufacturing. Robotics, drones, autonomous vehicles are all being developed and deployed in the country.

One of the fundamental challenges robotics systems have to address is much tighter latency bounds: the world operates in real time, and running a big model may result in a smart robot that simply cannot respond quickly enough to be useful. The space has trended towards hierarchical models, with a controller model that emits chunked, higher-level actions (like “pick up at x”) and lower-level models that decode those chunks into a series of motor commands. While transformers are sometimes used autoregressively (take sequences of state and action, and predict the next action), many systems now use diffusion-based techniques that generate a whole trajectory at once. Physical Intelligence recently put out a paper on Real Time Chunking where they show you can start by generating a chunk, then continue the denoising process à la inpainting or fill-in-the-middle to generate the steps between the start and goal, allowing more real-time responses.

China putting a lot of eggs in the embodied AI basket is indirectly also a bet that methods to make those systems learn and adapt will mature. Some of those same techniques will inevitably apply to the (disembodied) agents that are currently the focus of big labs in the west.
1. One of the ways they corroborate this finding is by showing there is less KL divergence in the RL-trained model than in the SFT model, but limiting KL divergence is usually an explicit training objective in RL, and I’d imagine you could apply KL regularization to SFT as well if you wanted. ↩︎
2. A classic example from OpenAI: A reinforcement learning agent in a boat race game was given points for hitting targets, so it happily learned to drive in circles hitting the same targets forever, instead of actually finishing the race. Faulty reward functions in the wild | OpenAI ↩︎
3. Q-Learning is the most common class of algorithms for offline RL. ↩︎
4. They unsquash it into the hidden dimension size, and depending on how the numbers work out add some pooling. ↩︎
5. Much like with the LLM, they combine the video embeddings with model-specific tokens, in this case a state tracking input and the current state of the robot arm. ↩︎
July 9, 2025

note to self
ARPA and predicting the future
Statecraft recently re-ran an interview from 2023 with Jason Matheny, formerly of IARPA: https://www.statecraft.pub/p/how-to-predict-the-future-278

While defense policy and research are well outside my scope (and, I imagine, that of most folks reading), the problems of managing or working on uncertain, research-y projects in a volatile environment are pretty relatable:

Most of what we know from cognitive psychology and human judgment research over the last 50 years suggests that unstructured group deliberation might be one of the worst ways of making judgments, yet it’s the norm in most institutions.

Or this bit of career wisdom:

In general, people underestimate their own potential to make contributions to the most important problems. They overestimate how many people are already working on the most important problems. So many incredibly important problems are just really neglected. If you can’t figure out who’s working on something after a few days of homework, then it is probably a neglected problem. And it’s probably up to us to solve it.

Jason talks about looking for projects in the Goldilocks zone of probability of success (less than 50%, more than 5%) that open up interesting opportunities. I worked with a manager who was a strong advocate of the Heilmeier Catechism for evaluating projects, and have seen the value of using it as guidance when presenting and evaluating ideas:
1. What are you trying to do? Articulate your objectives using absolutely no jargon.
2. How is it done today, and what are the limits of current practice?
3. What is new in your approach and why do you think it will be successful?
4. Who cares? If you are successful, what difference will it make?
5. What are the risks?
6. How much will it cost?
7. How long will it take?
8. What are the mid-term and final “exams” to check for success?
Jason adds some interesting updates:

For instance, the Heilmeier questions don’t have a question about counterfactual impact: “Would this work get done otherwise?” The office tends not to rigorously assess the other funding streams going to solve this particular problem, and their likelihoods of success.

We also tend not to think much about strategic move and countermove. […]. It probably is prudent to assign at least a 10% probability to some exquisite, classified technology being stolen.

One thing I found myself talking about this week with a couple of folks was how good people get “lucky” a lot. I think these kinds of questions help navigate towards those more positive-surprise-filled spaces.
July 8, 2025

links-and-recs
