Back in 2021, OpenAI nixed its robotics team, leading to comments on Hacker News like “Reinforcement learning itself is a dead-end on a road to AI”. Now, in 2025 we are surrounded by RL post-trained reasoning models and Mary Meeker is using the word “unprecedented” a lot. This kind of skepticism/hype overlap is very common right now, as Helen Toner breaks down in her excellent recent post/talk on unresolved questions in AI:
Last year, we had coverage from the Wall Street Journal—really good reporting—about real challenges inside OpenAI with scaling up their pre-trained models and how difficult that was and how they weren’t happy with the results, and then on the literal same day we had the release of o3, the next generation of their reasoning model, and François Chollet—who’s famously skeptical—saying that it was a significant breakthrough on his ARC-AGI benchmark. So these very contradictory takes, both of which had some truth to them.
The framing used in that post is really useful: it’s less about “are we making progress?” and more “are we on the right branch of the tech tree?”
A lot of people thought RL was the wrong branch: after its notable early successes at DeepMind and OpenAI, RL had become a bit of a backwater, resurfacing only in a limited form as Reinforcement Learning from Human Feedback (RLHF) for preference-tuning LLMs.
The reason people keep coming back to reinforcement learning is its ability to discover new things. Supervised learning is inherently bound by its dataset; a reinforcement learning process can keep exploring and find new strategies, like the famous examples of AlphaGo choosing moves no human would have played. Tim Lee has an excellent non-technical introduction to the evolution of RL that mentions this: Reinforcement Learning Explained
In short, imitation learning can rapidly teach a model to mimic the behaviors in its training data, but the model will easily get confused in unfamiliar environments. A model trained with reinforcement learning has a better chance of learning general principles that will be relevant in new and unfamiliar situations.
In this direction, a recent paper, [2507.00432] Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning, suggests1 that reasoning generalizes better from RL-driven learning than supervised fine-tuning.
RL-tuned models achieve significant gains on math reasoning while preserving positive transfer to other reasoning tasks and non-reasoning tasks, whereas SFT often incurs negative transfer on non-reasoning benchmarks. Second, PCA analysis of latent space confirms that RL induces minimal drift from backbone representations thus maintaining feature stability, while SFT produces larger latent shifts, especially in non-reasoning domains. Third, token-distribution analysis shows that RL selectively adjusts only a handful of task-relevant tokens, whereas SFT perturbs many irrelevant tokens, indicating RL’s more targeted optimization.
RLHF is implemented by first training a reward model on human preference feedback: you show people two versions of an answer, ask which one they prefer, and then train a model to predict those preferences. That reward model becomes the scoring function during post-training.
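A minimal sketch of that reward-model step, with a linear model and the Bradley-Terry pairwise loss standing in for a real neural reward model (the feature vectors, dimensions, and hyperparameters here are invented purely for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_reward_model(chosen, rejected, lr=0.1, steps=500):
    """Fit a linear reward r(x) = w @ x on preference pairs with the
    Bradley-Terry loss: -log sigmoid(r(chosen) - r(rejected))."""
    w = np.zeros(chosen.shape[1])
    for _ in range(steps):
        margin = (chosen - rejected) @ w  # r(chosen) - r(rejected) per pair
        grad = -((1.0 - sigmoid(margin))[:, None] * (chosen - rejected)).mean(axis=0)
        w -= lr * grad
    return w

# Toy "answer features": the first feature happens to correlate with
# which answer the raters preferred.
rng = np.random.default_rng(0)
chosen = rng.normal(loc=[1.0, 0.0], size=(256, 2))
rejected = rng.normal(loc=[-1.0, 0.0], size=(256, 2))

w = train_reward_model(chosen, rejected)
# How often the learned reward ranks the preferred answer higher:
accuracy = np.mean(chosen @ w > rejected @ w)
```

In real RLHF the linear map is replaced by a fine-tuned LLM with a scalar head, but the loss and the pairwise data format are the same idea.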
Designing good reward functions has long been something of a dark art in RL. The agent optimizes what you ask for, which is not always what you really want2. This “reward hacking” phenomenon makes RL agents brittle, prone to exploiting loopholes in their environments in ways no one anticipated.
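Spelled out on the boat-race example from the footnote, the failure mode is easy to see: a proxy reward that counts checkpoint touches scores a degenerate looping policy above one that actually finishes (everything here is a made-up toy, not the actual game):

```python
# The designer's proxy reward: +1 every time the boat touches a checkpoint.
def proxy_return(actions):
    return sum(1 for a in actions if a == "hit_checkpoint")

# The behavior we actually wanted: touch the checkpoints, then finish.
finish_race = ["hit_checkpoint"] * 3 + ["cross_finish_line"]

# The behavior the agent discovers: circle one checkpoint forever.
loop_forever = ["hit_checkpoint"] * 50
```

Under the proxy, `loop_forever` dominates `finish_race`, so a competent optimizer will find it.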
The recent reasoning models did so well because their rewards were verifiable: scores grounded in some ground-truth check, often just a yes or no. Does the code compile? Does it pass a unit test? Can a math proof be verified by a formal logic reasoner? Or, simply, is the answer correct? Nathan Lambert did a breakdown on where RL goes next:
The optimistic case for scaling current reinforcement learning with verifiable rewards (RLVR) techniques to next-generation language models, and maybe AGI or ASI depending on your religion, rests entirely on RL being able to learn on ever harder tasks. Where current methods are generating 10K-100K tokens per answer for math or code problems during training, the sort of problems people discuss applying next generation RL training to would be 1M-100M tokens per answer.
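In the RLVR setting, the reward function can be as simple as executing the model's answer against held-out tests. A minimal sketch (the `solve` contract and the test format are made up for illustration):

```python
def verifiable_reward(candidate_code: str, test_cases) -> float:
    """Binary reward: 1.0 if the generated code defines a `solve` that
    passes every test case, 0.0 otherwise (including crashes)."""
    namespace = {}
    try:
        exec(candidate_code, namespace)  # "does it even run" is part of the check
        solve = namespace["solve"]
        passed = all(solve(x) == y for x, y in test_cases)
        return 1.0 if passed else 0.0
    except Exception:
        return 0.0  # syntax error, missing function, runtime crash: no reward

# Ground-truth tests for "square the input".
tests = [(2, 4), (3, 9), (-1, 1)]
good_answer = "def solve(x):\n    return x * x\n"
bad_answer = "def solve(x):\n    return x + x\n"  # passes (2, 4) but not (3, 9)
```

There is no reward model to fool here: the signal comes straight from ground truth, which is what makes these rewards so much harder to hack than learned preference scores.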
Lambert makes the point that even the very long-range tasks we have now (coding agents, deep research) are based around learning to be better at tasks individually, then stringing those together:
How to read this training method, which is likely similar for agents like Claude Code or Codex, is that current RL methods are helping the models get more robust at individual tasks that make up a longer trajectory rather than being trained on the end result of the trajectory itself. The final long-horizon behavior is put together with prompting and letting the model run longer, not sparse credit assignment. In the case of Deep Research the final measure of performance would actually look far closer to human preferences than verifiable rewards, and a large portion of that applies for Claude Code as well, where multiple solutions could solve a problem and it falls to human taste to say which is the best.
This problem of having to learn to act over a long time horizon is a recurring one in RL. The best algorithms we have for reinforcement learning are online: the model learns “live” while interacting with the environment. But sometimes it’s a lot easier to collect data than to run an experiment: for example, it’s much safer to gather a large amount of sensor input from cars driving around than to have a model drive a real car and make mistakes. This is offline (off-policy) RL, and it offers the promise of learning from much larger datasets.
Seohong Park recently wrote a great post breaking down how offline RL fails to scale up: Q-Learning Is Not Yet Scalable3. In the experiments there, the team at Berkeley generates 1000x more data to try to scale offline RL, and still sees the process break down:
Q-learning struggles to scale because the prediction targets are biased, and these biases accumulate over the horizon. The presence of bias accumulation is a fundamental limitation that is unique to Q-learning (TD learning). For example, there are no biases in prediction targets in other scalable objectives (e.g., next-token prediction, denoising diffusion, contrastive learning, etc.) or at least these biases do not accumulate over the horizon (e.g., BYOL, DINO, etc.).
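The upward bias from the max in the TD target, and how it grows with horizon, can be demonstrated in a few lines. This is a toy illustration of overestimation bias, not the paper's experiment; in real Q-learning the compounding happens through repeated bootstrapping rather than this literal sum:

```python
import numpy as np

rng = np.random.default_rng(0)

def max_bias(n_actions, noise_std, samples=10_000):
    """All actions have true value 0, but our estimates are noisy.
    The max inside the TD target r + gamma * max_a Q(s', a) picks out
    the noise, so E[max Q_hat] > max E[Q_hat] = 0."""
    estimates = rng.normal(0.0, noise_std, size=(samples, n_actions))
    return estimates.max(axis=1).mean()

one_step_bias = max_bias(n_actions=4, noise_std=1.0)

# Each backup bootstraps on an already-biased target, so with gamma = 1
# the bias grows roughly linearly with the horizon (10 steps here).
horizon_bias = sum(max_bias(4, 1.0) for _ in range(10))
```

This is exactly the contrast Park draws with next-token prediction or diffusion: those objectives regress against fixed targets, so whatever error they have does not feed back into the targets of later steps.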
Noted LLM-branch skeptic (and technically a very distant colleague) Yann LeCun has spoken a lot about a version of this kind of planning and world modelling problem, which he sees as inherent to the autoregressive nature of LLMs: the accumulation of errors over long time horizons.
One of his architectural bets is JEPA, and the recently released V-JEPA 2 paper is beginning to show how this could work. V-JEPA 2 is a self-supervised video world model trained on a million hours of YouTube video. The model learns by masking out parts of video frames and predicting them in latent (embedding) space rather than pixel space. After pre-training, they freeze the encoder, generate tokens with it for a video, and prepend those to a query for a pretrained LLM4. They fine-tune that LLM on video question answering data and were able to get state-of-the-art question answering with that setup, despite the JEPA part of it being totally task-agnostic.
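The core pre-training idea, regressing masked patches in embedding space rather than pixel space, can be sketched roughly like this (the tiny linear encoder, alternating mask, and trivial averaging predictor are all stand-ins for illustration, not the V-JEPA 2 architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 16 patches per frame, 8-dim pixels, 4-dim latents.
W_enc = rng.normal(size=(8, 4))        # stand-in for a learned encoder

def encode(patches):
    return patches @ W_enc             # patches -> latent embeddings

patches = rng.normal(size=(16, 8))     # one frame's worth of patches
latents = encode(patches)

mask = np.arange(16) % 2 == 0          # toy scheme: hide every other patch
context = latents[~mask]               # embeddings the predictor can see

# A real predictor is a trained network; a context average stands in here.
predicted = np.broadcast_to(context.mean(axis=0), latents[mask].shape)

# The training loss compares *embeddings*, never reconstructed pixels:
latent_loss = np.mean((predicted - latents[mask]) ** 2)
```

The design choice worth noticing is the loss target: by scoring predictions in latent space, the model is never forced to reproduce pixel-level detail that is irrelevant to the scene's dynamics.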
Going a step further, they took the encoder and hooked it up to a small robot control model5. They trained it on some robot operator data for pick-and-place tasks. It learned to do a remarkably good job, without any reinforcement learning at all!
This is interesting because robotics has traditionally been an area with a lot of exploration (with both success and disappointment!) in long-range RL. Andrew Stokols’ excellent post on ChinaTalk makes a good case that while the west has focused on AI in a brain-in-a-jar way, there has been a concerted push in Beijing for Embodied AI (with Chinese Characteristics). China has a very strong base in manufacturing, and robotics, drones, and autonomous vehicles are all being developed and deployed in the country.
One of the fundamental challenges robotics systems have to address is much tighter latency bounds: the world runs in real time, and a big model may yield a smart robot that simply cannot respond quickly enough to be useful. The field has trended toward hierarchical models: a high-level controller emits chunked actions (like “pick up at x”), and lower-level models decode those chunks into sequences of motor commands. While transformers are sometimes used autoregressively (take sequences of state and action, predict the next action), many systems now use diffusion-based techniques that generate a whole trajectory at once. Physical Intelligence recently put out a paper on Real Time Chunking showing you can generate a chunk, then continue the denoising process à la inpainting or fill-in-the-middle to generate the steps between the start and the goal, allowing more real-time responses.
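The hierarchical split can be sketched as a toy two-level controller. The chunk vocabulary and the linear-interpolation decoder below are invented for illustration; in real systems both levels are learned models, with the decoder running at a much higher frequency than the planner:

```python
def plan(goal):
    """High-level controller: a few chunked actions at low frequency."""
    return [("move_to", goal), ("close_gripper",), ("move_to", (0.0, 0.0))]

def decode(chunk, state, steps=10):
    """Low-level decoder: expand one chunk into per-tick motor commands.
    Here: linear interpolation toward the target over `steps` ticks."""
    if chunk[0] == "move_to":
        tx, ty = chunk[1]
        x, y = state
        return [(x + (tx - x) * (i + 1) / steps,
                 y + (ty - y) * (i + 1) / steps) for i in range(steps)]
    return [state]  # gripper chunks hold position for one tick

# Three chunks from the planner become 21 motor commands.
state = (0.0, 0.0)
trajectory = []
for chunk in plan((1.0, 1.0)):
    segment = decode(chunk, state)
    trajectory.extend(segment)
    state = segment[-1]
```

The latency win comes from this asymmetry: the slow, smart model only has to run once per chunk, while the cheap decoder fills in the ticks in between.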
China putting a lot of eggs in the embodied AI basket is indirectly also a bet that methods to make those systems learn and adapt will mature. Some of those same techniques will invariably apply to the (disembodied) agents that are currently the focus of big labs in the west.
- One of the ways they corroborate this finding is by observing less KL divergence in the RL-trained model than in the SFT model, but KL divergence is usually a training objective in RL, and I’d imagine you could apply KL regularization to SFT as well if you wanted. ↩︎
- A classic example from OpenAI: A reinforcement learning agent in a boat race game was given points for hitting targets, so it happily learned to drive in circles hitting the same targets forever, instead of actually finishing the race. Faulty reward functions in the wild | OpenAI ↩︎
- Q-Learning is the most common class of algorithms for offline RL. ↩︎
- They unsquash it into the hidden dimension size, and depending on how the numbers work out add some pooling. ↩︎
- Much like with the LLM, they combine the video embeddings with model-specific tokens, in this case a state tracking input and the current state of the robot arm. ↩︎