The Second Half – Shunyu Yao – 姚顺雨
Extremely interesting post by Shunyu Yao of ReAct and Tree of Thoughts fame about where we got to with AI and where we are going. Read it for the spot-on description of the weirdness of reasoning as an RL concept, but my main takeaway was the refinement of the idea that the most important thing is having models that “do the benchmarks”.
> To recap the game of the first half:
>
> - We develop novel training methods or models that hillclimb benchmarks.
> - We create harder benchmarks and continue the loop.
>
> This game is being ruined because:
>
> Even if we create harder benchmarks, pretty soon (and increasingly soon) they get solved by the recipe. […]
>
> The recipe has essentially standardized and industrialized benchmark hillclimbing without requiring many more new ideas. As the recipe scales and generalizes well, your novel method for a particular task might improve it by 5%, while the next o-series model improves it by 30% without explicitly targeting it.
The post makes the point that the gap lies in benchmarks that more closely map to real-world problems:
> when the intelligence is low, improving intelligence generally improves utility. But now, the general recipe is guaranteed to work under these assumptions. So the way to play the new game of the second half is:
>
> - We develop novel evaluation setups or tasks for real-world utility.
> - We solve them with the recipe or augment the recipe with novel components. Continue the loop.
Shunyu works on computer use at OpenAI, so this is well within his wheelhouse, and I think it’s a compelling claim. Many folks[^1] have talked about the capability overhang of LLMs: there is a large baseline ability to do things in the models, but eliciting that ability can be challenging. I tend to think of that similarly to how there are many words which we can comfortably understand, but are very unlikely to use ourselves in conversation[^2]. RL is our most powerful tool for eliciting capabilities, but it’s a somewhat blunt instrument. Having the right evals, eval environments, and tasks helps train the agent to interact in a way which generalizes.
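To make “evaluation setups or tasks” concrete, here is a minimal sketch of the skeleton such a harness reduces to. All of the names here (`Task`, `check`, `run_eval`) are my own illustrations, not anything from Yao’s post or OpenAI; real agentic evals add tool use, multi-step environments, and transcripts on top of this.

```python
# Hypothetical sketch of an eval harness: an agent is any function from
# prompt to output; a task pairs an instruction with a programmatic
# success check; the score is the fraction of checks that pass.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    prompt: str                    # the instruction the agent receives
    check: Callable[[str], bool]   # did the agent's output succeed?

def run_eval(agent: Callable[[str], str], tasks: list[Task]) -> float:
    """Score an agent as the fraction of tasks whose check passes."""
    passed = sum(task.check(agent(task.prompt)) for task in tasks)
    return passed / len(tasks)

# Toy usage: an "agent" that always says "OK" passes one of two tasks.
tasks = [
    Task("Say the word ok", lambda out: "ok" in out.lower()),
    Task("Name a prime number", lambda out: any(p in out for p in "2357")),
]
print(run_eval(lambda prompt: "OK", tasks))  # 0.5
```

The interesting design work in the second half lives almost entirely in choosing the tasks and the `check` functions so that passing them actually tracks real-world utility, rather than benchmark hillclimbing.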
I wonder if, as we progress through this second half, we will find signs in this elicitation of a “universal geometry” like the one suggested for pretraining: perhaps there is eventually a universal navigation[^3] towards where to generate in that space for different tasks. Maybe that’s what we’ll call AGI!
[^1]: Jack Clark, particularly.
[^2]: Receptive vocabulary vs. productive vocabulary.
[^3]: A universal geometry of a vector field?