Very useful insight in this paper out of Stanford.
Test-time inference has emerged as a powerful paradigm for enabling language models to “think” longer and more carefully about complex challenges, much like skilled human experts. While reinforcement learning (RL) can drive self-improvement in language models on verifiable tasks, some models exhibit substantial gains while others quickly plateau.
The authors ran RL-based reasoning post-training on both Qwen 2.5 3B and Llama 3.2 3B. Both models learned, but Llama was consistently worse than Qwen, which is surprising given that both are strong models. Looking at the reasoning the models produced, they identified four distinct reasoning behaviors:
- verification
- backtracking
- subgoal setting
- backward chaining
They noticed that the Qwen base model already exhibited these behaviors more often, and that the RL process then amplified them further.
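To make "exhibits these behaviors more" concrete, here is a rough sketch of how one might count the four behaviors in a reasoning trace. This is not the authors' method: the cue phrases in `BEHAVIOR_CUES` and the function `count_behaviors` are my own hypothetical stand-ins, and a keyword match is a crude proxy for what a proper classifier would detect.

```python
import re
from collections import Counter

# Hypothetical cue phrases per behavior. Illustrative only, not taken from the paper.
BEHAVIOR_CUES = {
    "verification": [r"\blet me check\b", r"\bverify\b", r"\bdouble-check\b"],
    "backtracking": [r"\bthat doesn't work\b", r"\blet me try (?:a different|another)\b"],
    "subgoal_setting": [r"\bfirst,? i need to\b", r"\bstep \d+\b"],
    "backward_chaining": [r"\bworking backwards?\b", r"\bstarting from the target\b"],
}


def count_behaviors(trace: str) -> Counter:
    """Count cue-phrase matches for each behavior in a single reasoning trace."""
    text = trace.lower()
    counts = Counter()
    for behavior, patterns in BEHAVIOR_CUES.items():
        counts[behavior] = sum(len(re.findall(p, text)) for p in patterns)
    return counts


if __name__ == "__main__":
    example = (
        "First, I need to get close to 24. 6 * 5 = 30. "
        "Let me check: 30 - 4 = 26, that doesn't work. "
        "Let me try a different approach, working backwards from 24."
    )
    for behavior, n in count_behaviors(example).items():
        print(f"{behavior}: {n}")
```

Averaging these counts over many sampled traces per model would give a rough per-behavior rate to compare base models, under the assumption that the cue phrases are reasonably representative.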
While the larger Llama-3.1-70B showed generally increased activation of these behaviors compared to Llama-3.2-3B, this improvement was notably uneven — backtracking, in particular, remained limited even in the larger model.
They then used Claude to generate custom reasoning traces that intentionally demonstrated all four behaviors.
We generate these datasets using Claude-3.5-Sonnet, leveraging its ability to produce reasoning trajectories with precisely specified behavioral characteristics. While Claude does not always produce the correct answer (see Fig. 9), it consistently demonstrates the requested reasoning patterns, providing clean behavioral primitives for our analysis.
They found that fine-tuning on this SFT set before RL closed most of the gap between Llama and Qwen. They also found that the reasoning traces don’t even need to be correct: at the SFT stage, demonstrating the behavior matters more than getting the right answer.
Priming models with cognitive behaviors, by a small amount of finetuning, enabled significant performance gains even in models that initially lack these capabilities. Remarkably, this holds even when primed with incorrect solutions that exhibit the target behavioral patterns, suggesting that cognitive behaviors matter more than solution accuracy.
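As a sketch of what "behavior matters more than correctness" could look like when assembling a priming set: select traces by which behaviors they show and deliberately ignore whether the final answer is right. The `Trace` type and `select_priming_examples` function are hypothetical names of mine, and the behavior labels are assumed to come from some upstream annotator; none of this is the paper's actual pipeline.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Trace:
    """A generated reasoning trace with (assumed) behavior labels attached."""
    text: str
    behaviors: set[str] = field(default_factory=set)
    is_correct: bool = False  # recorded, but intentionally unused for selection


def select_priming_examples(traces: list[Trace], required: set[str]) -> list[Trace]:
    """Keep traces that show every required behavior, regardless of correctness."""
    return [t for t in traces if required <= t.behaviors]


if __name__ == "__main__":
    pool = [
        Trace("wrong answer, but checks and backtracks",
              {"verification", "backtracking"}, is_correct=False),
        Trace("right answer, no backtracking",
              {"verification"}, is_correct=True),
    ]
    picked = select_priming_examples(pool, {"verification", "backtracking"})
    print([t.text for t in picked])
```

The point of the design is only that `is_correct` never enters the filter, so an incorrect trace with the target behaviors is kept while a correct one without them is dropped.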
This fits with the elicitation idea: the SFT is training a style. With these reasoning styles showing up more often, the RL process is better able to explore them and reinforce extended reasoning.
This also fits with my mental model that a base model’s capabilities are often pretty underexplored: a combination of targeted SFT + RL seems to be a very powerful elicitation tool!