https://arxiv.org/abs/2502.03373
Very clear paper on how RL and SFT combine to elicit reasoning capabilities, with some practical takeaways:
- SFT on long chains of thought works much better than on short ones
- Reward shaping is important for getting stable scaling
- Reasoning behaviors (like backtracking) are probably already present in the base model, provided it is large enough
Based on these observations, the authors hypothesize that RL primarily guides the model to recombine skills it already internalized during pre-training into new behaviors that improve performance on complex problem-solving tasks.
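To make the reward-shaping takeaway concrete, here is a toy sketch of a length-aware shaped reward. The function name, constants, and exact curve are my own illustrative assumptions, not the paper's formula; the idea it illustrates is that correct answers reached with shorter chains of thought earn more reward, while wrong answers with longer chains are penalized less, nudging the model to keep thinking rather than fail fast:

```python
import math

def shaped_reward(correct: bool, cot_length: int, max_length: int = 4096) -> float:
    """Toy length-aware reward shaping (illustrative; not the paper's exact scheme).

    Correct answers: reward decays from 1.0 (short chain) toward 0.5 (long chain).
    Wrong answers: penalty eases from -1.0 (short chain) toward -0.5 (long chain).
    """
    t = min(cot_length, max_length) / max_length  # normalized length in [0, 1]
    c = math.cos(t * math.pi / 2)                 # 1.0 at t=0, ~0.0 at t=1
    if correct:
        return 0.5 + 0.5 * c   # shorter correct chains earn more
    return -0.5 - 0.5 * c      # longer wrong chains are penalized less
```

A smooth, bounded curve like this avoids the degenerate incentives of a raw accuracy reward, where the policy can drift toward ever-longer (or collapsed) outputs during RL scaling.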