Demystifying Long Chain-of-Thought Reasoning in LLMs

https://arxiv.org/abs/2502.03373

Very clear paper on how RL and SFT combine to elicit reasoning capabilities, with some practical takeaways:

  • SFT on long chains of thought works much better than SFT on short ones
  • Reward shaping is important for stable RL scaling
  • Reasoning behaviors (like backtracking) are probably already present in the base model, if it’s big enough

As the authors put it: “Based on these observations, we hypothesize that RL primarily guides the model to recombine skills it already internalized during pre-training towards new behaviors to improve performance on complex problem-solving tasks.”
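To make the reward-shaping point concrete, here is a minimal sketch of a length-aware shaped reward. This is an illustrative scheme in the spirit of the paper's cosine-style shaping, not its exact formulation: the reward is interpolated with a cosine schedule over normalized chain-of-thought length, so correct answers taper gently with length while the penalty for incorrect answers softens as the chain grows (to avoid discouraging longer exploratory reasoning). The endpoints (`1.0`/`0.8` and `-1.0`/`-0.2`) and `max_len` are arbitrary assumptions for illustration.

```python
import math

def shaped_reward(correct: bool, cot_len: int, max_len: int = 4096) -> float:
    """Illustrative length-aware reward shaping with a cosine schedule.

    A sketch of the general idea only: bound the reward and vary it
    smoothly with chain-of-thought length so RL training stays stable.
    """
    t = min(cot_len, max_len) / max_len   # normalized length in [0, 1]
    cos_t = math.cos(t * math.pi)         # 1 at t=0, -1 at t=1
    if correct:
        # Correct answers: full reward for short CoTs, tapering slightly
        # with length (discourages needless padding).
        r_min, r_max = 0.8, 1.0
    else:
        # Incorrect answers: penalty that softens for longer CoTs, so the
        # model is not punished for attempting longer reasoning.
        r_min, r_max = -1.0, -0.2
        cos_t = -cos_t                    # flip: reward rises with length
    # Map cos_t from [-1, 1] onto [r_min, r_max].
    return r_min + (r_max - r_min) * (cos_t + 1) / 2
```

The key design property is that rewards stay in a fixed bounded range regardless of generation length, which is what keeps the RL gradient scale stable as chains of thought grow.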
