https://arxiv.org/abs/2502.03373
Very clear paper on how RL and SFT combine to elicit reasoning capabilities, with some practical takeaways:
- SFT on long chains of thought works much better than on short ones
- Reward shaping is important for getting stable scaling
- Reasoning behaviors (like backtracking) are probably already present in the base model, provided it is large enough
Based on these observations, the authors hypothesize that RL primarily guides the model to recombine skills it already internalized during pre-training into new behaviors that improve performance on complex problem-solving tasks.
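To make the reward-shaping takeaway concrete, here is a toy sketch of a length-aware shaped reward. The function name, constants, and exact curve are my own illustrative assumptions, not the paper's formula; the idea it illustrates is that correct answers reached with shorter chains of thought earn more reward, while wrong answers with longer chains are penalized less, nudging the model to keep thinking rather than fail fast:

```python
import math

def shaped_reward(correct: bool, cot_length: int, max_length: int = 4096) -> float:
    """Toy length-aware reward shaping (illustrative; not the paper's exact scheme).

    Correct answers: reward decays from 1.0 (short chain) toward 0.5 (long chain).
    Wrong answers: penalty eases from -1.0 (short chain) toward -0.5 (long chain).
    """
    t = min(cot_length, max_length) / max_length  # normalized length in [0, 1]
    c = math.cos(t * math.pi / 2)                 # 1.0 at t=0, ~0.0 at t=1
    if correct:
        return 0.5 + 0.5 * c   # shorter correct chains earn more
    return -0.5 - 0.5 * c      # longer wrong chains are penalized less
```

A smooth, bounded curve like this avoids the degenerate incentives of a raw accuracy reward, where the policy can drift toward ever-longer (or collapsed) outputs during RL scaling.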