Training Language Models to Self-Correct via Reinforcement Learning – Aviral Kumar, Vincent Zhuang et al., Google DeepMind
Observes that collecting traces of LLM output, reflection, and corrected output (however those traces are generated) and using them for SFT tends to teach the model a minimal-edit strategy: it makes either no self-correction at all or only superficial edits to the original answer.
They find that such approaches fall prey to either: (1) distribution shift, where the trained model can correct errors made by the base model that generated the data, but these gains do not transfer to self-correction of the learned model’s own mistakes; or (2) behavior collapse, where the model simply learns to produce its best first-attempt response followed by superficial or no modifications in the second attempt.
To avoid this, they use a two-stage RL process: Stage I maximizes the reward of the second attempt while constraining the first attempt to stay close to the base model's distribution (via a KL penalty), so the model cannot collapse onto an optimized first answer; Stage II then jointly optimizes both attempts with reward shaping that gives a bonus for improving from the first attempt to the second.
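A toy sketch of the two training objectives (function names, `beta`, and `alpha` are illustrative assumptions, not the paper's implementation; rewards are assumed to be scalar correctness scores):

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions (lists of probs)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def stage1_objective(reward_attempt2, policy_attempt1, base_attempt1, beta=1.0):
    """Stage I: maximize second-attempt reward while penalizing drift of the
    first-attempt distribution away from the base model (KL constraint)."""
    return reward_attempt2 - beta * kl(policy_attempt1, base_attempt1)

def stage2_shaped_reward(r1, r2, alpha=1.0):
    """Stage II: reward both attempts, with a shaping bonus proportional to
    the improvement from first to second attempt, so correcting a wrong
    answer pays more than repeating an already-correct one."""
    return r2 + alpha * (r2 - r1)
```

For example, `stage2_shaped_reward(0, 1)` (wrong then corrected) yields 2.0, while `stage2_shaped_reward(1, 1)` (correct then unchanged) yields only 1.0, so the shaped reward favors genuine correction over behavior collapse.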