Understanding R1-Zero-Like training

https://arxiv.org/abs/2503.20783

Further support for elicitation, and a very good example of why it’s worth starting with the evals.

They started with evals for mathematical reasoning, then tested base DeepSeek, Qwen and Llama models with different prompt templates to see how they did before any RL. In doing so they discovered that Qwen did best with no template, and that the DeepSeek V3 base model would already produce “aha” moments (self-reflection in its reasoning) without any further tuning. Some of the examples are amusing, particularly the “awkward silence”:

In Pascal’s Triangle, every row starts and ends with 1, …
…
This can be calculated as: awkward silence Wait, I’m overthinking. Let’s try again. The number of elements in the first n rows of Pascal’s Triangle…

The second interesting takeaway for me was that the GRPO implementation (along with most PPO implementations) has a length bias: among wrong answers, long ones are penalized less per token than short ones, so training drifts toward longer wrong responses. Their corrected, unbiased approach reached the same performance with better token efficiency:

We also revisit the GRPO’s optimization bias with the Llama base model. The right plot of Fig. 8 compares the model performance and response length trained with GRPO and Dr. GRPO [Ed: the unbiased version]. We can clearly see that GRPO can produce the “double-increase” phenomenon, potentially leading to a misperception that long-CoT can also emerge on Llama models after math pretraining. Unfortunately, the increase of length might be due to the optimization bias
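
To make the length bias concrete, here is a minimal sketch of the per-token weight each sampled response receives under the two normalizations. It is my own toy illustration, not the paper's code: the function names, the example rewards and lengths, and the use of a population standard deviation are all assumptions, and the clipping and policy-ratio terms are left out entirely.

```python
from statistics import mean, pstdev


def grpo_token_weights(rewards, lengths, eps=1e-6):
    """Per-token weight under GRPO-style normalization (toy sketch).

    The advantage is standardized within the group, then the response's
    loss is averaged over its own tokens, i.e. divided by its length.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [((r - mu) / (sigma + eps)) / n for r, n in zip(rewards, lengths)]


def dr_grpo_token_weights(rewards):
    """Per-token weight with both normalizers removed (toy sketch).

    No division by the group std and no division by response length, so
    the weight is the same for every token (up to a constant factor).
    """
    mu = mean(rewards)
    return [r - mu for r in rewards]


# Hypothetical group of four samples for one question:
# reward 1.0 = correct, 0.0 = wrong; lengths are in tokens.
rewards = [1.0, 0.0, 0.0, 1.0]
lengths = [100, 50, 400, 100]

grpo = grpo_token_weights(rewards, lengths)
dr = dr_grpo_token_weights(rewards)

for r, n, g, d in zip(rewards, lengths, grpo, dr):
    print(f"reward={r} length={n:>3} grpo={g:+.4f} dr_grpo={d:+.2f}")
```

In this toy group the 50-token wrong answer is pushed down about eight times harder per token than the 400-token wrong answer under the GRPO-style weights, while the unnormalized weights treat the two identically. That is the sense in which "wrong but long" gets off lightly.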
