If you want to see what a very painful couple of months looks like for an ML research team, FAIR’s logbook from the OPT-175B pretraining in 2021 should top your list. The first few runs are basically:
- Loss exploded.
- Doesn’t learn.
- Loss exploded.
- Etc.
At each point the team tweaked hyperparameters: learning rates, weight decay, gradient clipping and so on, as well as adjusting how the model was distributed across various forms of parallelism. They swapped parts of the architecture (GeLU to ReLU, for example), dealt with hardware failures, routed around bad data, and tried to debug what was going on.
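For anyone who hasn’t lived through one of these runs, the knobs in question are mundane. Here’s a minimal PyTorch sketch of the kind of training step involved (this is not FAIR’s code; the model and values are illustrative stand-ins):

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024)  # stand-in for a real transformer
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

def train_step(batch: torch.Tensor, targets: torch.Tensor) -> float:
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(batch), targets)
    loss.backward()
    # Clip the global gradient norm: the "clipping" the logbook keeps tweaking.
    # Lowering max_norm (say 1.0 -> 0.3) is a common response to loss spikes.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()
    return loss.item()
```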
That was 2021 though; by now in 2026 we commonly train on tens of thousands of much more powerful GPUs. We have a broadly available body of knowledge on how to train massive models. The big labs are now full of serious people debating the moral meaning of perplexity with their in-house philosophers.
But the big labs don’t share those details. Of the frontier-ish labs, DeepSeek continue to be unusually open about their work. Their new one, DeepSeek v4, is pretty great! 1M tokens of context and close-to-frontier performance across multiple domains. Training, however, was… not smooth:
“Training trillion-parameter MoE models presents significant stability challenges, and DeepSeek-V4 series are no exception. We encountered notable instability challenges during training. While simple rollbacks could temporarily restore the training state, they proved inadequate as a long-term solution because they do not prevent the recurrence of loss spikes.”
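Mechanically, the rollback loop they describe might look something like this. A hedged sketch: the spike test, the window, and the checkpoint helper named in the comments are my assumptions, not anything from DeepSeek’s report:

```python
# Detect a loss spike and fall back to the last good checkpoint.
from collections import deque

SPIKE_FACTOR = 2.0   # assumed: a spike is loss > 2x the trailing average
WINDOW = 100         # assumed smoothing window, in steps

recent = deque(maxlen=WINDOW)

def is_spike(loss: float) -> bool:
    if len(recent) < WINDOW:
        recent.append(loss)   # still warming up
        return False
    spiked = loss > SPIKE_FACTOR * (sum(recent) / len(recent))
    if not spiked:
        recent.append(loss)   # keep spikes out of the trailing average
    return spiked

# In the training loop, something like:
#   if is_spike(loss):
#       load_checkpoint(last_good)   # hypothetical helper
# which is exactly the quote's complaint: this restores the training state
# but does nothing about whatever caused the spike, so it recurs.
```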
One helpful dynamic of model training is that you can often validate what will happen in a big model by training a small one first. But not always. Small-scale runs mostly work for vetting architectural tweaks, and they allow much more rapid experimentation and testing. When they don’t transfer, though, you can be in trouble: a change that appears to smooth out problems at small scale can mask others at large scale, where models have the capacity to learn weirder things. Fixing them late in training, when you are already into the gigawatts, hurts.
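To make that concrete, the small-scale check often amounts to running a candidate tweak up a ladder of model sizes and watching an instability metric. A sketch, where the sizes, the spike metric, and the train_proxy helper are all illustrative assumptions rather than anyone’s real pipeline:

```python
def spike_rate(losses: list[float], factor: float = 2.0, window: int = 100) -> float:
    """Fraction of steps where loss exceeds `factor` x the trailing mean."""
    spikes = 0
    for i in range(window, len(losses)):
        trailing = sum(losses[i - window:i]) / window
        spikes += losses[i] > trailing * factor
    return spikes / max(1, len(losses) - window)

# Usage, assuming a hypothetical train_proxy(n_params, tweak) that trains a
# small model with the candidate change and returns its per-step losses:
#
#   for n_params in (125e6, 350e6, 1.3e9):
#       losses = train_proxy(n_params, tweak="stale_routing")
#       print(f"{n_params:.0e} params -> spike rate {spike_rate(losses):.4f}")
#
# If the spike rate climbs with size, the tweak probably won't survive the
# full run. If it's flat, you've learned something, but not everything:
# that's exactly the trap above.
```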
The techniques DeepSeek used (expert routing based on stale parameters, plus clipping; there’s a sketch of what that might mean at the end of this piece) did seem to get them through training a massive model on 30-odd trillion tokens, which is an incredible accomplishment. But they are arguably band-aids, as the team readily call out:
“Although a comprehensive theoretical understanding of their underlying mechanisms remains an open question for now, we are sharing them openly to foster further exploration by the community”
Which has echoes of Noam Shazeer’s similar observation for SwiGLU:
“We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence.”
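For what it’s worth, here is my reading of “expert routing based on stale params”, as a PyTorch sketch: pick experts with a delayed snapshot of the router weights, so the routing decision isn’t chasing parameters that move every step, and clip the logits. Everything here (the staleness interval, the clip range, the live/stale split) is guesswork about what DeepSeek might mean, not their method:

```python
import torch
import torch.nn as nn

class StaleRouter(nn.Module):
    """MoE router that selects experts with a delayed copy of its weights."""

    def __init__(self, d_model: int, n_experts: int, refresh_every: int = 100):
        super().__init__()
        self.live = nn.Linear(d_model, n_experts, bias=False)
        # Delayed snapshot used for expert selection; a buffer, so it never
        # receives gradients.
        self.register_buffer("stale_w", self.live.weight.detach().clone())
        self.refresh_every = refresh_every  # assumed staleness interval
        self.step = 0

    def forward(self, x: torch.Tensor, k: int = 2):
        self.step += 1
        if self.step % self.refresh_every == 0:
            # Periodically sync the delayed copy from the live weights.
            self.stale_w.copy_(self.live.weight.detach())
        # Choose which experts fire using the stale weights...
        stale_logits = (x @ self.stale_w.t()).clamp(-10.0, 10.0)  # assumed clip
        topk = stale_logits.topk(k, dim=-1).indices
        # ...but compute the mixture weights with the live parameters, so the
        # router still trains.
        live_logits = self.live(x).clamp(-10.0, 10.0)
        gates = live_logits.gather(-1, topk).softmax(dim=-1)
        return topk, gates
```

The appeal of something like this is that routing flips become rarer and more predictable; the cost is that the router is always deciding based on slightly old information, which may be exactly the sort of mechanism that “remains an open question.”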