When I was writing recently about MoEs, I was focused mostly on the architectural reasons we use them. One thing I hadn’t considered is that they might actually be better at learning as well.
Meanwhile, Deconstructing Pre-training: Knowledge Attribution Analysis in MoE and Dense Models
Our findings reveal that MoE architectures form a low entropy backbone of consistently reinforced neurons, which leads to an early consolidation of their importance profiles and, in turn, underpins their functional robustness. This resilience, stemming from more distributed knowledge storage, contrasts with the greater brittleness and knowledge concentration in the dense model. These phenomena collectively demonstrate that architectural sparsity is not merely a computational shortcut but also acts as a useful inductive bias that fosters stable and robust learning.
To land that somewhere between academic prose and GPT-speak1: the paper’s results suggest that MoEs learn more effectively and store their core knowledge more robustly.
They measure this with Log-Probability Increase (LPI), which estimates how much each column of a layer’s output projection contributes to the model’s final log-probabilities. It gives you a sense of how much smarter the model gets from that specific chunk of the weights2. They track this “neuron importance” measure over multiple checkpoints using the (very!) open models from AI2, OLMo-7B and OLMoE-1B-7B.
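To make the attribution idea concrete, here is a toy first-order sketch of how a per-column contribution like this can be computed. This is my reading of the general technique, not the paper’s exact formulation, and every name and shape below is made up for illustration: each neuron i writes acts[i] * W_out[:, i] into the residual stream, and its dot product with the correct token’s unembedding vector approximates its effect on that token’s logit.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32  # toy sizes, not the real models'

# Hypothetical layer weights and one token's activations
W_out = rng.normal(size=(d_model, d_ff))   # MLP down-projection
acts = rng.normal(size=d_ff)               # post-nonlinearity activations
u_correct = rng.normal(size=d_model)       # unembedding row of the correct token

# First-order attribution: neuron i contributes acts[i] * W_out[:, i]
# to the residual stream; dotting with the correct token's unembedding
# approximates its contribution to that token's logit (and hence,
# to first order, its log-probability).
contrib = acts * (u_correct @ W_out)       # shape (d_ff,)

# The per-neuron contributions sum exactly to the layer's full effect
# on the logit, which is what makes this a clean decomposition.
assert np.isclose(contrib.sum(), u_correct @ (W_out @ acts))

# "Neuron importance" would then be this quantity accumulated over
# many tokens and tracked across training checkpoints.
top = np.argsort(-np.abs(contrib))[:5]
print("top neurons by |contribution|:", top)
```

The useful property is the last assertion: the neuron-level scores tile the layer’s total logit contribution with nothing left over, so ranking neurons by it is well-defined.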
In the MoE, the set of important weights is more stable and stabilizes earlier in training: the model develops a core of understanding and builds on that. This might mean MoE training is genuinely more effective than dense training. The dense model regularly thrashes its core understanding as updates come in, while the MoE protects it and lets the model focus more on nuance.
Or! It might be entirely an artifact of model differences. As the authors note, the two models are quite different: different training data sets, different lengths of training, and different depths (16 vs 32 layers), as well as, you know, being an MoE or not. Finally, the actual LPI version they use3, Gated-LPI, bakes in the MoE routing. It’s not totally clear whether we are seeing “neurons that matter” or mostly seeing “routing patterns that matter”.
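To see why baking in the gate muddies things, here is a toy sketch of a gated attribution, again with made-up names and shapes rather than the paper’s actual definition: each expert’s neuron contributions get multiplied by the router’s gate value, so a neuron can score as “important” partly because the router keeps sending tokens its way, not because its weights encode much.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_ff, n_experts = 8, 16, 4  # toy sizes

# Hypothetical per-expert down-projections and one token's activations
W_out = rng.normal(size=(n_experts, d_model, d_ff))
acts = rng.normal(size=(n_experts, d_ff))
u_correct = rng.normal(size=d_model)

# Router softmax gates for this token (nonnegative, sum to 1)
logits = rng.normal(size=n_experts)
gates = np.exp(logits) / np.exp(logits).sum()

# Gated attribution: neuron (e, i)'s score is its ungated logit
# contribution scaled by gates[e]. A mediocre neuron behind a strong
# gate can outrank a good neuron behind a weak one -- neuron
# importance and routing behavior are entangled in one number.
ungated = acts * np.einsum("d,edf->ef", u_correct, W_out)  # (n_experts, d_ff)
per_neuron = gates[:, None] * ungated

print("per-expert total attribution:", per_neuron.sum(axis=1))
```

Stable “important neurons” across checkpoints could then just as easily reflect a router that settled early as weights that did.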
Even with that skepticism, I do think4 this is likely showing something interesting. The “smearing” of knowledge across weights is how I described what we are trying to avoid with MoEs, and it would be useful to have a more mechanistic understanding of how that actually happens. The authors observe that the stability curve rises, drops, and consolidates. Even if this is just an artifact of routing, it’s quite possible there is a critical phase in training where that routing locks in.
If that idea is right, we might already be shaping that phase. The load-balancing tricks that made MoEs practical could be doing double duty as scaffolds for learning.