Last year Keller Jordan at OpenAI beat several of the existing NanoGPT speedrun records thanks to optimizer improvements. Towards the end of the year the work was formalized as the Muon optimizer, and it’s now making waves in a bunch of areas.

Jeremy Bernstein has written up a great post on how Muon is derived:
“To handle individual layers, our idea is to normalize the weight updates in a clever way so that, given the structure of the inputs, the weight updates automatically induce a desirable effect on the outputs. As a community, we have invested so much effort into thinking about how to normalize the activations: think batch norm, layer norm, RMS norm, etc. Why not also consider how the weights and weight updates influence the activations?”
Keller also wrote a detailed blog post when introducing the optimizer, calling out some open questions (such as whether it scales to very large training runs).
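To make "normalizing the weight updates" a bit more concrete: as I read the two posts, the normalization amounts to orthogonalizing the update matrix, i.e. keeping its directions but setting every singular value to 1. The sketch below does that with an exact SVD purely for illustration. The `orthogonalize` helper and the random `raw_update` matrix are my own assumptions, and the real optimizer approximates this step with a few Newton-Schulz iterations on the momentum rather than an SVD.

```python
import torch

def orthogonalize(update: torch.Tensor) -> torch.Tensor:
    """Return the polar factor U @ V^T of `update` (all singular values set to 1)."""
    # Exact SVD, used here for clarity only; not how the Muon library computes it.
    U, _, Vh = torch.linalg.svd(update, full_matrices=False)
    return U @ Vh

torch.manual_seed(0)
# A made-up "raw" update whose rows have wildly different scales.
raw_update = torch.randn(4, 6) * torch.tensor([[100.0], [1.0], [0.1], [0.01]])
normalized = orthogonalize(raw_update)

print(torch.linalg.svdvals(raw_update))   # widely spread singular values
print(torch.linalg.svdvals(normalized))   # all approximately 1
```

The point of the toy example is that the normalized update moves the layer by a comparable amount in every direction, no matter how badly scaled the raw gradient was.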
As the posts cover, the optimizer isn’t fully general: it was designed for the weight matrices of linear layers (and flattened convs), so in most cases you need to pair it with Adam for the remaining parameters.
You can install the library from GitHub:
pip install git+https://github.com/KellerJordan/Muon
import torch
from muon import Muon

# Muon handles the weight matrices (2-D and above); AdamW handles everything else
muon_params = [p for p in model.parameters() if p.ndim >= 2]
muon_param_ids = {id(p) for p in muon_params}
adamw_params = [p for p in model.parameters() if id(p) not in muon_param_ids]

# Create the optimizers
optimizers = [
    Muon(muon_params, lr=0.001),
    torch.optim.AdamW(adamw_params, lr=0.001),
]
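To see what a shape-based split actually does, here is a small sketch that sorts parameters by dimensionality, assuming the rule is "matrices go to Muon, everything else goes to AdamW". The toy model is hypothetical and exists only for illustration; nothing here needs the `muon` package installed.

```python
import torch
from torch import nn

# Hypothetical toy model, only for illustrating the split.
model = nn.Sequential(
    nn.Linear(8, 16),
    nn.LayerNorm(16),
    nn.Linear(16, 4),
)

muon_names, adamw_names = [], []
for name, p in model.named_parameters():
    # Matrices (ndim >= 2) go to Muon; biases and norm gains (ndim == 1) go to AdamW.
    (muon_names if p.ndim >= 2 else adamw_names).append(name)

print(muon_names)   # ['0.weight', '2.weight']
print(adamw_names)  # ['0.bias', '1.weight', '1.bias', '2.bias']
```

One caveat: a shape check alone also sweeps in embeddings and the output head, which are 2-D as well. My understanding of Keller's README is that those are better left to AdamW, so in a real model you would probably filter by name too.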
And step both optimizers in the training loop:
for opt in optimizers:
    opt.step()
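For context, here is roughly how that fits into a full training step. This is a sketch under a few assumptions: the model and data are made-up toys, and `torch.optim.SGD` stands in for `Muon` so the snippet runs without the library installed. The pattern of one backward pass followed by stepping and clearing every optimizer is the part that carries over.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))  # hypothetical toy model
initial = [p.detach().clone() for p in model.parameters()]

matrix_params = [p for p in model.parameters() if p.ndim >= 2]
other_params = [p for p in model.parameters() if p.ndim < 2]

optimizers = [
    torch.optim.SGD(matrix_params, lr=0.01, momentum=0.9),  # stand-in for Muon(matrix_params, ...)
    torch.optim.AdamW(other_params, lr=0.001),
]

x, y = torch.randn(32, 8), torch.randn(32, 1)  # random data, for illustration only
losses = []
for step in range(5):
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()        # one backward pass fills gradients for every parameter
    for opt in optimizers:
        opt.step()         # each optimizer updates only its own parameters
    for opt in optimizers:
        opt.zero_grad()    # clear gradients before the next step
    losses.append(loss.item())
```

Because the two parameter lists are disjoint, the optimizers never touch the same tensor, so the order in which they step does not matter.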
It’s great to have innovation in this area, particularly with this kind of fundamental reasoning around why it works!