https://arxiv.org/abs/2505.02809
Optimizers are consistently one of the great areas of ML for discovering whether you remember any linear algebra or not (I land on not). Given the pace of change, it’s somewhat surprising that Adam(W) has hung around for as long as it has. Adam updates each parameter by a moving average of the gradient, and a moving average of the squared gradient (the 2nd moment). Each weight is updated (and the moments/running averages are tracked) separately.
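A minimal numpy sketch makes the "everything is per-parameter" point concrete: every operation below is elementwise, so no parameter ever sees another parameter's statistics.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update. All ops are elementwise, so each parameter's
    moments are tracked and applied independently."""
    m = b1 * m + (1 - b1) * grad         # 1st moment: moving average of the gradient
    v = b2 * v + (1 - b2) * grad**2      # 2nd moment: moving average of the squared gradient
    m_hat = m / (1 - b1**t)              # bias correction for zero-initialized averages
    v_hat = v / (1 - b2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

Note the memory cost this implies: `m` and `v` are each the same size as the weights, which is the overhead later optimizers try to shrink.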
One area where we do have some alternative approaches is the Shampoo family of optimizers. They take a whole block of parameters (usually the weights of a layer), store moving averages of the second moments of all the gradients in that block, then use them to transform the update each step. This transform gives an approximation of the inverse Hessian of the block. The Hessian is the square matrix representing the second derivative of the loss: it tells you how the gradient slope changes, like a curvature map. This is expensive to calculate for the network as a whole, so Shampoo estimates it one layer at a time (ish – it’ll split up big layers), and rotates/transforms each parameter’s gradient update based on this curvature estimate.
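Here is a rough sketch of that idea for a single 2D weight block. This is the core Shampoo recipe only (real implementations add grafting, update intervals, and iterative inverse-root computation; the eigendecomposition-based `inv_root` here is a naive stand-in):

```python
import numpy as np

def inv_root(M, p, eps=1e-4):
    """M^(-1/p) via eigendecomposition. Fine for a sketch; production
    Shampoo uses iterative methods instead."""
    vals, vecs = np.linalg.eigh(M + eps * np.eye(M.shape[0]))
    return vecs @ np.diag(vals ** (-1.0 / p)) @ vecs.T

def shampoo_update(G, L, R, lr=1e-2):
    """One Shampoo-style step for an m x n weight block with gradient G.
    L and R accumulate second-moment statistics over the block's rows and
    columns; preconditioning with their inverse fourth roots approximates
    multiplying by the block's inverse Hessian."""
    L = L + G @ G.T                      # row statistics    (m x m)
    R = R + G.T @ G                      # column statistics (n x n)
    step = -lr * inv_root(L, 4) @ G @ inv_root(R, 4)
    return step, L, R
```

The key design point: `L` and `R` are m×m and n×n, so the stored statistics scale with the block's side lengths rather than with every pair of parameters in the network, which is only affordable because curvature between blocks is being ignored.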

Empirically, both of these approaches work because if you do look at the Hessian of the loss in a network, it’s basically block diagonal: almost all of the curvature lives within certain blocks, and very little of it sits between distant parameters.
The paper, Towards Quantifying the Hessian Structure of Neural Networks, looks into why that is. The paper is dense, but the main conclusion is that the block-diagonal pattern emerges as the number of output “classes” increases, with blocks corresponding to classes. Given that a class is a token in LLMs, modern LLMs are strongly likely to exhibit this structure (and layers end up somewhat aligned with class-wise blocks in most nets).
In the subsequent analysis, we will show that the number of classes C, instead of the CE loss, is one key factor. Specifically, the near-block-diagonal Hessian structure arises as C→∞ for both the MSE and the CE loss.
[…]
We emphasize that we do not claim “large C” as the only cause for the near-block-diagonal Hessian structure, but just that it is a sufficient condition.
Because output layers generally have a tight class association (e.g. one column per class), this structure propagates through the model during training. Ideally a block would be “all the weights that eventually feed a class”, and the paper shows that the training process (on a simple network) pushes different parts of tensors toward stronger associations with specific classes, so you get a kind of local version of the same block-diagonal structure.
This kind of understanding is helpful because it explains why per-parameter optimization still works (curvature is very localized), and it also points toward a direction for improving optimizer memory usage:
II. Understanding Hessian structure can help design new training methods for NNs. For instance, Adam-mini (Zhang et al., 2024b), a recently proposed optimizer, utilizes the near-block-diagonal Hessian structure to cut down 50% memory consumption in Adam. We believe the special Hessian structure can inspire more new optimizers.
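The memory saving the quote mentions can be sketched as follows. This is the idea only, not Adam-mini’s exact recipe (my assumption of how to simplify it): if curvature is roughly uniform within a Hessian block, you can keep a single second-moment scalar per block instead of one per parameter.

```python
import numpy as np

def block_adam_step(blocks, grads, ms, vs, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Adam-like step keeping ONE second-moment scalar per block instead
    of one per parameter -- the source of the memory saving. A sketch of
    the Adam-mini idea, not its exact implementation."""
    out = []
    for w, g, m, v in zip(blocks, grads, ms, vs):
        m = b1 * m + (1 - b1) * g                      # per-parameter 1st moment, as in Adam
        v = b2 * v + (1 - b2) * float(np.mean(g**2))   # one scalar for the whole block
        m_hat = m / (1 - b1**t)
        v_hat = v / (1 - b2**t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
        out.append((w, m, v))
    return out
```

Since `v` shrinks from a full weight-sized array to one float per block, roughly half of Adam’s optimizer state disappears, which matches the ~50% figure in the quote.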
Skimming this paper did make me wonder whether this would also apply to Muon. A new paper, Muon optimizer Accelerates Grokking, studies Muon in practice. Grigory Sapunov wrote up a great summary of the paper, which shows that Muon reaches an understanding of the underlying distributions earlier (as shown by earlier increases in eval-set performance).
The authors speculate about what exactly in Muon helps grokking: spectral norm constraints and second-order cues that steer the model away from simple memorization and toward discovering the true pattern.
Muon operates on 2D tensors only (and is usually mixed with Adam for the rest), and uses a transform called Newton-Schulz which keeps the directions from the gradient update but makes the singular values of the update equal, meaning the step moves the same effective distance in every direction. It also works one step at a time, rather than storing running second-moment statistics the way Shampoo does. This makes it operate a bit like a simplified Shampoo, but even more efficiently, so it again benefits from the fact that you can largely ignore the geometry outside the layer it’s looking at!
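A minimal sketch of that transform, using the classic cubic Newton-Schulz iteration (Muon’s actual implementation uses a tuned quintic polynomial, but the effect is the same):

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=20):
    """Approximate U @ V.T from the SVD G = U S V.T without computing an
    SVD. The iteration drives every singular value of the update toward 1,
    so the step keeps the gradient's directions but moves the same
    effective distance along each of them."""
    X = G / (np.linalg.norm(G) + 1e-8)   # scale so singular values are <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X  # cubic Newton-Schulz step
    return X
```

Note there is no per-layer state here at all: just a few matrix multiplies on this step’s (momentum-smoothed, in real Muon) gradient, which is why it is so much cheaper than maintaining and inverting Shampoo’s accumulated statistics.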

