https://nonint.com/2024/03/03/learned-structures/
As time has passed, I’ve internally converged on the understanding that there are only a few types of architectural tweaks that actually meaningfully impact performance across model scales. These tweaks seem to fall into one of two categories: modifications that improve numerical stability during training, and modifications that enhance the expressiveness of a model in learnable ways.
I think this is a really interesting post, mostly because the mental model it gives of architectures is helpful. The idea that a model learns “in stages” is intuitive, and it is easy to see in some architectures (like convnets!).
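To make the two categories from the quote concrete for myself, here is a rough sketch of how I read them. It is not taken from the post. The examples are my own assumptions: normalizing each layer's input stands in for a "numerical stability" tweak, and a learnable scalar gate on a residual branch stands in for an "expressiveness in learnable ways" tweak. `toy_layer`, `run_stack`, and `gated_residual` are hypothetical names, and the "layer" is just a fixed scaling rather than anything trained.

```python
import math

def layer_norm(x, eps=1e-5):
    """Rescale a vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_layer(x, scale=3.0):
    """Hypothetical stand-in for a layer's transform: a fixed elementwise scaling."""
    return [scale * v for v in x]

def run_stack(x, depth, normalize):
    """Apply `depth` residual layers, optionally normalizing each layer's input."""
    for _ in range(depth):
        inp = layer_norm(x) if normalize else x
        x = [a + b for a, b in zip(x, toy_layer(inp))]
    return x

def max_abs(x):
    """Largest magnitude in the vector, used as a crude measure of blow-up."""
    return max(abs(v) for v in x)

def gated_residual(x, branch_out, gate):
    """Mix a branch into the residual stream, scaled by a (learnable) scalar gate."""
    return [a + gate * b for a, b in zip(x, branch_out)]

x0 = [1.0, -2.0, 0.5, 3.0]

# Category 1 (assumed example): without normalization the activations grow
# geometrically with depth; with it, each layer adds a bounded amount.
plain = run_stack(x0, depth=20, normalize=False)
normed = run_stack(x0, depth=20, normalize=True)
print(max_abs(plain), max_abs(normed))

# Category 2 (assumed example): with gate = 0 the branch is ignored entirely,
# so training can decide how much of the extra structure to use.
print(gated_residual(x0, toy_layer(x0), gate=0.0))
```

The point of the first half is only the contrast in magnitudes between the two stacks, and the point of the second is that the gate lets the model smoothly opt in to the branch.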