Attention, Compression & Predicting the next token

Language modelling is one of the great ideas in ML: if you train a model to accurately predict the next word in a sequence of text[1], you are forcing it to learn a deep structure for human language. Because language is how we map reality, a model that learns that structure should be able to do many useful things. This turned out to be right!

The challenge with actually, you know, doing this is that text is messy. It’s sequential, variable length, and has structure, but the structure is kind of weird: the phrase “the cat, a mellow long-haired persian, sat on the mat” very clearly associates “sat” with “cat”, but the actual words are quite far apart[2].

Dealing with sequential, variable-length data in a fixed network is a bit of an inherent mismatch. In training you often know the sizes you’re dealing with, but at inference time they’re variable. One elegant solution to that was the Recurrent Neural Network (RNN): start at the beginning, read one word at a time, and keep a “hidden state” as a scratch pad to provide memory of what has come before.
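That read-one-word-at-a-time loop can be sketched in a few lines. This is a toy illustration with random weights and made-up dimensions, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: 4-dim word embeddings, 8-dim hidden state.
d_in, d_h = 4, 8
W_xh = rng.normal(scale=0.1, size=(d_h, d_in))  # input -> hidden
W_hh = rng.normal(scale=0.1, size=(d_h, d_h))   # hidden -> hidden

def rnn_step(h, x):
    # One recurrent update: mix the previous hidden state with the new word.
    return np.tanh(W_hh @ h + W_xh @ x)

# Read a variable-length sequence one word at a time; the hidden state
# is the only memory carried forward, however long the sequence is.
sequence = [rng.normal(size=d_in) for _ in range(5)]
h = np.zeros(d_h)
for x in sequence:
    h = rnn_step(h, x)
```

The same `rnn_step` handles a 5-word or 500-word sequence, which is what makes the recurrence such a natural fit for variable-length input.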

Training RNNs was painful, because now you have to backpropagate through every time step, and it was a minefield of vanishing and exploding gradients. The hidden state was also doing double duty: it was both the long-term memory of the whole sequence and the immediate key for predicting the next word.

Getting to Attention

The architecture that really addressed this was the LSTM: instead of a single memory it split short- and long-term memory and added activation functions to keep the gradient updates sane. It also made updating the memory a function of the input rather than of the weights alone, by adding learnable gates that let the model decide which parts of the input to remember, and what information in the memory to forget. This unlocked real sequence-to-sequence models, which proved immediately useful in areas like machine translation: one model reads a sequence and compresses it to a hidden state (the encoder), another generates new output based on it (the decoder).
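A minimal sketch of those gates, with made-up dimensions and random weights (real LSTMs also learn biases and a separate input projection):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6  # toy size; assume the input word is already projected to size d

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One parameter matrix per gate, acting on [h_prev, x] concatenated,
# so every gate's behaviour depends on the input, not just fixed weights.
W_f, W_i, W_o, W_c = (rng.normal(scale=0.1, size=(d, 2 * d)) for _ in range(4))

def lstm_step(h_prev, c_prev, x):
    z = np.concatenate([h_prev, x])
    f = sigmoid(W_f @ z)  # forget gate: what to erase from long-term memory
    i = sigmoid(W_i @ z)  # input gate: how much new information to write
    o = sigmoid(W_o @ z)  # output gate: what to expose as short-term state
    c = f * c_prev + i * np.tanh(W_c @ z)  # long-term (cell) memory
    h = o * np.tanh(c)                     # short-term (hidden) state
    return h, c

h, c = np.zeros(d), np.zeros(d)
for x in [rng.normal(size=d) for _ in range(5)]:
    h, c = lstm_step(h, c, x)
```

Note the split: `c` carries the long-term memory, `h` is the working state, and the tanh/sigmoid activations keep both in a bounded range, which is what tamed the gradients.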

This solved the training stability bottleneck, and introduced a new one: compression. The entire sequence got compressed to a single hidden state, which limited how much complexity could be captured.

Bahdanau et al. addressed that with the idea of attention in 2014. The hidden state gets updated in the encoder with each new word, so why not keep all the hidden states around? Then, have a small network score which hidden states are relevant to the current decoder state, and make a new contextualized input to the decoder that is a weighted sum of the encoder states. This was called “attention” as it allowed the model to put different amounts of focus on different parts of the input sequence.
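The score-then-weight step looks like this in miniature. The bilinear score here is a stand-in for Bahdanau et al.'s small feed-forward scorer, and all the matrices are random toys:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8
encoder_states = rng.normal(size=(5, d))  # one hidden state kept per input word
decoder_state = rng.normal(size=d)        # where the decoder currently is

# A small scoring network rates each encoder state's relevance
# to the current decoder state.
W_score = rng.normal(scale=0.1, size=(d, d))
scores = encoder_states @ (W_score @ decoder_state)

# Softmax turns raw scores into attention weights that sum to 1.
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# The contextualized input to the decoder is a weighted sum
# of *all* the encoder states, not one compressed summary.
context = weights @ encoder_states
```

The whole point is in the last line: instead of one hidden state carrying everything, the decoder gets a fresh blend of the encoder states at every step.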

The new bottleneck, though, was throughput: to generate hidden state n, you first needed hidden state n-1. That made it hard to parallelize, which made it hard to take advantage of emerging accelerators. Luong et al. first showed that you could simplify the state scoring to make it more hardware friendly, then Attention Is All You Need in 2017 stripped away the recurrent part entirely. In the Transformer architecture they got rid of the RNN and its hidden state, replacing them with another version of the attention mechanism: self-attention.

Rather than a stack of hidden states that progressively encode the state of the sequence, each incoming word is transformed at once into a contextualized representation that carries information about it and its surroundings. This was really parallelizable: you don’t need the result of previous time steps to compute the current one, so you can scale the computation on GPUs and other accelerators.

In regular attention you can think of the current decoder[3] state as a query, and the various encoder hidden states as keys: the scoring function generates a score for each query-key pair. In self-attention, all the tokens are projected through key and query networks, and the query for each token is compared to the key of all the others. The transformer also added a value projection: in the older attention the “key” from the hidden state was both “what makes a good match” and “what information the token provides”, but in the transformer the two were decoupled.
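Put together, that is scaled dot-product self-attention. Again with random toy matrices, and leaving out the causal mask and multiple heads a real transformer uses:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 5, 8
X = rng.normal(size=(n, d))  # one row per token

# Separate projections: query (what I'm looking for), key (what makes
# me a good match), value (what information I provide).
W_q, W_k, W_v = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Every token's query is compared against every token's key: an n x n matrix.
scores = Q @ K.T / np.sqrt(d)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# Each output row is a weighted mix of the *value* vectors of all tokens.
out = weights @ V
```

All n tokens are processed in one batch of matrix multiplies, which is exactly the parallelism the recurrent models couldn't offer.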

The new bottleneck that emerged was performance, particularly during inference. Comparing everything to everything else is an O(n²) operation. During training you can ameliorate some of that through batching, but you’re directly exposed at inference. And, unlike an RNN, increasing the sequence length (aka context length) gives you a quadratic increase in time, not linear.

There were various attempts at addressing this one too. In “Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention” back in 2020, Katharopoulos et al. showed that the quadratic cost of self-attention comes from having to materialize a big matrix to calculate the softmax for scoring. If you replace the softmax with a kernel feature map you can chunk the computation and get linear-time performance. This was mathematically elegant, but didn’t actually work very well, so more engineering-oriented approaches like KV caching and FlashAttention became the mainstays for tackling the bottleneck.
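The reassociation trick can be shown directly. Using an ELU+1 feature map in place of the softmax (one of the choices from the Katharopoulos et al. paper), the n x n score matrix never needs to exist (a non-causal sketch; the causal version builds the summary incrementally):

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 6, 4
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))

def phi(x):
    # A simple positive feature map (ELU + 1) standing in for the softmax kernel.
    return np.where(x > 0, x + 1.0, np.exp(x))

# Softmax attention materializes an n x n matrix:
#   softmax(Q K^T) V           -> O(n^2) time and memory.
# Linear attention reassociates the product:
#   phi(Q) (phi(K)^T V)        -> only a d x d summary is ever built.
S = phi(K).T @ V        # d x d summary of keys-and-values, built in one pass
z = phi(K).sum(axis=0)  # running normalizer
out = (phi(Q) @ S) / (phi(Q) @ z)[:, None]
```

Because `S` is d x d regardless of sequence length, cost grows linearly in n; the catch, as the post notes, is that quality suffered relative to real softmax attention.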

So why talk about this now? Because of Moonshot AI, and their excellent Kimi models. Moonshot are perhaps the frontier-est of the Chinese tiger labs, and their recent releases include Kimi Linear: An Expressive, Efficient Attention Architecture.

The architecture mixes regular self-attention layers with Kimi Delta Attention. And Kimi Delta Attention is just the latest in a thread of evolution which goes back (sorta!) to RNNs.

State space models

For a long time, folks modelled control systems using state-space models. These return both an output and a state, and have a linear update function. RNNs such as LSTMs weren’t strictly state-space models, in part because of their use of non-linearities: LSTMs used a tanh activation when updating the memory, for example. If you hand-wave a bit and ignore that, you’re looking at a very similar process.
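Schematically, in discrete time with toy sizes, the state update is purely linear and a separate matrix reads off the output:

```python
import numpy as np

rng = np.random.default_rng(5)
d_h, d_in = 4, 2

# A linear state-space update: no tanh in the loop. For stability the
# recurrence matrix A needs eigenvalues inside the unit circle; a diagonal
# with entries just below 1 is the simplest well-behaved choice.
A = np.diag([0.95, 0.9, 0.99, 0.8])
B = rng.normal(scale=0.1, size=(d_h, d_in))
C = rng.normal(scale=0.1, size=(1, d_h))

h = np.zeros(d_h)
ys = []
for x in [rng.normal(size=d_in) for _ in range(10)]:
    h = A @ h + B @ x   # state update: linear, unlike an RNN's tanh step
    ys.append(C @ h)    # output read-off
```

Squint and it's the RNN loop from earlier with the tanh deleted; the "well-behaved A" condition is doing the job the squashing activation used to do.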

But there is a gap between hand-waving and science, and luckily someone crossed it. The benefit of that activation function was that it squashed the state into a known range and avoided the vanishing gradient issue that plagued RNNs. The key realization was that you can drop the non-linearity entirely[4] as long as the weight matrix that multiplies the hidden state is well behaved (specifically, has eigenvalues close to, but less than, one).

Much of this is in the HiPPO and S4 papers, from Albert Gu, Chris Ré and Tri Dao. This was another neat idea, which included a clever bit of linear algebra, a technique called Diagonal-Plus-Low-Rank, to make the state updates relatively efficient, but it didn’t perform as well as regular transformer models. Gu and Dao identified the challenge as those well-behaved weights that update the hidden state: much like RNNs prior to the LSTM, they were adding a fixed amount of information from the input to the state. In Mamba they reused the same kind of trick: adding a small network to gate the updates so the model can learn to remember more, or less, depending on the specific input[5].

Then, in the Mamba 2 paper from 2024, Gu and Dao brought everything together. They showed that the 2020-style linear attention, with a decay mask, is the same as a structured state-space model like Mamba 1. That meant they could apply the same chunking tricks from linear attention and get much better scaling in training, while keeping Mamba’s ability to handle long sequences.

The slow recreation of LSTM features in more scalable forms continued with Gated DeltaNet. The Mamba approach ‘faded’ old memories via a decay, but it couldn’t explicitly subtract information like the LSTM forget gate. Gated DeltaNet also calculated the difference (the delta) between the expected and actual state, allowing it to effectively edit the memory rather than just overwriting it[6].
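The delta-rule edit can be sketched with a matrix-valued memory and fixed scalar gates. This is a schematic only: the real Gated DeltaNet learns input-dependent gates per token, and runs chunked for hardware efficiency:

```python
import numpy as np

rng = np.random.default_rng(6)
d = 4
S = np.zeros((d, d))     # associative memory mapping keys to values
alpha, beta = 0.95, 0.5  # decay ("fade") rate and write strength

for _ in range(8):
    k = rng.normal(size=d)
    k /= np.linalg.norm(k)        # unit-norm key
    v = rng.normal(size=d)        # value to associate with it
    v_hat = S @ k                 # what the memory currently predicts for k
    delta = v - v_hat             # the error: what's missing or wrong
    # Fade old content a little, then write only the *correction* --
    # editing the memory rather than overwriting it.
    S = alpha * S + beta * np.outer(delta, k)
```

If the memory already predicts `v` perfectly, `delta` is zero and nothing is written, which is the behaviour a pure decay-plus-overwrite update can't express.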

Kimi Linear sped this up, and improved the fading mechanism to be per-dimension rather than a single rate across the memory:

“Crucially, KDA parameterizes its transition dynamics with a specialized variant of the Diagonal-Plus-Low-Rank (DPLR) matrices [30, 71], enabling a bespoke chunkwise-parallel algorithm that substantially reduces computation relative to general DPLR formulations while remaining consistent with the classical delta rule. Kimi Linear interleaves KDA with periodic full attention layers in a uniform 3:1 ratio.”

They manage to kill two birds with one stone of linear algebra. The DPLR trick from S4 lets you take a diagonal vector for the update rate and apply it across the matrix product of a low-rank approximation for the state transition. Moonshot realized that you could replace the approximation with the K and V matrices directly, which is much more efficient, and that the diagonal can come from a vector of the same dimension, so you get per-channel forgetting.
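A schematic of the per-channel change, ignoring Moonshot's chunkwise-parallel algorithm entirely: the single scalar decay rate becomes a vector, one fade rate per memory dimension, applied as a diagonal matrix in a delta-rule update (a toy sketch with random inputs, not the exact KDA transition):

```python
import numpy as np

rng = np.random.default_rng(7)
d = 4
S = np.zeros((d, d))  # associative memory, as in the delta rule
beta = 0.5            # write strength (input-dependent in the real model)

for _ in range(8):
    k = rng.normal(size=d)
    k /= np.linalg.norm(k)
    v = rng.normal(size=d)
    # Per-channel decay: a vector of fade rates, one per memory dimension,
    # instead of one scalar rate for the whole state. In KDA these are
    # learned from the input; here they're just random placeholders.
    a = rng.uniform(0.8, 1.0, size=d)
    S = np.diag(a) @ S + beta * np.outer(v - S @ k, k)
```

Because `np.diag(a)` is diagonal, applying it costs the same as a scalar multiply per row, which is part of why the DPLR structure keeps the per-channel version cheap.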

Compression & Recall

It seems likely we will see more sophisticated mixing of different types of attention in models as labs continue improving architectures. We started with recursive models as a natural expression of the problem, moved to transformers for scale, and have been slowly integrating the two expressions together. We are still just trying to predict the next word, but it turns out the best way to do it is to remember some things, forget most things, and accept that the map is not the territory.

Reading through the papers on this journey really highlighted how the field moves between compression and breadth of recall. Sometimes researchers get a bad rap from their engineering brethren for being disconnected from reality, but this chain of evolutions is a pragmatic one.

You want to get as much intelligence into the model as possible. That’s done by compressing the training data into efficient, useful and general representations, but finding those representations is hard! If you hit a limit in finding them, then one approach is to simply add more knowledge: add more parameters, consider more training data, and build more of the imperfect representations to give the model more options to choose from.

MoEs, synthetic data, and various other aspects of modern model training are playing with this same trade-off: represent better or represent more. After his recent Hot Chips talk, Noam Shazeer was asked how we can find more efficient ways of encoding knowledge into parameters, closer to how the brain does it. He responded first by asking the questioner: “why are you limited on parameters?”

  1. The idea dates back to Jeff Elman, I think, who showed that training a network on this objective caused the network to learn grammar categories and other features of English. ↩︎
  2. This kind of thing is even hard for humans at sufficient lengths of text: there is a version of War & Peace in English that is largely the original (translated, natch), but normalizes all the character names as they were such a common point of confusion ↩︎
  3. In the original paper they kept the same encoder/decoder set up as with earlier models, as it’s eminently sensible for translation tasks. The GPT models and others demonstrated you could go decoder-only effectively. What we tend to call “prefill” these days is effectively a (causal) encoder step within the decoder model that contextualizes the input, then the “decoder” is the autoregressive generation process after. ↩︎
  4. There actually still is non-linearity, as you need it for neural networks in general, but rather than applying it inside the recurrent memory update, it happens in projection MLPs after the layer. Then in Mamba it moved into the gating, so it’s only dependent on the input, not the h_{t-1} state! ↩︎
  5. And it was Orvieto and the DeepMind folks that showed that you can get the same results in an RNN without the non-linearities if you can set up the matrix right. ↩︎
  6. Part of the motivation here was recall, which Jamba addressed. Because the RNN approach is inherently compression based, it was harder to just cut and paste sections of the context when they were relevant. Jamba mixed regular attention layers with Mamba layers, giving back the global context while still providing better scaling. The specific recall problem is really emphasized by the fact that one of the standard long-context evals is the “needle in a haystack” task, where a relevant fact is hidden in a long doc and needs to be pulled out. ↩︎
