In the heady world of AI progress, context lengths have seen somewhat more languid growth. After rapid progress up to the 100-300k token range, they’ve largely stayed there for frontier models. We now have a couple of 1m-token models that appear economically viable1, with Gemini and Sonnet, but Opus 4.5 (for example) is stuck with the 200k window of its predecessor.
The fundamental challenge with long contexts is the interaction between tokens, particularly in the prefill (prompt processing) phase, where you have to compute those interactions for a whole lot of tokens at once before you can generate anything.
For each token, attention calculates:
1) The key: when to use this token
2) The value: what information this token contributes
3) The query: what this token is looking for
Each2 token’s query is compared against the other tokens’ keys to produce similarity scores; after a softmax, the resulting weights mix those tokens’ values.
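The steps above can be sketched in a few lines of NumPy. This is a toy single-head version with made-up dimensions, not any real model’s attention, but the key/value/query mechanics are the same:

```python
import numpy as np

def causal_attention(x, Wq, Wk, Wv):
    """Single-head causal self-attention over a sequence of token embeddings x."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv           # one query/key/value per token
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)               # compare each query to every key
    mask = np.triu(np.ones_like(scores), k=1)   # hide future tokens (causal mask)
    scores = np.where(mask == 1, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over prior tokens
    return weights @ V                          # weighted mix of values

rng = np.random.default_rng(0)
seq, d_model, d_head = 5, 8, 4
x = rng.standard_normal((seq, d_model))
Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) for _ in range(3))
out = causal_attention(x, Wq, Wk, Wv)
print(out.shape)  # (5, 4)
```

Note that because of the causal mask, the first token can only attend to itself, so its output is just its own value vector.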
Then, during decoding, you repeat this calculation for every new token. The 500th token brings a new Key, Value and Query, but the 1st token’s Key and Value are exactly the same as they were on the previous step.
It turns out you can save yourself a lot of work by just keeping around the Keys and Values from the previous generation steps and loading them in for the prior tokens. Then you only have to compute them for the newly added token. This happens at every layer in the model, so it’s a significant amount of computation saved.
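Here’s a minimal sketch of that decode loop, again with a toy single-layer, single-head setup (all names and dimensions are illustrative). Each step computes K/V only for the newest token and appends them to the cache, so the per-step work over the context is one query against all keys rather than recomputing everything:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 8, 4
Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) for _ in range(3))

cache_K = np.empty((0, d_head))   # grows by one row per generated token
cache_V = np.empty((0, d_head))

def decode_step(x_new, cache_K, cache_V):
    """Attend the newest token over all cached keys/values plus its own."""
    q, k, v = x_new @ Wq, x_new @ Wk, x_new @ Wv   # computed for the new token only
    K = np.vstack([cache_K, k[None]])
    V = np.vstack([cache_V, v[None]])
    scores = (q @ K.T) / np.sqrt(d_head)           # one query vs. all keys
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V, K, V                             # output plus the updated cache

for _ in range(5):                                 # simulate five decode steps
    x_new = rng.standard_normal(d_model)
    out, cache_K, cache_V = decode_step(x_new, cache_K, cache_V)

print(cache_K.shape)  # (5, 4)
```

In a real model this cache exists per layer (and per KV head), which is where the memory cost comes from.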
Of course, you have to stick that cached copy somewhere. Because it’s used in each round of generation it needs to be rapidly available, to avoid adding a bunch of latency. In practice that means it has to live in high-bandwidth memory (HBM), which is a scarce resource. So the longer the context, the bigger the cache, and the more memory you need to hold it.
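As a back-of-envelope sketch of why this bites: the cache grows linearly with context length, once per layer, per KV head, per token, times two for keys and values. The model dimensions below are illustrative assumptions, not any particular model’s real architecture:

```python
# Back-of-envelope KV-cache sizing. All model numbers here are illustrative
# assumptions, not any particular model's actual architecture.
layers      = 80         # transformer layers
kv_heads    = 8          # key/value heads (assuming grouped-query attention)
head_dim    = 128
bytes_per   = 2          # fp16/bf16
context_len = 200_000    # tokens

# 2x for keys AND values, per layer, per KV head, per token.
kv_bytes = 2 * layers * kv_heads * head_dim * bytes_per * context_len
print(f"{kv_bytes / 2**30:.1f} GiB per sequence")  # 61.0 GiB per sequence
```

Even with these fairly conservative toy numbers, a single 200k-token sequence eats a large slice of one accelerator’s HBM, before you count the model weights themselves.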
Larger context windows have been unlocked in large part by more memory on the card, and in somewhat smaller part by more rapid scale-out interconnect like NVLink3.
Meanwhile, here’s Uncle Jensen at CES, via Stratechery‘s excellent analysis of the announcements:
this context memory, which started out fitting inside an HBM, is no longer large enough. Last year we created Grace Blackwell’s very fast memory, we call fast context memory. That’s the reason why we connected Grace directly to Hopper. That’s why we connected Grace directly to Blackwell, so that we can expand the context memory. But even that is not enough, and so the next solution, of course, is to go off onto the network, the north-south network, off to the storage of the company. But if you have a whole lot of AIs running at the same time, that network is no longer going to be fast enough. So the answer is very clearly to do it different, and so we created Bluefield-4 so that we could essentially have a very fast KV cache context memory store right in the rack.
It’s quite possible this kind of in-rack memory will unlock significantly larger context windows. I do wonder what this will mean for actually using long-context models. Dealing with multiple millions of tokens of context is still going to take a fair bit of time to process. For the kind of interactive use cases that have worked best with LLMs (Claude Code, Computer Use, Cowork etc.) I suspect latency will be a bit of a pain point.
What is kind of interesting is that all the providers at this point have some form of prompt caching option. Most of the time with a KV cache you build it up as you go, but in some cases you will generate the exact same cache in multiple different sessions. A good example is a long system prompt: you can generate the KV cache for it once, stick it in slower memory4 and then load it into HBM for each new session. This can save a bunch of compute, and is very practical for a lot of use cases.
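A toy sketch of the idea, with a hypothetical `prefill` function standing in for a real model’s prompt-processing step: compute the prefix cache once, park it on disk (standing in for slower memory tiers), and reload it for a new session instead of recomputing:

```python
import os
import tempfile
import numpy as np

# Hypothetical toy setup: single layer, single head. `prefill` stands in for
# a real model's prompt-processing (prefill) phase.
rng = np.random.default_rng(0)
d_model, d_head = 8, 4
Wk = rng.standard_normal((d_model, d_head))
Wv = rng.standard_normal((d_model, d_head))

def prefill(tokens):
    """Compute keys/values for a whole prompt at once (toy version)."""
    return tokens @ Wk, tokens @ Wv

system_prompt = rng.standard_normal((100, d_model))  # stands in for a long system prompt
K, V = prefill(system_prompt)

path = os.path.join(tempfile.mkdtemp(), "prefix_cache.npz")
np.savez(path, K=K, V=V)                 # park the cache on cheaper/slower storage

loaded = np.load(path)                   # new session: load instead of recompute
print("reused cached prefix of", loaded["K"].shape[0], "tokens")
```

To be clear, this is just the shape of the idea; real provider implementations have to handle exact prefix matching, per-layer caches, and eviction, none of which is shown here.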
One interesting thing this might enable is “whole codebase” type queries: the vast majority of assets (e.g. code) in a given work session won’t change, so you could cache the KV for everything and have it in context for later use.
I’m hopeful that as Blackwell, TPUv7 and MI450 come online we will see context lengths unstick and move up, and perhaps with Vera Rubin we will really get rid of “compacting” for some practical set of cases.
- So many asterisks should go here after this flagrant assertion ↩︎
- Technically in most cases this is between each token’s Query and the Keys of the tokens before it, thanks to causal masking ↩︎
- You have to do some work to distribute things of course, but if your model is multi-card anyway, then you can distribute the KV cache fairly easily. TPUs have chonky scale-out bandwidth, probably one of the reasons Google was able to offer 1M first. ↩︎
- For clarity, this might not actually be how it’s implemented at Throppy/Google/OAI, they might just keep it in HBM anyway. But it feels like you could do that? ↩︎