Megacores

Megacore - Systole as an '80s metal album cover.

What we do in machine learning owes a lot to the history of computer graphics. Folks like Kurt Akeley, one of the founders of SGI, identified that 3D graphics has a naturally pipelined structure. You have a high volume of similar operations, such as applying pixel-y soldier textures to a mesh of triangles, and by pipelining them you open up the opportunity for a high degree of parallelism.

Akeley was one of the drivers of OpenGL, which provided a standard interface to that pipeline, and later worked with Nvidia on Cg, a realtime shading language and compiler. Shading languages, as used in Pixar’s RenderMan and other non-realtime 3D use cases, introduced an approach where you could manage lighting programmatically by describing the transforms applied to each individual element. The shader would be run in parallel across all the geometry or pixels it was addressing.

With CUDA, Ian Buck and others at Nvidia helped formalize what had been true in the hardware for a while: GPUs were massively parallel processing machines, not just polygon factories. CUDA was part of a move from the supercomputer approach of Single Instruction Multiple Data (SIMD) to Single Instruction Multiple Thread (SIMT). On a Cray or other vector-oriented processor you had to pack the work into a vector. CUDA let programmers familiar with CPU threads think in those terms instead. Under the hood, the threads in a warp were executed in lockstep, but they could be masked off to allow for divergence. It was flexible, fast, and attracted the attention of the machine learning community. Because so much of ML is large matmuls, Nvidia bolted on Tensor Cores as specialized co-processors that handled blocks of matrix math efficiently. This combination of performant hardware and flexible software helped make Nvidia the most valuable company in the world, and drive up house prices across the Bay Area.
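That lockstep-plus-mask trick is easier to see in a toy model. Here's an illustrative sketch in plain Python (the "warp" and its predicate mask are simulated, not Nvidia's actual microarchitecture): on a branch, both sides execute for the whole warp, and the mask decides which lanes commit each side.

```python
# Toy SIMT sketch: a "warp" of 8 lanes runs in lockstep. On a branch,
# BOTH sides execute; a predicate mask decides which lanes commit results.
WARP_SIZE = 8

def warp_execute(lanes):
    mask = [x % 2 == 0 for x in lanes]   # predicate: which lanes take the 'if'
    out = [None] * len(lanes)
    # The 'if' side runs for the whole warp; masked-off lanes do nothing.
    for i, active in enumerate(mask):
        if active:
            out[i] = lanes[i] * 2
    # Then the 'else' side runs, with the mask flipped.
    for i, active in enumerate(mask):
        if not active:
            out[i] = lanes[i] + 1
    return out

print(warp_execute(list(range(WARP_SIZE))))  # divergent branch, merged results
```

The cost is visible too: the divergent warp takes two passes instead of one, which is why branch-heavy code hurts on GPUs.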

But, it transpires, not everyone loved shoveling their margin to Jensen, and they looked for more cost-efficient ways to run ML workloads. The flexibility for threads to branch, pause or switch requires infrastructure and silicon. You need big register files per core, multiple levels of cache, and logic to manage swapping warps in and out.

If you look at the “do the math” parts of a chip, a CPU probably only spends about 10% of silicon on that, with the rest managing the chaos of running an operating system: branch prediction, caching, data movement. A GPU, in contrast, is a wildly efficient machine, with maybe 30-40% of the silicon dedicated to mathing effectively.

When Google looked at the problem of running inference at their scale back in the dark ages of 2016, they wanted to spend as much of their budget as possible doing the math, to keep costs as low as they could. The chip they created, the Tensor Processing Unit (TPU), recently hit its 7th iteration, and SemiAnalysis published an extensive breakdown on it: TPU v7 Ironwood, quickly followed by a deep dive into Amazon’s Trainium v3.

Trainium3 takes a similar approach to Trainium2 and Google’s TPU and builds the chip out of a small number of large NeuronCores. This contrasts with GPU architectures like Nvidia’s and AMD’s, which instead use a large number of smaller tensor cores. Large cores are typically better for GenAI workloads since they have less control overhead.

Dylan and his team are touting these as the first chips to genuinely threaten Nvidia’s moat. The big frontier labs seem interested, with deals and investigation from Anthropic, OpenAI, Meta and others. As the piece repeatedly points out, if you want to understand the dominance of Nvidia you have to focus on the system, and not the microarchitecture. So, of course, I want to talk exclusively about the microarchitecture here.

TPU and Trainium, as well as other custom approaches like Meta’s MTIA1, lean on an approach called systolic arrays. As a recap, Nvidia’s Streaming Multiprocessors (SMs), AMD’s Compute Units, and so on are cooperative multiprocessors. They access registers, talk to caches and handle the flow of data. If a thread’s data isn’t ready, the hardware warp schedulers will swap in another piece of work to keep the chip humming.

Systolic arrays are different. The name comes from systole, the phase where your heart pumps blood. In a systolic array, you load your data once and fire it through a grid of Processing Elements (PEs). Each element maths its math then passes the result to its neighbor on the next clock tick.
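The tick-by-tick pumping is easier to see simulated. Here's an illustrative sketch of a weight-stationary systolic matmul (not any specific TPU design): weights sit in the PEs, activations arrive skewed so that PE (i, j) sees the right operand at the right tick, and each PE does one multiply-accumulate per tick.

```python
# Minimal weight-stationary systolic-array sketch (illustrative only):
# weights W are pre-loaded into an n x n grid of processing elements (PEs).
# Each clock tick, every PE does one multiply-accumulate and passes data on.
def systolic_matmul(A, W):
    n = len(A)
    C = [[0] * n for _ in range(n)]
    # At tick t, PE (i, j) sees activation A[i][k] where k = t - i - j:
    # the skewed arrival pattern of the staggered dataflow.
    total_ticks = 3 * n - 2   # last operand reaches the far corner at t = 3n-3
    for t in range(total_ticks):
        for i in range(n):
            for j in range(n):
                k = t - i - j
                if 0 <= k < n:
                    C[i][j] += A[i][k] * W[k][j]   # one MAC per PE per tick
    return C

print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Note there is no scheduler, no cache, no fallback anywhere in that loop: if an `A[i][k]` weren't ready on its tick, the whole grid would just stall, which is exactly the failure mode described below.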

This was very much in line with the needs of the original TPU: load a set of model weights up, then pump user requests through as efficiently as possible. TPUv1 only supported int8: it was a low-bit, high-efficiency matmul machine. The data flow needed to be pre-determined: you set it up and make it go, which made it incredibly silicon efficient. You don’t need lots of caches or schedulers, and in fact the original TPU didn’t have any at all!

The con, of course, is that you have to get it right! If the data isn’t there to pump in, the whole thing just waits. There is no backup plan of swapping to another warp, no other threads. Not only that, but because systolic arrays are generally a lot bigger (say 256×256 vs the Tensor Cores’ 16×16), you have fewer of them. While an Nvidia GPU might have more than 100 SMs, a Trainium v3 has 8 cores, and a TPU has just 2. Each core is a lot larger, and wasting it gets a lot more expensive.

Presumably Jeff Dean just programmed these right the first time, but for the rest of Google (and later the world) they spent years building XLA (Accelerated Linear Algebra), a full-graph compiler. In GPU kernel programming the challenge is hiding memory latency and managing register pressure. On a TPU-style approach, there is one massive VMEM that fulfills a role similar to the registers, and no memory hierarchy, but you can’t rely on the hardware to swap between jobs. XLA needs to know exactly how the graph works so that it can schedule the right data at the right time.

TPUs used a VLIW architecture: Very Long Instruction Words. Rather than a traditional instruction set with diverse instructions, VLIW lets you bundle Very Long packages of instructions into single units (kind of a silicon equivalent of German) which execute operations on each of the different units of the core at the same time. This was introduced in TPU v2, and it’s where the pressure on the compiler really multiplied.

To draw a GPU analogy, if you think about something like Relu(A×B+C) you have a graph of operations: A×B -> Result, Result + C -> Result2, Relu(Result2). To optimize that you could use a CUDA graph to compile it into a single dispatch and cut down on CPU/GPU communication. One step further would be kernel fusion: keep all the intermediate results in registers and write one kernel that avoids the back and forth to higher-tier memory. That lets you bundle up even more, but you have to have even higher confidence in the sizes involved to avoid running out of registers.
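A rough sketch of that difference, with plain Python standing in for kernels and local variables standing in for registers (names here are illustrative): the unfused version materializes each intermediate, the fused version computes each output element in one pass.

```python
# Unfused vs fused Relu(A@B + C), sketched in plain Python.
def matmul(A, B):
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def relu_unfused(A, B, C):
    t1 = matmul(A, B)                                   # intermediate 1 "written out"
    t2 = [[t1[i][j] + C[i][j] for j in range(len(C[0]))]
          for i in range(len(C))]                       # intermediate 2 "written out"
    return [[max(0, x) for x in row] for row in t2]     # third pass over the data

def relu_fused(A, B, C):
    # One pass: multiply, add bias, clamp, all before "writing back".
    n, m, p = len(A), len(B), len(B[0])
    return [[max(0, sum(A[i][k] * B[k][j] for k in range(m)) + C[i][j])
             for j in range(p)] for i in range(n)]

A = [[1, -2], [3, 4]]
B = [[2, 0], [1, 1]]
C = [[0, 1], [-100, 0]]
print(relu_fused(A, B, C) == relu_unfused(A, B, C))   # same result, fewer round trips
```

Same answer either way; the win is purely in how many times the intermediates touch memory.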

VLIW is like parallel kernel fusion: a TPU v2 had 2 matrix units, 2 vector units, 2 scalar units and 2 memory load/store units2. To keep them all busy every step, the compiler needs to plan far enough ahead to give each of them something useful to do. VLIW instructions bundle those ops, along with any constants needed, into a single instruction. Fusion goes from being an optimization to being a necessity. Once you get it right, though, you can spend more like 50-60% of your silicon on the part you care most about, and that translates into an excellent total cost of ownership.
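A toy model of the bundling idea (made-up op and unit names, not the real TPU ISA): each instruction is one bundle with a slot per functional unit, and since there is no hardware scheduler, the compiler must fill every slot statically, inserting explicit no-ops where a unit has nothing useful to do.

```python
# Toy VLIW sketch: one bundle per clock tick, one slot per functional unit.
UNITS = ("matrix", "vector", "scalar", "memory")

def execute(program):
    # Record what each unit did on each tick; empty slots become no-ops.
    trace = {u: [] for u in UNITS}
    for bundle in program:                 # one bundle == one clock tick
        for unit in UNITS:                 # all slots fire in the same tick
            trace[unit].append(bundle.get(unit, "nop"))
    return trace

program = [
    {"matrix": "matmul A,B", "memory": "load C"},    # tick 0
    {"vector": "add acc,C", "scalar": "inc i"},      # tick 1
    {"vector": "relu acc", "memory": "store out"},   # tick 2
]
trace = execute(program)
print(trace["vector"])   # ['nop', 'add acc,C', 'relu acc']
```

Every `"nop"` in the trace is a unit sitting idle for a tick, which is exactly the waste the compiler is trying to schedule away.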

Does this mean we should all be cancelling our Rubin orders and buying TPUs? I mean, no. But there is some nuance. Choosing between flexible streaming processors or efficient systolic megacores feels drastic, but I think it might not matter quite as much as it seems.

Research still overwhelmingly benefits from flexibility. You are running experiments, solving bottlenecks and debugging. Nvidia tends to be the big lab tool of choice thanks to the flexibility, the depth of tooling and the general CUDA ecosystem3.

If you are mainly serving a massive model, it’s worth the investment to lock down all the weirdness and optimize it. That’s where the megacore chips have proved their mettle first, with TPU, Inferentia4, MTIA and others all starting on that side of the house.

Folks like Akeley and Buck realized that when you’re building a chip you’re really building a programming model. Get that right, and the model can long outlast the hardware. Balancing expressivity with performance is the thing that lets a platform win: who best lets researchers and engineers define the future without fighting the silicon.

What seems to be emerging isn’t quite the SIMT/CUDA architecture: it’s something around expressing the dataflow of tiles in the critical kernels5, while relying on a compiler to optimize the larger graph and compute.

Making sure that you have access to the right software might be more important than trying to perfectly identify which hardware platform is the once and future king. But also, look, the world moves fast and if you get a Prime Day deal on Trainium instances, you should probably just take it. The hardware can and will change and it can always be adopted, as the frontier labs are showing. If we keep hunting for the expressivity we need, as OpenGL, CUDA, Triton and others have over the years, we will keep unlocking the possibilities in whatever hardware is available.

  1. Disclosure: I work at Meta and like these chips a lot, though no one would let me anywhere near any chip design, luckily enough ↩︎
  2. Newer versions have others too, like the sparse cores in TPU v6 and v7 which are basically dedicated embedding management processors ↩︎
  3. With the notable exception of Google themselves, though the Jax-XLA-TPU ecosystem is very rich internally ↩︎
  4. Amazon remain undefeated at naming things ↩︎
  5. From system to VMEM on megacore approaches, from SMEM to registers on GPUs ↩︎
