Training large models today is tightly coupled to specific hardware. Moving workloads across systems, or abstracting the hardware away, is nearly impossible without losing efficiency, which is why you don’t see much uptake of the kind of cloud-like abstractions common elsewhere.
- Gang semantics: Large models rely on precise scheduling of memory, networking, and compute to achieve high utilization. These scheduling decisions tend to be model- and hardware-specific, and are hard to abstract.
- Compute efficiency: Compute ops like GEMMs are less exposed to model quirks and are already more abstractable, but a lot of custom work still goes into number-format support and shape optimizations for specific models.
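To make the number-format point a bit more concrete, here's a tiny sketch (the sizes and the use of fp16 rounding as a stand-in for a low-precision format are my own illustrative choices): rounding GEMM inputs to a narrower format measurably shifts the result, and how much depends on the shapes and value distributions involved, which is part of why this work stays so model-specific.

```python
import numpy as np

# Illustrative sketch: simulate a lower-precision GEMM by rounding the
# inputs to float16, then computing in float32. This isolates the input
# representation error from accumulation effects.
rng = np.random.default_rng(0)
a = rng.standard_normal((256, 256)).astype(np.float32)
b = rng.standard_normal((256, 256)).astype(np.float32)

ref = a @ b  # float32 reference result

aq = a.astype(np.float16).astype(np.float32)
bq = b.astype(np.float16).astype(np.float32)
approx = aq @ bq

# Relative error introduced purely by rounding the inputs.
rel_err = np.abs(approx - ref).max() / np.abs(ref).max()
print(f"max relative error after fp16 input rounding: {rel_err:.2e}")
```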
My entirely unfounded prediction is that this stuff is getting easier, and we will see more standardization over the next few years.
- Slowdown in Number Formats: Research like “Scaling Laws for Precision” (https://arxiv.org/pdf/2411.04330) shows a tradeoff between precision and parameter count. There’s a lot of folk knowledge in getting formats like FP8 stable, and it’s not totally clear how much FP4/MXFP4 and their ilk will add: my guess is less, and that they will be used in more targeted (and perhaps predictable) ways. Either way, I expect things to get less choppy and more predictable on the compute side, eventually.
- Parameter Stabilization: Model size growth may well plateau, either for fundamental reasons (e.g. if 2-4tn parameters turns out to be enough model capacity for all the data) or to align better with practical cluster sizes for scale-out networking (e.g., 72 GPUs with Blackwell). Whether this applies to a whole model or to a set of experts in an MoE I don’t know, and it feels like there is room for variants of MoE routing architectures if that pattern proves particularly successful.
- Shift To Test-Time: As training stabilizes, focus will shift to test-time compute, where the pain points are handling longer sequences and optimizing KV caches, which feel like more general, repeatable problems. I see this in part as a chunk of the pool of “large job expertise” moving from a pretraining focus to an inference focus, which in turn opens the door for tools and standards to help scale the pretraining side.
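To put rough numbers behind the precision and parameter-count bullets above (all figures here are my own back-of-envelope assumptions, not anything from the paper): weight memory scales linearly with bit width, and a 72-GPU domain puts a concrete ceiling on how much of it lands on each device.

```python
# Back-of-envelope: weight memory for a large model at different number
# formats, and the per-GPU share across a 72-GPU scale-out domain.
# Counts weights only, not optimizer state or activations.
PARAMS = 4e12  # upper end of the 2-4tn range mentioned above (assumption)
GPUS = 72      # e.g. one Blackwell-generation 72-GPU domain

for name, bits in [("FP16", 16), ("FP8", 8), ("FP4", 4)]:
    total_bytes = PARAMS * bits / 8
    total_tib = total_bytes / 2**40
    per_gpu_gib = total_bytes / GPUS / 2**30
    print(f"{name}: {total_tib:.2f} TiB total, {per_gpu_gib:.0f} GiB per GPU")
```

Nothing rigorous, but it shows why format choice and domain size interact: halving the bit width halves the per-GPU footprint just as directly as doubling the GPU count does.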
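And on the test-time side, the KV-cache pain point falls out of the standard sizing formula. The model shape below is an illustrative guess of mine (roughly a large dense model with grouped-query attention), not a reference to any specific system:

```python
# KV cache size: keys and values stored per layer, per kv-head, per token.
# Shape parameters are illustrative assumptions, not from the post.
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch,
                   bytes_per_elem=2):
    # 2x for keys and values; bytes_per_elem=2 assumes fp16/bf16 cache.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

gib = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128,
                     seq_len=128_000, batch=1) / 2**30
print(f"{gib:.1f} GiB of KV cache for a single 128k-token sequence")
```

Tens of GiB for one long sequence, before batching, is why cache quantization, paging, and eviction feel like the repeatable problems worth standardizing.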