https://arxiv.org/html/2311.08105v3
Interesting paper from Google doing data-parallel training with an inner optimizer, plus a central node that periodically collects each worker's parameter delta (the "outer gradient"), applies an outer optimization step, and sends the updated parameters back out. Somewhat federated-learning-like, but it avoids frequent full syncs across a larger cluster. Important ideas as scaling starts to exceed network capabilities.
From the abstract: standard approaches to training LLMs require a large number of tightly interconnected accelerators, with devices exchanging gradients and other intermediate states at each optimization step. While it is difficult to build and maintain a single computing cluster hosting many accelerators, it might be easier to find several computing clusters each hosting a smaller number of devices. In this work, we propose a distributed optimization algorithm, Distributed Low-Communication (DiLoCo), that enables training of language models on islands of devices that are poorly connected. The approach is a variant of federated averaging, where the number of inner steps is large, the inner optimizer is AdamW, and the outer optimizer is Nesterov momentum.
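A minimal sketch of the inner/outer structure, on a toy quadratic objective rather than an LLM. The hyperparameters and the plain-SGD inner optimizer are illustrative assumptions (the paper uses AdamW inside and Nesterov momentum outside); only the loop shape follows the abstract: each island runs many local steps from the same starting point, then a central step averages the parameter deltas ("outer gradients") and applies momentum.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_workers, H, outer_rounds = 4, 3, 20, 20
outer_lr, momentum = 0.7, 0.9        # illustrative outer hyperparameters

# Toy per-worker objective: f_k(x) = 0.5 * ||x - c_k||^2,
# standing in for each island's local data shard.
centers = rng.normal(size=(n_workers, dim))

def inner_opt(x, c, steps, lr=0.1):
    """Stand-in for the inner optimizer (the paper uses AdamW)."""
    for _ in range(steps):
        x = x - lr * (x - c)         # gradient of 0.5 * ||x - c||^2
    return x

x = np.zeros(dim)                    # shared parameters, broadcast each round
velocity = np.zeros(dim)             # outer Nesterov momentum buffer

for _ in range(outer_rounds):
    # Each island trains locally for H steps; only the resulting
    # parameter delta is communicated, not per-step gradients.
    deltas = [x - inner_opt(x.copy(), centers[k], H) for k in range(n_workers)]
    outer_grad = np.mean(deltas, axis=0)   # averaged "outer gradient"
    # Nesterov-momentum outer update on the shared parameters.
    velocity = momentum * velocity + outer_grad
    x = x - outer_lr * (momentum * velocity + outer_grad)

# x should approach the minimizer of the averaged objective,
# i.e. the mean of the workers' optima.
print(np.allclose(x, centers.mean(axis=0), atol=1e-2))
```

The point of the structure is the communication pattern: with H inner steps per round, synchronization happens H times less often than in standard data-parallel training, which is what makes poorly connected islands workable.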