Decoupled DiLoCo is not only more fault-tolerant, it is also practical for fully distributed, production-level pre-training. We successfully trained a 12-billion-parameter model across four separate regions of the United States using a 2-5 Gbps wide-area network (a level that is relatively achievable with existing internet connectivity between data center facilities, rather than requiring new, custom inter-facility network infrastructure). Notably, the system reached this training result more than 20 times faster than conventional synchronization methods. This is because our system folds the required communication into longer periods of computation, avoiding “blocking” bottlenecks where one part of the system has to wait for another.
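To make the bandwidth argument concrete, here is a rough back-of-envelope sketch. It is not the actual system: only the 12-billion-parameter size and the 2-5 Gbps range come from the text above. The 2 bytes per parameter, the idea of shipping one full parameter copy per synchronization, the 300-second compute window, and the function names are all assumptions chosen for illustration.

```python
# Back-of-envelope model of blocking vs. overlapped communication.
# Assumptions (not from the article): 2 bytes per parameter, one full
# parameter copy sent per synchronization, a 300 s compute window.

def transfer_seconds(num_params: float, bytes_per_param: int, link_gbps: float) -> float:
    """Seconds needed to send one full copy of the parameters over a link."""
    bits = num_params * bytes_per_param * 8
    return bits / (link_gbps * 1e9)


def blocking_fraction(transfer_s: float, compute_s: float, overlapped: bool) -> float:
    """Fraction of wall-clock time spent stalled on the network.

    Without overlap, every transfer is pure waiting. With overlap, the
    transfer runs during the compute window and only the part that does
    not fit inside that window stalls the accelerators.
    """
    stall = max(0.0, transfer_s - compute_s) if overlapped else transfer_s
    return stall / (compute_s + stall)


if __name__ == "__main__":
    NUM_PARAMS = 12e9          # from the article
    BYTES_PER_PARAM = 2        # assumption (e.g. a 16-bit format)
    COMPUTE_WINDOW_S = 300.0   # hypothetical compute time between syncs

    for gbps in (2.0, 5.0):    # bandwidth range from the article
        t = transfer_seconds(NUM_PARAMS, BYTES_PER_PARAM, gbps)
        blocked = blocking_fraction(t, COMPUTE_WINDOW_S, overlapped=False)
        hidden = blocking_fraction(t, COMPUTE_WINDOW_S, overlapped=True)
        print(f"{gbps} Gbps: transfer {t:.1f} s, "
              f"stalled {blocked:.1%} blocking vs {hidden:.1%} overlapped")
```

Under these assumptions a single full exchange takes on the order of a minute or more, which would dominate training if it happened on every step. Spread across a long enough stretch of computation, the same transfer can be hidden entirely, which is the intuition behind avoiding blocking.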
Driving the evolution of AI training infrastructure
At Google, we take a comprehensive approach to AI training, spanning hardware, software infrastructure and research. Increasingly, the gains come from rethinking how these layers fit together.
Decoupled DiLoCo is one example. By enabling training jobs to run over internet-scale bandwidth, it can use idle computing power wherever it sits, turning stranded resources into usable capacity.
In addition to performance and resiliency, this training paradigm also unlocks the ability to combine different hardware generations, such as TPU v6e and TPU v5p, in a single training run. This approach not only extends the life of existing hardware, but also increases the total computing power available for model training. In our experiments, different generations of chips running at different speeds still matched the ML performance of training runs that used a single chip type, so even older hardware can meaningfully speed up AI training.
Moreover, because new generations of hardware do not arrive everywhere at once, the ability to train across generations can alleviate recurring logistical and capacity bottlenecks.
As we push the boundaries of AI infrastructure, we continue to explore approaches to building the resilient systems needed to unlock the next generation of AI.
