Decoupled DiLoCo: A New Frontier for Resilient, Distributed AI Training

Decoupled DiLoCo is not only more fault-tolerant; it is also practical for fully distributed, production-level pre-training. We successfully trained a 12-billion-parameter model across four separate U.S. regions using 2-5 Gbps of wide-area networking (a level that is relatively achievable with existing internet connectivity between data center facilities, rather than requiring new, custom inter-facility network infrastructure). Notably, the system reached this training result more than 20 times faster than conventional synchronization methods. This is because our system folds the required communication into longer periods of computation, avoiding “blocking” bottlenecks in which one part of the system has to wait for another.
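To see why hiding communication behind computation matters at these bandwidths, a rough estimate may help. The following is a back-of-the-envelope sketch, not a description of the actual system: it assumes 2 bytes per parameter, one full copy of the 12-billion-parameter model sent per synchronization, and a hypothetical 1-second training step. The function names (`transfer_seconds`, `blocking_time`, `overlapped_time`) are illustrative, and the numbers are not meant to reproduce the reported 20x figure.

```python
def transfer_seconds(num_params, bytes_per_param, gbps):
    """Time to send one full copy of the model over a link of `gbps` gigabits/s.

    Assumption: no compression, sharding, or protocol overhead.
    """
    bits = num_params * bytes_per_param * 8
    return bits / (gbps * 1e9)


def blocking_time(steps, compute_s, comm_s):
    """Every step computes, then waits for a full synchronization."""
    return steps * (compute_s + comm_s)


def overlapped_time(steps, compute_s, comm_s, sync_every):
    """Synchronize once per `sync_every` steps, in the background.

    The transfer runs alongside the next window of computation, so a round
    only stalls when the transfer outlasts that window.
    """
    rounds = steps // sync_every
    window = sync_every * compute_s
    stall_per_round = max(0.0, comm_s - window)
    return steps * compute_s + rounds * stall_per_round


if __name__ == "__main__":
    # 12e9 parameters and 2-5 Gbps come from the text; 2 bytes/param is assumed.
    slow = transfer_seconds(12e9, 2, 2.0)  # 96.0 seconds at 2 Gbps
    fast = transfer_seconds(12e9, 2, 5.0)  # 38.4 seconds at 5 Gbps
    print(f"one sync: {slow:.1f}s at 2 Gbps, {fast:.1f}s at 5 Gbps")

    # Hypothetical 1-second step, 100 steps, worst-case 2 Gbps link.
    print("blocking  :", blocking_time(100, 1.0, slow), "s")
    print("overlapped:", overlapped_time(100, 1.0, slow, sync_every=100), "s")
```

Under these assumptions, one synchronization takes roughly 96 seconds at 2 Gbps, so waiting on it after every step would dominate the run, while a transfer that fits inside the next window of local computation adds no stall at all.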

Driving the evolution of AI training infrastructure

At Google, we take a comprehensive approach to AI training, spanning hardware, software infrastructure and research. Increasingly, the gains come from rethinking how these layers fit together.

Decoupled DiLoCo is one example. By enabling training jobs to run over internet-scale bandwidth, it can use idle compute wherever it sits, turning stranded resources into usable capacity.

In addition to performance and resiliency, this training paradigm also unlocks the ability to combine different hardware generations, such as TPU v6e and TPU v5p, in a single training run. This approach not only extends the useful life of existing hardware, but also increases the total compute available for model training. In our experiments, chips from different generations running at different speeds still matched the ML performance of training runs on a single chip type, so even older hardware can meaningfully accelerate AI training.
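One way to picture how workers of unequal speed can contribute to the same run is the toy sketch below. It illustrates the general local-update idea behind DiLoCo-style methods (each worker trains locally, then an averaged update is applied to the shared parameters) and is not the production algorithm. Everything in it is an assumption made for illustration: the one-dimensional quadratic objective, the plain gradient descent and plain averaging (published DiLoCo descriptions use more sophisticated inner and outer optimizers), and the choice to model a slower chip as a worker that completes fewer local steps per round.

```python
def local_steps(w, target, lr, n_steps):
    """Run n_steps of gradient descent on f(w) = 0.5 * (w - target) ** 2."""
    for _ in range(n_steps):
        grad = w - target
        w -= lr * grad
    return w


def outer_round(global_w, targets, steps_per_worker, inner_lr=0.1, outer_lr=1.0):
    """One round: every worker trains locally, then one averaged update is applied."""
    deltas = []
    for target, n_steps in zip(targets, steps_per_worker):
        local_w = local_steps(global_w, target, inner_lr, n_steps)
        # How far this worker moved away from the shared parameters.
        deltas.append(global_w - local_w)
    avg_delta = sum(deltas) / len(deltas)
    return global_w - outer_lr * avg_delta


def train(rounds, steps_per_worker, targets=(1.0, 1.0), start=0.0):
    """Repeat outer rounds; steps_per_worker models each worker's speed."""
    w = start
    for _ in range(rounds):
        w = outer_round(w, targets, steps_per_worker)
    return w


if __name__ == "__main__":
    # Hypothetical setup: a "fast" worker does 20 local steps per round,
    # a "slow" worker does 10 in the same wall-clock window.
    print("mixed speeds  :", train(rounds=50, steps_per_worker=(20, 10)))
    print("uniform speeds:", train(rounds=50, steps_per_worker=(20, 20)))
```

In this toy setting both configurations converge to the same optimum, and the slower worker still contributes useful progress each round instead of holding the faster one back.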

Moreover, because new generations of hardware do not arrive everywhere at once, the ability to train across generations can alleviate recurring logistical and capacity bottlenecks.

As we push the boundaries of AI infrastructure today, we continue to explore approaches to building the resilient systems needed to unlock the next generation of AI.
