Despite the stunning advances in artificial intelligence in recent years, robots remain stubbornly limited. Those found in factories and warehouses typically run through precisely choreographed routines, with little ability to perceive their surroundings or adapt on the fly. The few industrial robots that can see and grasp objects handle only a narrow set of tasks with limited dexterity, because they lack general physical intelligence.
More generally capable robots could take on a much wider range of industrial tasks, perhaps after only minimal demonstrations. Robots will also need more general skills to cope with the enormous variability and clutter of human homes.
General excitement about advances in artificial intelligence has already translated into optimism about significant new advances in robotics. Elon Musk’s car company, Tesla, is developing a humanoid robot called Optimus, and Musk recently suggested that it will be widely available for $20,000 to $25,000 and able to perform most tasks by 2040.
Previous efforts to teach robots to perform difficult tasks have focused on training a single machine on a single task, because the learning seemed impossible to transfer. Some recent academic work has shown that, with enough scale and refinement, learning can be transferred across different tasks and robots. Google’s 2023 Open X-Embodiment project shared robotic learning across 22 different robots in 21 research laboratories.
A key challenge with Physical Intelligence’s strategy is that robot data is not available at anything like the scale of the text used to train large language models. The company must therefore generate its own data and develop techniques to improve learning from a more limited data set. To develop π0, the company combined so-called vision-language models, which are trained on images paired with text, with diffusion modeling, a technique borrowed from AI image generation, to enable a more general kind of learning.
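The broad idea can be illustrated with a toy sketch: an encoder maps an image and a text instruction to a conditioning vector, and a diffusion-style sampler starts from random noise and iteratively refines it into an action, conditioned on that vector. Everything below is hypothetical and drastically simplified (the real π0 uses large pretrained networks, not the stand-in functions here); it only shows the shape of the pipeline, not the actual method.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_observation(image: np.ndarray, instruction: str) -> np.ndarray:
    """Hypothetical stand-in for a pretrained vision-language encoder.

    Pools the image into a crude feature vector and appends two toy
    text features, yielding a single conditioning vector.
    """
    image_feat = image.mean(axis=(0, 1))  # (3,) pooled RGB feature
    text_feat = np.array([len(instruction) % 7, len(instruction.split())], float)
    return np.concatenate([image_feat, text_feat])  # (5,)

def denoise_step(action: np.ndarray, cond: np.ndarray,
                 w: np.ndarray, t: float) -> np.ndarray:
    """One toy denoising step: nudge the noisy action toward the
    action predicted from the conditioning vector (a linear 'network')."""
    predicted = w @ cond                      # predicted clean action
    return action + t * (predicted - action)  # move partway toward it

def sample_action(image: np.ndarray, instruction: str,
                  w: np.ndarray, steps: int = 10) -> np.ndarray:
    """Diffusion-style sampling: start from pure noise, then refine."""
    cond = encode_observation(image, instruction)
    action = rng.normal(size=w.shape[0])  # initial noise
    for i in range(steps):
        # step size grows so the final step lands on the prediction
        action = denoise_step(action, cond, w, t=1.0 / (steps - i))
    return action
```

In this sketch the sampler converges exactly to the linear prediction; in a real system the denoiser is a learned network and the schedule is more elaborate, but the control flow (condition on perception and language, then iteratively refine a noisy action) is the part the analogy is meant to convey.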
For robots to be able to perform whatever tasks a person asks them to do, this learning needs to be scaled up significantly. “We still have a long way to go, but we have what you might call a scaffolding of what’s to come,” says Levine.
