In the classic cartoon “The Jetsons,” Rosie the robotic maid seamlessly switches from vacuuming the house to cooking dinner to taking out the garbage. In real life, however, training a general-purpose robot remains a significant challenge.
Typically, engineers collect data tailored to a particular robot and task, then use it to train the robot in a controlled environment. However, gathering this data is costly and time-consuming, and the robot will likely struggle to adapt to environments or tasks it has not seen before.
To train better general-purpose robots, MIT researchers have developed a versatile technique that combines vast amounts of heterogeneous data from many sources into a single system that can teach any robot a wide range of tasks.
Their method aligns data from different domains, such as simulations and real robots, and from multiple modalities, including vision sensors and robot-arm position encoders, into a common “language” that a generative artificial intelligence model can process.
By pooling this enormous amount of data, the approach can be used to train a robot to perform a variety of tasks without starting from scratch each time.
This method could be faster and cheaper than traditional techniques because it requires far less task-specific data. In addition, in both simulations and real-world experiments, it outperformed training from scratch by more than 20 percent.
“In robotics, people often say we don’t have enough training data. However, I think another major problem is that the data comes from so many different domains, modalities, and robot hardware. Our work shows how you can train a robot by combining them all,” says Lirui Wang, an electrical engineering and computer science (EECS) graduate student and lead author of a paper on this technique.
Wang’s co-authors include Jialiang Zhao, an EECS graduate student; Xinlei Chen, a research scientist at Meta; and senior author Kaiming He, an associate professor in EECS and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL). The research will be presented at the Conference on Neural Information Processing Systems.
Inspired by LLMs
A robot “policy” takes in sensor observations, such as camera images or proprioceptive measurements that track the speed and position of a robot arm, and then tells the robot how and where to move.
Policies are typically trained using imitation learning, meaning a human demonstrates actions or teleoperates the robot to generate data, which is fed into an AI model that learns the policy. Because this approach uses only a small amount of task-specific data, robots often fail when their environment or task changes.
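To make that concrete, here is a minimal behavior-cloning sketch in PyTorch; the observation and action dimensions, network shape, and training loop are illustrative assumptions, not the researchers’ actual setup:

```python
# A minimal behavior-cloning sketch (dimensions and data are illustrative).
# A policy network maps sensor observations to robot actions and is trained
# to reproduce the actions a human demonstrator took.
import torch
import torch.nn as nn

obs_dim, act_dim = 64, 7  # e.g., image features + joint states -> a 7-DoF arm command

policy = nn.Sequential(
    nn.Linear(obs_dim, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, act_dim),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

def train_step(observations, expert_actions):
    """One imitation-learning step: match the demonstrator's actions."""
    loss = nn.functional.mse_loss(policy(observations), expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# One step on a fake batch of 32 demonstrations:
print(train_step(torch.randn(32, obs_dim), torch.randn(32, act_dim)))
```

A policy trained this way is only as broad as its demonstrations, which is why small task-specific datasets generalize poorly.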
To develop a better approach, Wang and his colleagues drew inspiration from large language models such as GPT-4.
These models are pre-trained on an enormous amount of diverse language data and then fine-tuned with a small amount of task-specific data. Pre-training on so much data helps the models adapt to perform well across a variety of tasks.
“In the language domain, the data are all just sentences. In robotics, given all the heterogeneity in the data, if you want to pre-train in a similar manner, we need a different architecture,” Wang says.
Robotic data takes many forms, from camera images to language instructions to depth maps. At the same time, each robot is mechanically unique, with a different number and orientation of arms, grippers and sensors. Moreover, the environments in which data is collected vary greatly.
MIT researchers have developed a new architecture called heterogeneous pretrained transformers (HPT) that unifies data from these different modalities and domains.
At the heart of their architecture is a machine-learning model known as a transformer, which processes vision and proprioception inputs. A transformer is the same type of model that forms the backbone of large language models.
The researchers align data from vision and proprioception into the same type of input, called a token, which the transformer can process. Each input is represented by the same fixed number of tokens.
The transformer then maps all the inputs into one shared space, growing into a huge pre-trained model as it processes and learns from more data. The larger the transformer, the better it performs.
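For readers who want a concrete picture, below is an illustrative sketch of this design in PyTorch, not the team’s actual code: each modality has its own “stem” that emits a fixed number of tokens, and one shared transformer trunk processes the combined sequence. The token counts, dimensions, and layer sizes are assumptions.

```python
# Illustrative sketch of the HPT idea (not the authors' code): each modality
# has its own "stem" that emits a fixed number of tokens, and one shared
# transformer trunk processes the combined token sequence.
import torch
import torch.nn as nn

TOKENS_PER_MODALITY = 16   # fixed token budget per modality (assumed value)
D_MODEL = 256              # shared embedding width (assumed value)

class ModalityStem(nn.Module):
    """Projects raw modality features into a fixed-length token sequence."""
    def __init__(self, input_dim):
        super().__init__()
        self.proj = nn.Linear(input_dim, TOKENS_PER_MODALITY * D_MODEL)

    def forward(self, x):  # x: (batch, input_dim)
        return self.proj(x).view(x.shape[0], TOKENS_PER_MODALITY, D_MODEL)

class SharedTrunk(nn.Module):
    """One transformer shared across robots, modalities, and domains."""
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)

    def forward(self, token_seq):  # (batch, n_tokens, D_MODEL)
        return self.encoder(token_seq)

vision_stem  = ModalityStem(input_dim=512)  # e.g., an image embedding
proprio_stem = ModalityStem(input_dim=14)   # e.g., joint positions + velocities
trunk = SharedTrunk()

vision, proprio = torch.randn(2, 512), torch.randn(2, 14)
tokens = torch.cat([vision_stem(vision), proprio_stem(proprio)], dim=1)
shared_features = trunk(tokens)             # shape: (2, 32, 256)
```

Because every stem emits the same token format, data from any robot or sensor suite can flow through the same trunk.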
A user only needs to feed HPT a small amount of data about their robot’s design, configuration, and the task they want it to perform. HPT then transfers the knowledge the transformer gained during pre-training to learn the new task.
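Continuing the sketch above, fine-tuning might look like freezing the pre-trained trunk and training only a small, robot-specific action head on a handful of demonstrations; the details here are hypothetical:

```python
# Hypothetical fine-tuning step, continuing the sketch above: freeze the
# pre-trained trunk and train only a small robot-specific action head.
action_head = nn.Linear(D_MODEL, 7)        # 7-DoF arm command (assumed)
for p in trunk.parameters():
    p.requires_grad = False                # reuse pre-trained knowledge as-is

optimizer = torch.optim.Adam(action_head.parameters(), lr=1e-4)
features = trunk(tokens).mean(dim=1)       # pool tokens -> (batch, D_MODEL)
loss = nn.functional.mse_loss(action_head(features), torch.randn(2, 7))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```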
Enabling skillful movements
One of the biggest challenges in developing HPT was building the massive dataset for pre-training the transformer, which drew on 52 datasets containing more than 200,000 robot trajectories across four categories, including human demonstration videos and simulation.
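A rough sketch of how such heterogeneous sources could be pooled into one training corpus follows; the contents and sizes are placeholders, not the actual 52 datasets:

```python
# A rough sketch of pooling heterogeneous sources into one corpus
# (contents and sizes are placeholders, not the actual 52 datasets).
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

# Each source: (observation features, action labels) pairs.
real_robot  = TensorDataset(torch.randn(1_000, 64), torch.randn(1_000, 7))
simulation  = TensorDataset(torch.randn(5_000, 64), torch.randn(5_000, 7))
human_video = TensorDataset(torch.randn(2_000, 64), torch.randn(2_000, 7))

corpus = ConcatDataset([real_robot, simulation, human_video])
loader = DataLoader(corpus, batch_size=64, shuffle=True)  # mix sources per batch
```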
The researchers also had to develop an efficient way to convert raw proprioceptive signals from an array of sensors into data that the transformer could handle.
“Proprioception is key to enabling many skillful movements. Since the number of tokens in our architecture is always the same, we give equal importance to proprioception and vision,” explains Wang.
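The sketch below illustrates that point: two mechanically different robots report joint vectors of different sizes, yet each is mapped to the same fixed number of tokens. The tokenizer design and all dimensions are assumptions for illustration.

```python
# Two robots with different joint counts, identical token output shape
# (tokenizer design and all dimensions are illustrative assumptions).
import torch
import torch.nn as nn

class ProprioTokenizer(nn.Module):
    """Maps a robot-specific joint vector to a fixed-length token sequence."""
    def __init__(self, num_joints, n_tokens=16, d_model=256):
        super().__init__()
        self.n_tokens, self.d_model = n_tokens, d_model
        self.mlp = nn.Sequential(
            nn.Linear(num_joints, 128), nn.ReLU(),
            nn.Linear(128, n_tokens * d_model),
        )

    def forward(self, joint_state):  # (batch, num_joints)
        return self.mlp(joint_state).view(-1, self.n_tokens, self.d_model)

arm_7dof  = ProprioTokenizer(num_joints=7)
arm_14dof = ProprioTokenizer(num_joints=14)
print(arm_7dof(torch.randn(2, 7)).shape)    # torch.Size([2, 16, 256])
print(arm_14dof(torch.randn(2, 14)).shape)  # torch.Size([2, 16, 256])
```

Giving every robot the same token budget is what lets proprioception carry as much weight as vision inside the shared trunk.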
When they tested HPT, it improved robot performance by more than 20 percent on simulation and real-world tasks compared with training from scratch each time. Even when the task was very different from the pre-training data, HPT still improved performance.
“This paper presents a novel approach to training a single policy across many robot embodiments. This enables training across diverse datasets, allowing robot learning methods to significantly scale up the size of the datasets they can train on. It also allows the model to quickly adapt to new robot embodiments, which is important as new robot designs are continuously being produced,” says David Held, an associate professor at the Carnegie Mellon University Robotics Institute, who was not involved in this work.
In the future, the researchers want to study how data diversity can boost HPT’s performance. They also want to enhance HPT so it can process unlabeled data, the way GPT-4 and other large language models do.
“Our dream is to have a universal robot brain that you could download and use for your robot without any training at all. While we are just in the early stages, we are going to keep pushing hard and hope scaling leads to a breakthrough in robot policies, as it did with large language models,” Wang says.
This work was funded in part by the Amazon Greater Boston Tech Initiative and the Toyota Research Institute.