Imagine you buy a robot to do your housework. That robot was built and trained in a factory to do a specific set of tasks, and it has never seen the objects in your home. When you ask it to pick up a cup from the kitchen table, it might not recognize your cup (perhaps because the cup has an unusual image of, say, MIT’s mascot, Tim the Beaver). So the robot fails.
“Right now, the way we train these robots, when they fail, we don’t really know why. So you just throw up your hands and say, ‘OK, I guess we have to start over.’ The critical piece that’s missing from this system is allowing the robot to demonstrate why it’s failing so the user can give it feedback,” says Andi Peng, a graduate student in electrical engineering and computer science (EECS) at MIT.
Peng and her colleagues at MIT, New York University, and the University of California, Berkeley, created a framework that allows people to quickly teach a robot what they want from it, with minimal effort.
When the robot fails, the system uses an algorithm to generate counterfactual explanations that describe what would have had to change for the robot to succeed. For example, perhaps the robot would have been able to pick up the cup if the cup had been a certain color. It shows these counterfactual explanations to the human and asks for feedback on why the robot failed. The system then uses that feedback, together with the counterfactual explanations, to generate new data, which it uses to fine-tune the robot.
Fine-tuning involves modifying a machine learning model that has already been trained to perform one task so that it can perform a second, similar task.
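As a rough illustration, fine-tuning on a handful of new demonstrations can look like a short behavior-cloning loop. The sketch below assumes a PyTorch policy network and a dataset of (observation, action) tensor pairs; the names and hyperparameters are illustrative assumptions, not the researchers' actual code.

```python
# Minimal fine-tuning sketch (illustrative only, not the authors' implementation).
# Assumes `policy` is an already-trained network and `demos` yields
# (observation, action) tensor pairs collected for the new task.
import torch
import torch.nn as nn

def fine_tune(policy: nn.Module, demos, epochs: int = 5, lr: float = 1e-4) -> nn.Module:
    """Gently adapt a pretrained policy to a new, similar task."""
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)  # small lr: adjust, don't relearn
    loss_fn = nn.MSELoss()  # simple behavior-cloning loss on continuous actions
    policy.train()
    for _ in range(epochs):
        for obs, action in demos:
            optimizer.zero_grad()
            loss = loss_fn(policy(obs), action)  # nudge predicted actions toward the demo
            loss.backward()
            optimizer.step()
    return policy
```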
The researchers tested the technique in simulations and found that it could train a robot more efficiently than other methods. Robots trained with the framework performed better, while the training process required less of a human's time.
This framework could help robots learn more quickly in new environments without requiring the user to have technical knowledge. In the long term, it could be a step toward enabling general-purpose robots to efficiently perform everyday tasks for the elderly or disabled in a variety of environments.
Lead author Peng was joined by co-authors Aviv Netanyahu, an EECS graduate student; Mark Ho, an assistant professor at Stevens Institute of Technology; Tianmin Shu, a postdoctoral fellow at MIT; Andreea Bobu, a graduate student at UC Berkeley; and senior authors Julie Shah, a professor of aeronautics and astronautics at MIT and director of the Interactive Robotics Group at the Computer Science and Artificial Intelligence Laboratory (CSAIL), and Pulkit Agrawal, an EECS professor and fellow at CSAIL. The research results will be presented at the International Conference on Machine Learning.
On-the-Job Training
Robots often fail due to distribution shift: the robot is presented with objects and spaces it has not seen during training, and it does not understand what to do in this new environment.
One way to retrain a robot for a specific task is through imitation learning. The user can demonstrate the correct task to teach the robot what to do. If the user tries to teach the robot to pick up a cup, but demonstrates it with a white cup, the robot can learn that all cups are white. It might then fail to pick up the red, blue, or brown “Tim-the-Beaver” cup.
To teach a robot to recognize that a cup is a cup, regardless of color, would require thousands of demonstrations.
“I don’t want to demonstrate with 30,000 cups. I want to demonstrate with one cup. But then I have to teach the robot to recognize that it can pick up any color cup,” Peng says.
To achieve this, the researchers’ system determines what specific object the user cares about (the cup) and which elements are not essential to the task (perhaps the color of the cup is irrelevant). It uses this information to generate new, synthetic data by changing these “irrelevant” visual concepts. This process is known as data augmentation.
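Data augmentation is a standard idea in machine learning. As a generic illustration (using torchvision's built-in image transforms, not the paper's specific procedure), a single demonstration frame can be turned into many visually varied copies:

```python
# Generic data-augmentation illustration, not the paper's method:
# vary task-irrelevant appearance (color, orientation) of one image.
from torchvision import transforms
from PIL import Image

augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.5, hue=0.4),
    transforms.RandomHorizontalFlip(p=0.5),
])

def make_variants(frame: Image.Image, n: int = 100) -> list:
    """Produce n visually varied copies of one demonstration frame."""
    return [augment(frame) for _ in range(n)]
```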
The framework has three steps. First, it presents the task that caused the robot to fail. Then it collects a demonstration of the desired actions from the user and generates counterfactuals by searching over all the features in the space, showing what would need to change for the robot to succeed.
The system shows these counterfactuals to the user and asks for feedback to determine which visual concepts do not affect the desired action. It then uses this human feedback to generate many new, augmented demonstrations.
In this way, the user might demonstrate picking up one cup, but the system would produce demonstrations showing the desired action with thousands of different cups by varying the color. It uses this data to fine-tune the robot.
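A loose sketch of how the three steps might fit together is below. Every function name and the dictionary-based scene representation are hypothetical placeholders chosen for illustration; this is not the published implementation, and in practice the augmented scenes would have to be rendered into the observations the robot's policy actually consumes.

```python
# Hypothetical end-to-end sketch of the three steps described above.
def generate_counterfactuals(failed_scene: dict) -> list:
    # Step 1: propose small edits to the failed scene that might have let
    # the robot succeed, e.g. "the same scene, but the cup is white."
    return [{**failed_scene, "cup_color": c} for c in ("white", "red", "blue")]

def ask_user_irrelevant_concepts(counterfactuals: list) -> list:
    # Step 2: the user reviews the counterfactuals and names the visual
    # concepts that do not matter for the task (stubbed here).
    return ["cup_color"]

def augment(demo: dict, concept: str, values: tuple) -> list:
    # Step 3: expand one demonstration into many by varying the irrelevant
    # concept; the recorded actions stay the same.
    return [{**demo, concept: v} for v in values]

def teach_from_one_failure(failed_scene: dict, user_demo: dict) -> list:
    counterfactuals = generate_counterfactuals(failed_scene)   # what would have to change?
    irrelevant = ask_user_irrelevant_concepts(counterfactuals) # human feedback
    synthetic = []
    for concept in irrelevant:
        synthetic += augment(user_demo, concept, ("white", "red", "blue", "brown"))
    return synthetic  # this augmented set is what the robot is fine-tuned on
```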
Peng believes that generating counterfactual explanations and soliciting feedback from the user are key to the technique's success.
From Human Reasoning to Robot Reasoning
Because their work aims to include humans in the training loop, the researchers tested their technique on human users. First, they conducted a study in which they asked people whether counterfactual explanations helped them identify elements that could be changed without affecting the task.
“It was so clear from the very beginning. Humans are so good at this kind of counterfactual reasoning. And that counterfactual step allows human reasoning to be translated into robot reasoning in a way that makes sense,” she says.
They then applied their framework to three simulations in which robots were tasked with navigating to a target object, picking up a key to open a door, and picking up a desired object and then placing it on a table. In each case, their method allowed the robot to learn faster than other techniques, while requiring fewer demonstrations from users.
Going forward, the researchers hope to test the framework on real robots. They also want to focus on reducing the time it takes the system to generate new data using generative machine learning models.
“We want robots to do what humans do, and we want them to do it in a semantically meaningful way. Humans tend to operate in this abstract space where they don’t think about every single property of an image. Ultimately, it’s about enabling the robot to learn a good, human representation at an abstract level,” Peng says.
This research is supported in part by a National Science Foundation Graduate Research Fellowship, Open Philanthropy, the Apple AI/ML Fellowship, Hyundai Motor Corporation, MIT-IBM Watson AI Lab, and the National Science Foundation Institute for Artificial Intelligence and Fundamental Interactions.