Imagine that, in the near future, you are working in a warehouse or office and are asked to help a new intern learn the basics of the job. The catch: it’s a robot. To teach it, you can play a game of “show and tell,” physically showing it how to do something in several different ways while explaining what you are doing.
Let’s say you ask a robot to put coffee on your desk without disturbing you during a Zoom call. You would prefer that the robot not get too close to you or your laptop, so it doesn’t disrupt the meeting. To enable this behavior, the robot must be trained on data that clearly shows how the entire task is carried out. Computer scientists have tried to explain manipulation tasks to robots by recording many physical demonstrations or writing lengthy instructions. However, without both, the machine will likely misunderstand what it needs to do.
Showing and telling all of this is laborious for humans, so researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) automated the robot’s training process so that it clarifies instructions on its own and uses nearly five times less demonstration data. Their Masked Inverse Reinforcement Learning (Masked IRL) approach uses a large language model (LLM) to elaborate on ambiguous prompts based on data collected from the user’s demonstrations. A second LLM then narrows down the details the algorithm should include in its motion plan so the robot can safely perform work in homes, offices, and factories.
“Our approach can be useful when a human is interacting with a robot but doesn’t want to describe all the details of the task,” says MIT graduate student and CSAIL researcher Minyoung Hwang, who is the lead author of a paper presenting the project. “We minimize human effort by enabling machines to get to the heart of what users really want.”
According to Hwang, Masked IRL can help robots maneuver safely in places containing elements that a human might not describe in a prompt but that are nevertheless crucial. For example, a machine grabbing a snack from the kitchen may not know to avoid colliding with a laptop. Similarly, a factory robot placing products into different boxes must carefully navigate around the shelves.
To learn new tasks in such situations, Masked IRL uses the robot’s sensors to capture information about its surroundings. These components also record every movement during a kinesthetic demonstration, a training approach in which a human physically moves the robot to perform a specific action. It’s a bit like being a physical therapist for a machine, bending its joints in specific directions to show the robot how to grasp, move, and place objects.
The MIT system then calls on an LLM to compare this sequence of movements (called a trajectory) with the shortest possible path. The model also clarifies what may be unclear in the prompt by turning a request like “stay close” into “stay close to the table surface.” Using the trajectory comparison and the clarified prompts, the LLM begins to understand why the movements it was shown are important to the task.
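The comparison between a demonstrated trajectory and the shortest possible path can be illustrated with a small sketch. This is not the researchers’ implementation: the function names, the use of a straight line as the “shortest path,” and the two summary numbers are assumptions chosen for illustration. The idea is that a large detour hints that the demonstrator was deliberately avoiding or following something.

```python
import math

def path_length(points):
    """Sum of straight-line distances between consecutive waypoints."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))

def distance_to_segment(p, start, goal):
    """Distance from point p to the straight segment from start to goal."""
    direction = tuple(g - s for s, g in zip(start, goal))
    norm_sq = sum(c * c for c in direction)
    if norm_sq == 0:
        return math.dist(p, start)
    # Project p onto the segment and clamp to its endpoints.
    t = sum((pc - sc) * dc for pc, sc, dc in zip(p, start, direction)) / norm_sq
    t = max(0.0, min(1.0, t))
    closest = tuple(sc + t * dc for sc, dc in zip(start, direction))
    return math.dist(p, closest)

def compare_with_shortest(trajectory):
    """Return (extra_length, max_offset) relative to the direct path.

    Assumption: the shortest path is the straight line from the first
    waypoint to the last one, ignoring obstacles.
    """
    start, goal = trajectory[0], trajectory[-1]
    extra_length = path_length(trajectory) - math.dist(start, goal)
    max_offset = max(distance_to_segment(p, start, goal) for p in trajectory)
    return extra_length, max_offset

# Hypothetical demonstration: the hand swings sideways on its way to the goal.
demo = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (2.0, 0.0, 0.0)]
extra, offset = compare_with_shortest(demo)
```

In this sketch, `extra` and `offset` are the kind of summary that could be passed to a language model alongside the user’s prompt; how the actual system encodes the comparison is not described in this article.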
A second LLM then evaluates details of the environment, such as the location of obstacles and the shape of the robot’s target object. During this process, it rates each detail as “1” (important) or “0” (not important) and “masks” (in other words, ignores) those it considers irrelevant to the task at hand. For example, whether or not the user was leaning on a table during the demonstration receives a “0,” meaning it doesn’t matter. Any detail rated “1” is taken into account by the algorithm in the final action plan.
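The masking step can be pictured as a binary filter applied to the features of a scene before they are scored. The sketch below is a simplified illustration, not the authors’ code: the feature names, the weights, and the linear scoring function are all hypothetical. It shows that once a feature is rated 0, changing its value no longer affects the score.

```python
def apply_mask(state, mask):
    """Keep only features rated 1; features rated 0 or unrated are dropped."""
    return {name: value for name, value in state.items() if mask.get(name, 0) == 1}

def masked_score(state, weights, mask):
    """Score a state with a weighted sum over unmasked features only.

    Assumption: a linear score stands in for whatever learned reward
    the real system uses.
    """
    relevant = apply_mask(state, mask)
    return sum(weights.get(name, 0.0) * value for name, value in relevant.items())

# Hypothetical scene features and ratings (1 = important, 0 = not important).
mask = {"distance_to_laptop": 1, "height_above_table": 1, "user_leaning_on_table": 0}
weights = {"distance_to_laptop": 2.0, "height_above_table": -1.0, "user_leaning_on_table": 5.0}

leaning = {"distance_to_laptop": 0.4, "height_above_table": 0.1, "user_leaning_on_table": 1.0}
not_leaning = {"distance_to_laptop": 0.4, "height_above_table": 0.1, "user_leaning_on_table": 0.0}

score_leaning = masked_score(leaning, weights, mask)
score_not_leaning = masked_score(not_leaning, weights, mask)
```

Both scores come out the same because the only feature that differs between the two scenes is masked out, which mirrors the table-leaning example in the text.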
These masks gave Masked IRL a key advantage over comparable baselines in both 3D simulations and real-world demonstrations by teaching the robot which information to prioritize. With the researchers’ system, both virtual and real robots skillfully maneuvered objects around obstacles, for example moving a coffee mug around a laptop to different spots on the table. In these tasks, Masked IRL correctly identified preferences that users did not explicitly state in their prompts up to 15 percent more often than comparable baselines.
During simulation experiments, CSAIL researchers also found that Masked IRL learned quickly: understanding how to move the cup required fewer demonstrations than the baselines needed. They also found that robots performed better when the LLM clarified the instructions, rather than forcing the machine to follow unclear requests.
This more focused approach also translated well to a real robotic arm executing commands the system hadn’t seen during training. After being trained on 50 kinesthetic demonstrations, the robot carefully moved the cup toward the human while avoiding a collision with the user’s computer, an obstacle it learned to avoid by elaborating on a more general “stay away” request. It also wiped the table while “staying close” and handed the user a bag of chips while “staying away” from both the person and the table.
Masked IRL senses and clarifies what users leave unsaid, and it may soon “see” it too. CSAIL scientists plan to make their approach more dynamic by equipping it with cameras that let the robot take photos of its surroundings. It could then highlight and focus on specific items nearby. For example, if you ask the machine to pick up a toy, it might see some bananas nearby and ignore them before moving on to the target object.
Hwang wrote the paper with three CSAIL colleagues: graduate student Alexandra Forsey-Smerek ’20, SM ’22; postdoc Nathaniel Dennler; and MIT assistant professor Andreea Bobu, who is a member of the Department of Aeronautics and Astronautics and CSAIL. Their work was supported in part by the Tata Group through an MIT Generative AI Impact Consortium Award and the Department of Defense. They will present the project at the 2026 IEEE International Conference on Robotics and Automation in June.
