Tuesday, December 24, 2024

Design your home robots to have some common sense

From wiping up spills to serving food, robots are learning to perform increasingly sophisticated household tasks. Many such home robot trainees learn through imitation; they are programmed to copy the movements that a human physically guides them through.

It turns out that robots are excellent imitators. But unless engineers also program them to adjust to every possible bump and nudge, robots won’t necessarily know how to handle such situations, short of starting their task over from the top.

Now engineers at MIT want to give robots a little common sense when faced with situations that push them off their trained path. They have developed a method that combines robot movement data with the “common sense knowledge” of large language models, or LLMs.

Their approach allows a robot to logically parse a given household task into subtasks and to physically adapt to disruptions within a subtask, so the robot can continue working without having to go back and start the task from scratch – and without engineers having to explicitly program fixes for every possible failure along the way.

Photo courtesy of researchers.

“Imitation learning is a mainstream approach to enabling home robots. However, if the robot blindly imitates human movement trajectories, small errors can accumulate and ultimately derail the rest of the execution,” says Yanwei Wang, a graduate student in MIT’s Department of Electrical Engineering and Computer Science (EECS). “With our method, the robot can independently correct execution errors and improve the overall success of the task.”

Wang and his colleagues detail their new approach in a study they will present at the International Conference on Learning Representations (ICLR) in May. Co-authors of the study include EECS graduate students Tsun-Hsuan Wang and Jiayuan Mao; Michael Hagenow, a postdoc in the MIT Department of Aeronautics and Astronautics (AeroAstro); and Julie Shah, the H.N. Slater Professor of Aeronautics and Astronautics at MIT.

Language task

The researchers illustrate their new approach with a straightforward activity: scooping marbles out of one bowl and pouring them into another. To accomplish this task, engineers would typically move a robot through the motions of scooping and pouring—all along a single fluid trajectory. They might do this multiple times to give the robot a number of human demonstrations to follow.

“But the human demonstration is one long, continuous trajectory,” Wang says.

The team realized that although a human might demonstrate a task in one go, that task depends on a sequence of subtasks, or trajectories. For example, the robot must first reach into the bowl before it can scoop up the marbles, and it must scoop up the marbles before moving to the empty bowl, and so on. If the robot is pushed or prodded into making an error during any of these subtasks, its only recourse is to stop and start over, unless engineers were to explicitly label each subtask and program or collect new demonstrations for recovering from every possible failure, so that the robot could correct itself at any given moment.

“This level of planning is very tedious,” Wang says.

Instead, he and his colleagues found that some of this work could be performed automatically by LLMs. These deep learning models process huge libraries of text, which they use to establish connections between words, sentences, and paragraphs. From these connections, an LLM can then generate new sentences based on what it has learned about the kind of word that is likely to follow the last one.

For their part, the researchers found that in addition to sentences and paragraphs, an LLM can be prompted to produce a logical list of the subtasks involved in a given task. For example, if asked to list the steps involved in transferring marbles from one bowl to another, an LLM might generate a sequence of verbs such as “reach,” “scoop,” “transport,” and “pour.”
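As a rough illustration of this idea (not the researchers’ actual prompt or code), one could ask an off-the-shelf LLM to break the marble task into an ordered list of subtask verbs. In the sketch below, `query_llm` is a hypothetical stand-in for whatever chat-model API is used; it returns a canned reply so the example runs on its own.

```python
# Hypothetical sketch: prompting an LLM to decompose a task into subtask labels.
# `query_llm` stands in for a call to any pretrained chat model.

def query_llm(prompt: str) -> str:
    """Placeholder for a language-model call; returns a canned reply here."""
    return "1. reach\n2. scoop\n3. transport\n4. pour"

def decompose_task(task_description: str) -> list[str]:
    prompt = (
        f"List, in order, the subtasks a robot arm must perform to: {task_description}. "
        "Answer with one verb per line."
    )
    reply = query_llm(prompt)
    # Strip numbering and whitespace to get a clean list of subtask labels.
    return [line.split(".", 1)[-1].strip() for line in reply.splitlines() if line.strip()]

if __name__ == "__main__":
    print(decompose_task("scoop marbles from one bowl and pour them into another"))
    # -> ['reach', 'scoop', 'transport', 'pour']
```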

“LLMs can tell you how to complete each step of a task, in natural language. The continuous human demonstration is the embodiment of these steps in physical space,” says Wang. “We wanted to combine both so that the robot automatically knows what stage the task is at and can replan and recover on its own.”

Marble mapping

For their new approach, the team developed an algorithm that automatically connects an LLM’s natural-language label for a particular subtask with the robot’s position in physical space or an image that encodes the robot’s state. Mapping a robot’s physical coordinates, or an image of its state, to a natural-language label is known as “grounding.” The team’s new algorithm is designed to learn a grounding “classifier,” meaning it learns to automatically identify the semantic subtask the robot is in – such as “reach” or “scoop” – based on its physical coordinates or image view.
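A minimal sketch of what such a grounding classifier could look like, assuming the robot’s state is reduced to a small coordinate vector and that demonstration states have already been paired with LLM subtask labels. This uses a generic scikit-learn classifier with toy data for illustration; it is not the team’s algorithm.

```python
# Toy grounding classifier: map a robot state vector to a semantic subtask label.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy training data: 3-D end-effector positions recorded during demonstrations,
# each tagged with the subtask it belongs to (labels assumed to come from the LLM).
states = np.array([
    [0.10, 0.00, 0.30],   # reach
    [0.12, 0.02, 0.12],   # scoop
    [0.35, 0.05, 0.25],   # transport
    [0.60, 0.05, 0.20],   # pour
])
labels = ["reach", "scoop", "transport", "pour"]

classifier = KNeighborsClassifier(n_neighbors=1).fit(states, labels)

def current_subtask(state: np.ndarray) -> str:
    """Return the semantic subtask the robot appears to be in."""
    return classifier.predict(state.reshape(1, -1))[0]

print(current_subtask(np.array([0.34, 0.04, 0.26])))  # -> 'transport'
```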

“The grounding classifier facilitates a dialogue between what the robot is doing in the physical space and what the LLM knows about the subtasks, and the constraints to pay attention to within each subtask,” explains Wang.

The team demonstrated this approach in experiments with a robotic arm they trained on a marble-picking task. Experimenters trained the robot by physically guiding it through the task of first reaching into a bowl, scooping up marbles, carrying them over to an empty bowl, and pouring them in. After a few demonstrations, the team used a pretrained LLM and asked the model to list the steps for moving marbles from one bowl to another. The researchers then used their new algorithm to connect the LLM’s defined subtasks with the robot’s movement trajectory data. The algorithm automatically learned to map the robot’s physical coordinates along the trajectories, and the corresponding image views, to a given subtask.

The team then allowed the robot to perform the scooping task on its own, using its newly learned grounding classifiers. As the robot moved through the steps of the task, experimenters pushed and nudged it off its path and knocked marbles off its spoon at various points. Instead of stopping and starting over, or continuing blindly with no marbles on its spoon, the robot was able to self-correct and complete each subtask before moving on to the next. (For example, it would make sure it had successfully scooped the marbles before transporting them to the empty bowl.)
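Conceptually, this recovery behavior amounts to a control loop that checks, via the grounding classifier, which subtask the robot is actually in and whether that subtask has succeeded, and retries the step if not. The schematic below assumes hypothetical helper functions (`get_robot_state`, `execute_subtask`, `subtask_succeeded`) and is not the authors’ code.

```python
# Schematic self-correcting execution loop (hypothetical helpers, not the paper's code).
# The grounding classifier tells the controller which subtask the robot is actually in,
# so a disturbance mid-subtask triggers a retry of that subtask, not a full restart.

SUBTASK_PLAN = ["reach", "scoop", "transport", "pour"]  # from the LLM decomposition

def run_task(classifier, get_robot_state, execute_subtask, subtask_succeeded):
    for subtask in SUBTASK_PLAN:
        while True:
            execute_subtask(subtask)                   # replay the imitation policy for this step
            state = get_robot_state()                  # coordinates and/or image features
            observed = classifier.predict([state])[0]  # where the grounding classifier says we are
            if observed == subtask and subtask_succeeded(subtask, state):
                break                                  # subtask done; move on to the next one
            # Otherwise a bump, nudge, or dropped marble set the robot back:
            # retry this subtask from the current state instead of restarting the task.
```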

“With our method, when the robot makes mistakes, we don’t have to ask humans to program it or provide additional demonstrations of how to recover from failures,” Wang says. “This is extremely exciting because there is currently a huge effort put into training home robots using data collected from teleoperation systems. Our algorithm can now transform this training data into reliable robot behavior that can perform complex tasks despite external disturbances.”
