Imagine you have to tidy a messy kitchen, starting with a countertop littered with sauce packets. If your goal is simply to wipe the counter clean, you might sweep the packets up as a group. If, however, you want to pick out the mustard packets first and then throw away the rest, you would sort more discriminately, by sauce type. And if, among the mustards, you were after Grey Poupon, finding that particular brand would require an even finer search.
MIT engineers have developed a method that enables robots to make similarly intuitive, task-relevant decisions.
The team’s new approach, called Clio, enables a robot to identify the parts of a scene that matter for the tasks at hand. With Clio, the robot takes in a list of tasks described in natural language and, from those tasks, determines the level of detail at which to interpret its surroundings and “remember” only the portions of the scene that are relevant.
In real-world experiments ranging from a cluttered cubicle to a five-story building on MIT’s campus, the team used Clio to automatically segment a scene at different levels of detail, according to a set of tasks specified in natural-language prompts such as “move rack of magazines” and “get first aid kit.”
The team also ran Clio in real time on a quadruped robot. As the robot explored an office building, Clio identified and mapped only the parts of the scene that related to the robot’s tasks (such as retrieving a dog toy while ignoring piles of office supplies), allowing the robot to grasp the objects of interest.
Clio is named after the Greek muse of history, for its ability to identify and remember only the elements that matter for a given task. The researchers envision Clio being useful in many situations and environments in which a robot has to quickly survey its surroundings and make sense of them in the context of its assigned task.
“Search and rescue is the motivating application for this work, but Clio can also power domestic robots and robots working alongside humans on a factory floor,” says Luca Carlone, associate professor in MIT’s Department of Aeronautics and Astronautics (AeroAstro), principal investigator in the Laboratory for Information and Decision Systems (LIDS), and director of the MIT SPARK Laboratory. “It’s really about helping the robot understand the environment and what it has to remember in order to carry out its mission.”
The team details its results in a study published today in the journal Robotics and Automation Letters. Carlone’s co-authors include SPARK Lab members Dominic Maggio, Yun Chang, Nathan Hughes, and Lukas Schmid, and MIT Lincoln Laboratory members Matthew Trang, Dan Griffith, Carlyn Dougherty, and Eric Cristofalo.
Open fields
Tremendous advances in computer vision and natural language processing have enabled robots to identify objects in their surroundings. Until recently, however, robots could only do this in “closed-set” scenarios, where they are programmed to work in a carefully curated and controlled environment, with a finite number of objects that the robot has been pre-trained to recognize.
In recent years, researchers have taken a more “open” approach to enable robots to recognize objects in more realistic settings. In the field of open-set recognition, researchers have used deep-learning tools to build neural networks that can process billions of images from the internet, along with the text associated with each image (for example, a photo of a friend’s dog on Facebook, captioned “Meet my new puppy!”).
From millions of image-text pairs, a neural network learns, and can then identify, the segments of a scene that are characteristic of certain terms, such as a dog. The robot can then apply that neural network to spot a dog in an entirely new scene.
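The general recipe behind open-set recognition can be sketched with an off-the-shelf vision-language model. The snippet below is a minimal illustration rather than Clio’s actual pipeline: it assumes OpenAI’s CLIP package and a few pre-cropped image regions (the file names are invented for the example), and it scores each region against an open-vocabulary text query such as “a photo of a dog.”

```python
# Minimal sketch of open-set recognition with a vision-language model.
# Assumes OpenAI's CLIP package is installed; the cropped region files are
# hypothetical. This illustrates the general idea, not Clio's pipeline.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical crops of candidate scene segments.
region_paths = ["segment_0.png", "segment_1.png", "segment_2.png"]
regions = torch.stack([preprocess(Image.open(p)) for p in region_paths]).to(device)

# Open-vocabulary query: any natural-language term works, with no fixed label set.
text = clip.tokenize(["a photo of a dog"]).to(device)

with torch.no_grad():
    region_feats = model.encode_image(regions)
    text_feats = model.encode_text(text)

# Cosine similarity tells us which segment best matches the query.
region_feats = region_feats / region_feats.norm(dim=-1, keepdim=True)
text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
scores = (region_feats @ text_feats.T).squeeze(-1)
print("Best-matching segment:", scores.argmax().item())
```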
The challenge that remains, however, is how to parse a scene in a way that is useful for, and relevant to, the specific task.
“Typical methods select an arbitrary, fixed level of detail to determine how to combine segments of a scene into something that can be considered a single ‘object,’” Maggio says. “But the granularity of what you call an ‘object’ is actually tied to what the robot needs to do. If that granularity is fixed without the tasks in mind, the robot can end up with a map that isn’t useful for its tasks.”
Information bottleneck
When creating Clio, the MIT team wanted to enable robots to interpret their surroundings with a level of detail that could automatically adapt to current tasks.
For example, given the task of moving a stack of books onto a shelf, the robot should be able to determine that the entire stack of books is the task-relevant object. Likewise, if the task were to move only the green book from the rest of the stack, the robot should single out the green book as the target object and disregard the rest of the scene, including the other books in the stack.
The team’s approach combines state-of-the-art computer vision and large language models, comprising neural networks that make connections among millions of open-source images and semantic text. It also incorporates mapping tools that automatically split an image into many small segments, which can be fed into the neural network to determine whether certain segments are semantically similar. The researchers then apply an idea from classical information theory called the “information bottleneck,” which they use to compress a number of image segments in a way that picks out and stores only the segments that are semantically most relevant to the task at hand.
“For instance, say there’s a stack of books in a scene, and my task is only to get the green book. In that case, we push all this information about the scene through the bottleneck and end up with a cluster of segments that represent the green book,” Maggio explains. “All the other segments that are not relevant just get grouped in a cluster that we can simply remove. And we’re left with an object at the right granularity that is needed to support my task.”
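As a rough illustration of this idea, the toy sketch below greedily merges segment embeddings that look alike and then discards the merged groups that carry little information about a task embedding. The merging rule, thresholds, and embeddings are all invented for the example; Clio’s actual information-bottleneck formulation is the one described in the team’s paper.

```python
# Toy, task-driven compression of scene segments, loosely inspired by the
# information-bottleneck idea described above. Not Clio's algorithm.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def compress_segments(segment_embs, task_emb, keep_thresh=0.3, merge_thresh=0.85):
    """Greedily merge similar segments, then keep only the merged groups
    that remain relevant to the task embedding."""
    groups = [[i] for i in range(len(segment_embs))]
    embs = [e.copy() for e in segment_embs]

    # Merge the most similar pair of groups until no pair is similar enough.
    merged = True
    while merged and len(groups) > 1:
        merged = False
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                s = cosine(embs[i], embs[j])
                if s > merge_thresh and (best is None or s > best[0]):
                    best = (s, i, j)
        if best:
            _, i, j = best
            groups[i] += groups.pop(j)
            embs[i] = (embs[i] + embs.pop(j)) / 2.0
            merged = True

    # Discard groups that carry little information about the task.
    return [g for g, e in zip(groups, embs) if cosine(e, task_emb) > keep_thresh]

# Example: three segment embeddings and a task embedding favoring the first two.
segs = [np.array([1.0, 0.0]), np.array([0.9, 0.1]), np.array([0.0, 1.0])]
task = np.array([1.0, 0.0])
print(compress_segments(segs, task))  # [[0, 1]] -- the task-relevant "green book" segments
```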
The researchers demonstrated Clio in a variety of real-world environments.
“We thought it would be a really smart experiment to run Clio in my apartment, which I hadn’t cleaned before,” says Maggio.
The team drew up a list of natural-language tasks, such as “move pile of clothes,” and then applied Clio to images of Maggio’s cluttered apartment. In these cases, Clio was able to quickly segment the apartment scenes and feed the segments through the information-bottleneck algorithm to identify the segments that made up the pile of clothes.
They also ran Clio on Boston Dynamics’ four-legged robot, Spot. They gave the robot a list of tasks to complete, and as the robot explored and mapped the interior of an office building, Clio ran in real time on an on-board computer mounted on Spot, selecting segments from the mapped scenes that visually related to the given task. The method generated an overlay map showing only the target objects, which the robot then used to approach the identified objects and physically carry out the task.
“Running Clio in real time was a big accomplishment for the team,” Maggio says. “A lot of prior work can take several hours to run.”
In the future, the team plans to adapt Clio to perform higher-level tasks and take advantage of recent advances in photorealistic representations of visual scenes.
“We still give Clio tasks that are somewhat specific, like ‘find a deck of cards,’” Maggio says. “For search and rescue, it would need to be given more high-level tasks, like ‘find survivors’ or ‘restore power.’ So we want to get to a more human-level understanding of how to accomplish more complex tasks.”
This research was supported in part by the U.S. National Science Foundation, the Swiss National Science Foundation, MIT Lincoln Laboratory, the U.S. Office of Naval Research, and the U.S. Army Research Lab Distributed and Collaborative Intelligent Systems and Technology Collaborative Research Alliance.