Imagine visiting a friend abroad and looking into their fridge to see what could make a delicious breakfast. Many of the items seem foreign at first, each sealed in unfamiliar packaging and containers. Despite these visual differences, you can still figure out what each one is and pick it up when needed.
Inspired by humans’ ability to handle unfamiliar objects, a group at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) designed Feature Fields for Robotic Manipulation (F3RM), a system that blends 2D images with foundation model features into 3D scenes to help robots identify and grasp nearby objects. F3RM can interpret open-ended language prompts from humans, making the method useful in real-world environments that contain thousands of objects, such as warehouses and households.
F3RM gives robots the ability to interpret open-ended text prompts using natural language, helping the machines manipulate objects. As a result, the machines can understand less-specific requests from humans and still complete the desired task. For example, if a user asks the robot to “pick up a tall cup,” the robot can locate and grab the item that best fits that description.
“Creating robots that can actually generalize in the real world is extremely difficult,” says Ge Yang, a postdoc at the National Science Foundation AI Institute for Artificial Intelligence and Fundamental Interactions and MIT CSAIL. “We really want to figure out how to do this, so in this project we’re trying to push for an aggressive level of generalization, from just three or four objects to everything we find in MIT’s Stata Center. We wanted to figure out how to make robots as flexible as we are, because we can grab and place things even if we’ve never seen them before.”
Learning “what is where by looking”
This method could help robots pick items in large fulfillment centers, where clutter and unpredictability are inevitable. In these warehouses, robots are often given a description of the inventory they need to identify. The robots must match the text to an object, regardless of variations in packaging, so that customers’ orders are shipped correctly.
For example, the fulfillment centers of major online retailers can contain millions of items, many of which a robot will never have encountered before. To operate at that scale, the robot needs to understand the geometry and semantics of many different objects, some of which sit in tight spaces. With F3RM’s advanced spatial and semantic perception abilities, a robot could more efficiently locate an object, place it in a bin, and send it along for packaging. Ultimately, this would help warehouse workers ship customers’ orders more efficiently.
“One thing that often surprises people using F3RM is that the same system also works at room and building scales, and can be used to build simulation environments for robot learning and large-scale maps,” Yang says. “But before we scale this work up further, we first want to make the system run really fast. That way, we can use this type of representation for more dynamic robotic control tasks, hopefully in real time, so that robots handling more dynamic tasks can use it for perception.”
The MIT team notes that F3RM’s ability to understand different scenes could also make it useful in urban and household environments. For example, the approach could help personal robots identify and pick up specific objects. The system helps robots understand their surroundings, both physically and perceptually.
“David Marr defined visual perception as the problem of knowing ‘what is where by looking,’” says senior author Phillip Isola, associate professor of electrical engineering and computer science at MIT and CSAIL principal investigator. “Recent models have become really good at knowing what they are looking at; they can recognize thousands of object categories and provide detailed text descriptions of images. At the same time, radiance fields have become really good at representing where things are in a scene. Combining these two approaches can create a 3D representation of what is where, and our work shows that this combination is particularly useful for robotic tasks that require manipulating objects in 3D.”
Creating a “digital twin”
F3RM begins to understand its surroundings by taking photos with a camera mounted on a selfie stick. The mounted camera snaps 50 photos at different poses, enabling it to build a neural radiance field (NeRF), a deep learning method that takes 2D images to construct a 3D scene. This collage of RGB photos creates a “digital twin” of the surroundings in the form of a 360-degree representation of what’s nearby.
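The core of the NeRF step mentioned above is classical volume rendering: densities and colors sampled along each camera ray are composited into a pixel color, and the model is trained so these renders match the captured photos. The snippet below is a minimal, illustrative sketch of that compositing step in NumPy, not the authors’ code; the function and variable names are assumptions.

```python
import numpy as np

def composite_ray(densities, colors, deltas):
    """Alpha-composite samples along one camera ray (standard NeRF quadrature).
    densities: [N] volume densities, colors: [N, 3] RGB samples, deltas: [N] segment lengths."""
    alphas = 1.0 - np.exp(-densities * deltas)                      # per-segment opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))  # light surviving to each sample
    weights = alphas * trans                                        # contribution of each sample
    return (weights[:, None] * colors).sum(axis=0)                  # rendered RGB for this pixel

# Toy example: 64 random samples along one ray
n = 64
print(composite_ray(np.random.rand(n) * 0.5, np.random.rand(n, 3), np.full(n, 1.0 / n)))
```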
Alongside the highly detailed neural radiance field, F3RM also builds a feature field to augment geometry with semantic information. The system uses CLIP, a vision foundation model trained on hundreds of millions of images to efficiently learn visual concepts. By reconstructing the 2D CLIP features for the images taken by the selfie stick, F3RM effectively lifts those 2D features into a 3D representation.
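One way to picture this lifting step: for each pixel, the feature field’s per-sample features are composited with the same ray weights used for color, and the result is regressed toward the pixel’s 2D CLIP feature. The PyTorch sketch below is a hedged, minimal version of that idea, not the paper’s implementation; `distill_loss`, the tensor shapes, and the plain MSE objective are illustrative assumptions.

```python
import torch

def distill_loss(weights, sample_features, clip_target):
    """weights: [N] volume-rendering weights along a ray,
    sample_features: [N, D] features the feature field predicts at each 3D sample,
    clip_target: [D] CLIP feature extracted from the 2D photo at this pixel."""
    rendered = (weights.unsqueeze(-1) * sample_features).sum(dim=0)  # composite 3D features into 2D
    return torch.nn.functional.mse_loss(rendered, clip_target)       # pull the render toward CLIP

# Toy example with random tensors
N, D = 64, 512
weights = torch.softmax(torch.randn(N), dim=0)
sample_features = torch.randn(N, D, requires_grad=True)
loss = distill_loss(weights, sample_features, torch.randn(D))
loss.backward()
print(loss.item())
```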
Leaving things open
After receiving a few demonstrations, the robot applies its knowledge of geometry and semantics to grasp objects it has never encountered before. When a user submits a text query, the robot searches through the space of possible grasps to identify those most likely to succeed in picking up the object the user has requested. Each candidate is scored on its relevance to the prompt, its similarity to the demonstrations the robot has been trained on, and whether it causes any collisions. The highest-scoring grasp is then selected and executed.
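That ranking loop can be summarized in a short, hypothetical sketch: discard candidates that collide, score the rest by feature similarity to the language prompt and to the demonstrations, and execute the best one. Everything here (the additive score, the `in_collision` callback, the embedding sizes) is an illustrative assumption rather than the system’s actual scoring rule.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def rank_grasps(candidates, text_embedding, demo_embeddings, in_collision):
    """candidates: list of (grasp_pose, local_feature); returns the best collision-free pose."""
    best_pose, best_score = None, -np.inf
    for pose, feat in candidates:
        if in_collision(pose):                                    # reject colliding grasps
            continue
        relevance = cosine(feat, text_embedding)                  # match to the text prompt
        demo_sim = max(cosine(feat, d) for d in demo_embeddings)  # match to demonstrated grasps
        score = relevance + demo_sim
        if score > best_score:
            best_pose, best_score = pose, score
    return best_pose

# Toy usage with random embeddings and a no-op collision check
rng = np.random.default_rng(0)
candidates = [(rng.normal(size=7), rng.normal(size=512)) for _ in range(100)]
best = rank_grasps(candidates, rng.normal(size=512), [rng.normal(size=512)],
                   in_collision=lambda pose: False)
print(best)
```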
To demonstrate the system’s ability to interpret open-ended requests from humans, the researchers prompted the robot to pick up Baymax, a character from Disney’s “Big Hero 6.” While F3RM had never been directly trained to pick up a toy of the cartoon superhero, the robot used its spatial awareness and vision-language features from the foundation models to decide which object to grasp and how to pick it up.
F3RM also enables users to specify which object they want the robot to handle at different levels of linguistic detail. For example, if there is a metal mug and a glass mug, the user can ask the robot for the “glass mug.” If the robot sees two glass mugs, one filled with coffee and the other with juice, the user can ask for the “glass coffee mug.” The foundation model features embedded within the feature field enable this level of open-ended understanding.
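To get a feel for how prompts of different specificity can be compared against a feature field, the sketch below encodes two queries with an off-the-shelf CLIP text encoder (via Hugging Face Transformers, which downloads pretrained weights on first run) and computes cosine similarities against stand-in object features. In the real system those features would be queried from the 3D feature field; here they are random placeholders, so the numbers are meaningless and only the plumbing is shown.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_model = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")

queries = ["a glass mug", "a glass coffee mug"]
inputs = tokenizer(queries, padding=True, return_tensors="pt")
with torch.no_grad():
    text_embeds = text_model(**inputs).text_embeds                 # [2, 512] text features
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)

# Placeholder "object" features; in practice these would come from the 3D feature field
object_feats = torch.nn.functional.normalize(torch.randn(2, 512), dim=-1)
print(object_feats @ text_embeds.T)                                # cosine similarity matrix
```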
“If I showed a person how to lift a cup by the lip, they could easily transfer that knowledge to lift objects with similar geometries, such as bowls, measuring beakers, or even rolls of tape. For robots, achieving this level of adaptability has been quite a challenge,” says William Shen, an MIT graduate student, CSAIL collaborator, and co-author. “F3RM combines geometric understanding with semantics from foundation models trained on web-scale data to enable this level of aggressive generalization from just a small number of demonstrations.”
Shen and Yang wrote the paper under Isola’s supervision, with co-authors including MIT professor and CSAIL principal investigator Leslie Pack Kaelbling and undergraduate students Alan Yu and Jansen Wong. The team was supported, in part, by Amazon.com Services, the National Science Foundation AI Institute for Artificial Intelligence and Fundamental Interactions, the Air Force Office of Scientific Research, the Office of Naval Research’s Multidisciplinary University Initiative, the Army Research Office, the MIT-IBM Watson AI Lab, and the MIT Quest for Intelligence. Their work will be presented at the 2023 Conference on Robot Learning (CoRL).