One day you might want your home robot to take your muddy clothes downstairs and throw them into the washing machine in the left corner of your basement. The robot will need to combine your instructions with its visual observations to determine the steps it should take to complete this task.
With an AI agent, this is easier said than done. Current approaches often use multiple hand-crafted machine learning models to tackle different parts of the task, which take a great deal of human effort and expertise to build. These methods, which use visual representations to directly make navigation decisions, demand massive amounts of visual training data, which are often hard to come by.
To overcome these challenges, researchers at MIT and the MIT-IBM Watson AI Lab have developed a navigation method that converts visual representations into pieces of language, which are then fed into one large language model that handles all parts of the multi-step navigation task.
Instead of encoding visual features from images of the robot’s surroundings as visual representations, which is computationally intensive, their method creates text captions that describe the robot’s point of view. A large language model then uses those captions to predict the actions the robot should take to fulfill a user’s language-based instructions.
Because their method uses purely language-based representations, they can use a large language model to efficiently generate a huge amount of synthetic training data.
Although this approach does not outperform techniques that use visual features, it performs well in situations where there is not enough visual data for training. The researchers also found that combining language-based inputs with visual signals leads to better navigation performance.
“By using only language as the perceptual representation, our approach is simpler. Because all the input can be encoded in language, we can generate a trajectory that a human can understand,” says Bowen Pan, an electrical engineering and computer science (EECS) graduate student and lead author of a paper on this approach.
Pan’s co-authors include his advisor, Aude Oliva, director of strategic industry engagement at the MIT Schwarzman College of Computing, director of the MIT-IBM Watson AI Lab, and a senior research scientist at the Computer Science and Artificial Intelligence Laboratory (CSAIL); Philip Isola, an associate professor of EECS and a member of CSAIL; senior author Yoon Kim, an assistant professor of EECS and a member of CSAIL; and others at the MIT-IBM Watson AI Lab and Dartmouth College. The research will be presented at the Conference of the North American Chapter of the Association for Computational Linguistics.
Solving vision problems with language
Because large language models are the most capable machine learning models available, researchers have sought to incorporate them into the complex task known as vision-and-language navigation, Pan says.
However, such models take text-based inputs and cannot process visual data from a robot’s camera, so the team had to find a way to use language instead.
Their technique uses a simple captioning model to obtain text descriptions of the robot’s visual observations. These captions are combined with the language-based instructions and fed into a large language model, which decides what navigation step the robot should take next.
The large language model then outputs a caption of the scene the robot should see after completing that step. This is used to update the trajectory history so the robot can keep track of where it has been.
The model repeats these processes to generate a trajectory that guides the robot to its goal, step by step.
For example, the caption might read: “On the left, at a 30-degree angle, there is a door with a potted plant next to it, behind you is a small office with a desk and a computer,” etc. The model chooses whether the robot should move towards the door or office.
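The article does not spell out the exact prompt format or action space, so the sketch below is a minimal, hypothetical version of that caption-and-predict loop. Here, caption_view and query_llm are placeholder functions standing in for a real captioning model and large language model.

```python
# Minimal sketch of the caption-then-predict navigation loop described above.
# caption_view() and query_llm() are hypothetical stand-ins, not the authors' code.

def caption_view(observation: str) -> str:
    """Placeholder for an off-the-shelf captioning model."""
    return observation  # assume the observation is already a text description

def query_llm(prompt: str) -> str:
    """Placeholder for a call to a large language model."""
    return "move toward the door"  # stubbed action for illustration

def navigate(instruction: str, observations: list[str], max_steps: int = 10) -> list[str]:
    history: list[str] = []   # text record of past views and actions
    actions: list[str] = []
    for step, obs in enumerate(observations[:max_steps]):
        caption = caption_view(obs)            # visual observation -> text caption
        prompt = (
            f"Instruction: {instruction}\n"
            f"Trajectory so far: {' | '.join(history) or 'none'}\n"
            f"Current view: {caption}\n"
            "Next action:"
        )
        action = query_llm(prompt)             # the LLM picks the next navigation step
        actions.append(action)
        history.append(f"step {step}: saw '{caption}', did '{action}'")
    return actions

print(navigate(
    "Go to the small office with the desk and computer",
    ["a door with a potted plant on the left, a small office behind you"],
))
```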
“One of the biggest challenges was finding a way to encode this type of information into the language in the right way so that the agent understood what the task was and how it should react,” says Pan.
Advantages of language
When they tested this approach, although it was no more effective than vision-based techniques, they found that it had several advantages.
First, because text requires fewer computational resources to synthesize than complex image data, their method can be used to rapidly generate synthetic training data. In one test, they generated 10,000 synthetic trajectories based on 10 real-world, visual trajectories.
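The article does not describe how those synthetic trajectories were produced, so the following is only a rough illustration of why text is cheap to expand: a placeholder function (standing in for an LLM call) varies caption sequences to turn a few real trajectories into many synthetic ones. All names here are hypothetical.

```python
# Rough sketch: expanding a handful of real, text-described trajectories into
# many synthetic variants. paraphrase_with_llm() is a placeholder where an LLM
# would rewrite or recombine the captions in practice.
import random

def paraphrase_with_llm(trajectory: list[str], seed: int) -> list[str]:
    """Placeholder for LLM-based rewriting of each caption in a trajectory."""
    rng = random.Random(seed)
    variants = ["a hallway", "a corridor", "a passage"]
    return [step.replace("a hallway", rng.choice(variants)) for step in trajectory]

real_trajectories = [
    ["a hallway with a door on the left", "a small office with a desk and computer"],
]

synthetic = [
    paraphrase_with_llm(traj, seed)
    for traj in real_trajectories
    for seed in range(1000)   # e.g., 1,000 text variants per real trajectory
]
print(len(synthetic), "synthetic trajectories from", len(real_trajectories), "real ones")
```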
This technique can also help bridge the gap that often prevents an agent trained in a simulated environment from performing well in the real world. This gap arises because computer-generated images can look quite different from real-world scenes due to elements such as lighting or color. But language describing a synthetic image versus a real one would be much harder to tell apart, Pan says.
Additionally, the representations their model uses are easier for humans to understand because they are written in natural language.
“If an agent fails to achieve its goal, we can more easily determine where and why it failed. Perhaps the historical information is not clear enough or the observations ignore some important details,” Pan says.
Additionally, their method can be applied more easily to different tasks and environments because it uses only one type of input. As long as data can be encoded as language, they can use the same model without making any modifications.
One drawback is that their method naturally loses some information that would be captured by vision-based models, such as depth information.
However, researchers were surprised to see that combining language-based representations with vision-based methods improved the agent’s ability to navigate.
“Perhaps this means that language can capture some higher-level information that cannot be captured with purely visual features,” he says.
This is one area the researchers want to explore further. They also want to develop a navigation-oriented captioner that could boost the method’s performance. Additionally, they want to probe the ability of large language models to exhibit spatial awareness and see how this could aid language-based navigation.
This research is funded in part by the MIT-IBM Watson AI Lab.