Could AI tell you where you left your keys?

A car factory worker remembers the container where he left a partially assembled component the night before and quickly returns to that location to pick it up. However, robots that could work side by side with him would have difficulty building and accessing the same kind of “space-time” memory.

Now, MIT researchers have developed a long-term memory structure that enables robots to quickly create and recall a detailed mental model of intricate, large-scale environments.

In the future, this improvement could allow a factory worker to send a robot assistant to fetch an item by simply asking it to “go and grab the component we started assembling last night.”

This new method combines advanced map representations with rich descriptions of the environment that the robot collects as it moves around over long periods of time. The robot can quickly access this memory to answer complex queries about its environment in plain language.

This memory structure, which answers questions more accurately than state-of-the-art methods, runs fast enough for a mobile robot to use it in real time.

In addition to potential applications in robotics, the method could be used in augmented reality systems that help maintenance workers detect anomalies or help commuters find their way.

“If we want robots to work side by side with humans and interact better with humans, they need to speak the same language. The robot must be able to reason in time and space in the same way as humans. That’s essentially what our method achieves. It turns a traditional map into a language-based map that the robot can think about and access more easily using language,” says Luca Carlone, associate professor in MIT’s Department of Aeronautics and Astronautics (AeroAstro), principal investigator at the Laboratory for Information and Decision Systems (LIDS) and director of the MIT SPARK Laboratory.

He is joined on the paper by lead author Nicolas Gorlo, an MIT graduate; and Lukas Schmid, a former research fellow at MIT who is now a professor at the Technical University of Nuremberg in Germany. The study’s results were recently presented at the Computer Vision and Pattern Recognition (CVPR) conference.

Space-time memory

Memory allows an artificial intelligence system, such as a chatbot, to answer complex questions and reason about previous interactions with the user.

“We want to design a new type of memory, spatiotemporal memory, that will enable an AI-based robot to remember real interactions and observations from sensors. Similar to ChatGPT, but embedded in the real world and able to answer any question about the environment, such as ‘Where did I leave my wallet?’” says Carlone.

To develop this memory framework, MIT researchers combined two fields of work: computer vision and robotic mapping.

Multimodal computer vision models can understand and describe objects in a scene in detail, but often only process one annotation at a time. On the other hand, robotic mapping platforms create 3D maps of an environment, such as an entire apartment or a university campus, but they usually lack detailed descriptions of objects or are computationally expensive.

A method created by MIT researchers, called Describe Anything, Anywhere, Anytime (DAAAM), takes the best of both approaches.

Using DAAAM, the robot, as it traverses its surroundings, attaches rich descriptions to the objects it sees. For example, a robot might notice that a particular building on the MIT campus is called the Stata Center and has a certain architecture, or that a bike rack holds five bikes and the red one has a flat tire.

It stores this detailed information in a 3D map-based form that is spatially arranged so that objects are grouped into separate regions. This way, the robot will remember that the red bike with a flat tire is on the bike rack in front of the Stata Center.
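The idea of grouping described objects into regions can be illustrated with a small Python sketch. It is only a toy under stated assumptions: the class names, field names, coordinates, and the region label are hypothetical and are not taken from DAAAM, whose actual map representation is not described here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectRecord:
    """One described object (all field names are hypothetical)."""
    label: str
    description: str
    position: tuple[float, float, float]  # made-up map coordinates
    region: str                           # region the object is grouped into
    seen_at: float                        # observation time in seconds


class SpatialMemory:
    """Toy memory that groups object records by region."""

    def __init__(self) -> None:
        self._by_region: dict[str, list[ObjectRecord]] = defaultdict(list)

    def add(self, record: ObjectRecord) -> None:
        self._by_region[record.region].append(record)

    def in_region(self, region: str) -> list[ObjectRecord]:
        return list(self._by_region.get(region, []))

    def where_is(self, keyword: str) -> list[tuple[str, ObjectRecord]]:
        """Return (region, record) pairs whose description mentions keyword."""
        kw = keyword.lower()
        return [
            (region, rec)
            for region, records in self._by_region.items()
            for rec in records
            if kw in rec.description.lower()
        ]


memory = SpatialMemory()
memory.add(ObjectRecord("bike rack", "bike rack holding five bikes",
                        (12.0, 4.0, 0.0), "Stata Center entrance", 100.0))
memory.add(ObjectRecord("bike", "red bike with a flat tire",
                        (12.5, 4.2, 0.0), "Stata Center entrance", 101.0))

hits = memory.where_is("flat tire")
print(hits[0][0])  # prints: Stata Center entrance
```

Because each record carries its region, a question about the red bike can be answered with a place a person would recognize rather than with raw coordinates.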

However, existing techniques to capture such rich descriptions typically require several seconds to annotate several objects. This is too slow to provide real-time performance, as the robot can see hundreds of objects in a few minutes of exploration.

“The faster the robot can create this spatial memory, the more efficiently it will perform activities in the environment,” adds Carlone.

Process improvement

To speed up performance, DAAAM aggregates nearby objects as it travels and uses an optimization method to select key frames for annotation. These are the images with the clearest view of multiple objects, allowing the system to accurately describe several objects in parallel and speeding up calculations tenfold.
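The article does not say which optimization method is used, so the following Python sketch swaps in a simple greedy heuristic purely for illustration. It assumes each frame comes with a made-up "clarity" score per visible object, and it repeatedly picks the frame that adds the most clarity for objects that have not been covered yet. The function name, frame ids, and scores are all hypothetical.

```python
def select_keyframes(frames: dict[str, dict[str, float]]) -> list[str]:
    """Greedily pick frames until every object appears in a chosen frame.

    frames maps a frame id to {object id: clarity score}.
    The scores and the greedy rule are illustrative assumptions,
    not the optimization method used by DAAAM.
    """
    uncovered = {obj for views in frames.values() for obj in views}
    chosen: list[str] = []

    def gain(frame_id: str) -> float:
        # Total clarity over the objects this frame would newly cover.
        return sum(score for obj, score in frames[frame_id].items()
                   if obj in uncovered)

    while uncovered:
        # Only consider frames that show at least one uncovered object,
        # so each round makes progress even if some scores are zero.
        candidates = [f for f in sorted(frames)
                      if any(obj in uncovered for obj in frames[f])]
        best = max(candidates, key=gain)
        chosen.append(best)
        uncovered.difference_update(frames[best])
    return chosen


frames = {
    "f1": {"bike": 0.9, "rack": 0.8, "sign": 0.7},
    "f2": {"bike": 0.5, "rack": 0.4},
    "f3": {"sign": 0.9, "door": 0.6},
    "f4": {"door": 0.3},
}
print(select_keyframes(frames))  # prints: ['f1', 'f3']
```

In this toy input, two of the four frames are enough to show every object, so only those two would need to be sent to a description model.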

As the robot explores space, it attaches each batch of annotations to multiple objects at a specific location on the 3D map.

“We annotate each object only once, so our framework can operate in very large-scale environments in real time. And by grouping objects into regions, it can answer a wide range of queries about objects and locations in the environment,” explains Gorlo.

Once the system has built this spatial memory, it must efficiently extract information from a huge database of objects and descriptions.

To make this possible, the researchers used a large language model (LLM) that draws on a variety of tools to quickly retrieve specific information in a way that reduces hallucinations. This allows DAAAM to accurately respond to a user’s query in just a few seconds.

For example, if someone asks the robot about a particular sculpture it saw near a building on the MIT campus, DAAAM could use a semantic search tool to retrieve information based on the word “sculpture” or another tool to retrieve information based on the building’s location.
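A rough Python sketch of how two such retrieval tools might be combined is shown below. Everything in it is an assumption made for illustration: the memory entries are invented, a plain keyword match stands in for real semantic search, and the tool calls are hard-coded where the actual system would have a language model choose them.

```python
import math

# Toy memory entries; every value below is made up for illustration.
MEMORY = [
    {"label": "sculpture", "description": "large metal sculpture", "xy": (10.0, 5.0)},
    {"label": "bench", "description": "wooden bench", "xy": (11.0, 6.0)},
    {"label": "sculpture", "description": "small stone sculpture", "xy": (200.0, 80.0)},
]


def search_by_text(term: str) -> list[dict]:
    """Keyword match standing in for a real semantic search."""
    term = term.lower()
    return [e for e in MEMORY if term in e["description"].lower()]


def search_by_location(center: tuple[float, float], radius: float) -> list[dict]:
    """Return entries within radius (in map units) of center."""
    return [e for e in MEMORY if math.dist(e["xy"], center) <= radius]


def answer(term: str, center: tuple[float, float], radius: float) -> list[str]:
    """Combine both tools: keep text matches that are also near the location."""
    nearby = search_by_location(center, radius)
    return [e["description"] for e in search_by_text(term) if e in nearby]


print(answer("sculpture", center=(12.0, 5.0), radius=10.0))
# prints: ['large metal sculpture']
```

Restricting the answer to entries that the tools actually returned is one simple way to keep a response grounded in stored observations.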

When tested against other methods, DAAAM was 21 to 53 percent more accurate, depending on the question type.

In the future, researchers want to expand DAAAM so that the system can record significant events that occur in the environment. They are also working to incorporate trust levels into the system’s responses.

“Ultimately, we want to have robots that can help with any task. With this structure, we’re trying to lay the groundwork to enable a universal agent that can do anything you ask,” Gorlo says.

This research was funded in part by the U.S. Army Research Laboratory and the Office of Naval Research. Carlone is currently on sabbatical as an Amazon Fellow; this article describes work performed at MIT and is not affiliated with Amazon.
