Monday, December 23, 2024

Multiple AI models help robots execute complex plans more transparently


Your daily to-do list is probably pretty straightforward: wash the dishes, buy groceries, and other odds and ends. It’s unlikely that you’ve written down “pick up the first dirty plate” or “sponge this plate,” because each of these miniature steps within a chore feels intuitive. While we can routinely perform each step without much thought, a robot requires an intricate plan that spells out these finer-grained steps.

The Improbable AI Lab at MIT, a group within the Computer Science and Artificial Intelligence Laboratory (CSAIL), has given these machines a helping hand with a novel, multimodal framework: Compositional Foundation Models for Hierarchical Planning (HiP), which develops detailed, actionable plans using the expertise of three different foundation models. Like OpenAI’s GPT-4, the foundation model on which ChatGPT and Bing Chat are built, these foundation models are trained on massive amounts of data for applications such as image generation, text translation, and robotics.

Unlike RT-2 and other multimodal models that are trained on paired vision, language, and action data, HiP uses three different foundation models, each trained on a different data modality. Each foundation model captures a different part of the decision-making process and then works with the others when it comes time to make a decision. HiP eliminates the need for paired vision, language, and action data, which is difficult to obtain. It also makes the reasoning process more transparent.

What is considered a daily task for a human might be a robot’s “long-term goal”: an overarching objective that involves completing many smaller steps first, and one that requires sufficient data to plan, understand, and execute subgoals. While some researchers have tried to build monolithic foundation models for this problem, combining language, visual, and action data is expensive. Instead, HiP represents a different, multimodal recipe: a trio of models that cheaply incorporates linguistic, physical, and environmental intelligence into a robot.

“The underlying models don’t have to be monolithic,” says Jim Fan, an NVIDIA AI researcher who wasn’t involved in the paper. “This work breaks down the complex task of embodied agent planning into three component models: a linguistic reasoner, a visual world model, and an action planner. It makes the difficult problem of decision-making more tractable and transparent.”

The team believes their system could help these machines perform household chores, such as putting away a book or placing a bowl in the dishwasher. HiP could also help with multi-step construction and manufacturing tasks, such as stacking and placing different materials in specific sequences.

Evaluating HiP

The CSAIL team tested HiP’s performance on three manipulation tasks, where it outperformed comparable frameworks. The system reasoned through each task, developing intelligent plans that adapt to new information.

First, the researchers asked it to stack blocks of different colors on top of each other and then place others nearby. The catch: some of the correct colors weren’t present, so the robot had to place white blocks in a bowl of paint to color them. HiP often adapted to these changes accurately, especially compared with state-of-the-art task-planning systems like Transformer BC and Action Diffuser, adjusting its plans to stack and place each block as needed.

Another test: arranging objects such as candy and a hammer in a brown box while ignoring other items. Some of the objects it had to move were dirty, so HiP adjusted its plans to place them in a cleaning box first and then in the brown container. In a third demonstration, the robot was able to ignore unnecessary objects to complete kitchen subgoals such as opening the microwave, moving the kettle out of the way, and turning on the light. Some of the steps it was shown had already been completed, so the robot adapted by skipping those instructions.

Three-level hierarchy

HiP’s three-part planning process works as a hierarchy, with the ability to pre-train each of its components on different data sets, including data from outside robotics. At the bottom of this hierarchy is a large language model (LLM), which starts generating ideas by capturing all the necessary symbolic information and developing an abstract task plan. Using the common-sense knowledge it finds on the internet, the model breaks its goal down into subgoals. For example, “make tea” becomes “fill the kettle with water,” “boil the kettle,” and the subsequent actions required.
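
To make this first stage concrete, here is a minimal Python sketch of how a goal might be decomposed into subgoals by prompting an off-the-shelf LLM. The prompt wording, the query_llm placeholder, and the parsing are illustrative assumptions, not the team’s actual interface.

```python
# Illustrative sketch only: an LLM turns a long-term goal into ordered
# subgoals. `query_llm` is a placeholder for any text-completion API and is
# an assumption, not part of the published HiP code.

def query_llm(prompt: str) -> str:
    """Placeholder: call a pretrained language model and return its text."""
    raise NotImplementedError("connect this to an LLM of your choice")

def decompose_goal(goal: str) -> list[str]:
    """Ask the LLM to break a high-level goal into short, ordered subgoals."""
    prompt = (
        "Break the following task into short, ordered subgoals, one per line.\n"
        f"Task: {goal}\n"
        "Subgoals:"
    )
    response = query_llm(prompt)
    # Keep non-empty lines, stripping any list numbering the model adds.
    return [line.strip(" -.0123456789").strip()
            for line in response.splitlines()
            if line.strip()]

# Example from the article: decompose_goal("make tea") might return
# ["fill the kettle with water", "boil the kettle", ...]
```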

“All we want to do is take existing, pre-trained models and make them communicate effectively with each other,” says Anurag Ajay, a doctoral candidate in MIT’s Department of Electrical Engineering and Computer Science (EECS) and a member of CSAIL. “Instead of pushing for one model to do everything, we’re combining multiple models that use different modalities of internet data. When used together, they help robots make decisions and could potentially help with tasks in homes, factories, and construction sites.”

These models also need some form of “eyes” to understand the environment they’re operating in and to correctly execute each subgoal. The team used a large video diffusion model to augment the initial planning done by the LLM; the video model collects geometric and physical information about the world from online video footage. In turn, it generates an observation-trajectory plan, refining the LLM’s outline to incorporate this new physical knowledge.
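
The interface between this visual stage and the rest of the pipeline might look roughly like the sketch below, where a pretrained video model is sampled for a short sequence of predicted frames given the current camera image and a subgoal. The VideoDiffusionModel class and its methods are hypothetical stand-ins, not the model the researchers trained.

```python
# Hypothetical interface for the visual stage: given the current observation
# and a textual subgoal, a pretrained video model proposes a short sequence
# of future frames (an "observation trajectory"). The class below is a
# stand-in so the sketch runs; a real model would denoise actual video.

import numpy as np

class VideoDiffusionModel:
    """Stand-in for a pretrained, text-conditioned video diffusion model."""

    def sample(self, first_frame: np.ndarray, subgoal: str,
               horizon: int = 8) -> list[np.ndarray]:
        """Return `horizon` predicted frames starting from `first_frame`."""
        # Placeholder behavior: repeat the current frame.
        return [first_frame.copy() for _ in range(horizon)]

    def likelihood(self, first_frame: np.ndarray, subgoal: str) -> float:
        """Rough score of how physically plausible the subgoal is here."""
        return 0.0  # placeholder

def plan_observations(model: VideoDiffusionModel,
                      frame: np.ndarray, subgoal: str) -> list[np.ndarray]:
    """Turn one textual subgoal into a visual plan: a sequence of frames."""
    return model.sample(frame, subgoal)
```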

This process, known as iterative refinement, allows HiP to reason about its ideas, taking in feedback at each stage to generate a more workable outline. The feedback flow is similar to writing a paper: an author sends a draft to an editor, incorporates the suggested revisions, and the editor then reviews the final changes and signs off.
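
Read as pseudocode, that feedback loop might be wired up roughly as follows: the language model’s subgoal proposals are re-scored by the video model, and only the most physically plausible candidate survives each round. This is one simplified interpretation of the description above, not the paper’s exact algorithm, and both helper functions are assumptions.

```python
# Simplified sketch of iterative refinement: the video model's feedback
# (a plausibility score) steers which of the LLM's subgoal proposals is kept.
# `propose_subgoals` is assumed to *sample* from the LLM, so each round can
# surface fresh candidates. This is an interpretation, not the exact method.

def refine_subgoal(goal, frame, propose_subgoals, video_score, num_rounds=3):
    """Return the LLM-proposed subgoal the video model scores as most plausible."""
    best_subgoal, best_score = None, float("-inf")
    for _ in range(num_rounds):
        for subgoal in propose_subgoals(goal):      # language-level proposals
            score = video_score(frame, subgoal)     # physical-level feedback
            if score > best_score:
                best_subgoal, best_score = subgoal, score
    return best_subgoal
```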

At the top of the hierarchy is an egocentric action model, which uses a sequence of first-person images to infer which actions to take based on the environment. At this stage, the observation plan from the video model is mapped onto the space visible to the robot, helping the machine decide how to complete each step within the long-term goal. If the robot uses HiP to make tea, this means it has accurately mapped out where the pot, sink, and other key visual elements are, and can begin completing each subgoal.
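
Finally, this action stage can be pictured as glue that walks along the predicted frames and asks an action model which command turns each observation into the next one, as in the short sketch below. The action_model and robot objects are hypothetical placeholders used purely for illustration.

```python
# Hypothetical glue for the action stage: step through the predicted frame
# sequence and ask an action model which command takes the robot from each
# egocentric observation to the next. `action_model` and `robot` are
# placeholders, not part of the released system.

def execute_visual_plan(frames, action_model, robot):
    """Convert a sequence of predicted observations into executed actions."""
    for current_frame, next_frame in zip(frames, frames[1:]):
        # Infer the action whose outcome should look like `next_frame`.
        action = action_model.infer_action(current_frame, next_frame)
        robot.execute(action)
```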

However, multimodal work like this is limited by the lack of high-quality video foundation models. Once such models are available, they could interface with HiP’s small-scale video models to further improve visual sequence prediction and robot action generation. A higher-quality version would also reduce the current data requirements of the video models.

That said, the CSAIL team’s approach used only a small amount of data. What’s more, HiP was inexpensive to train and demonstrated the potential of using readily available foundation models to complete long-term tasks. “Anurag showed that this is a proof of concept for how we can take models trained on separate tasks and data modalities and combine them into models for robotics planning. In the future, HiP could be extended with pre-trained models that can process touch and sound to make better plans,” says senior author Pulkit Agrawal, an MIT assistant professor in EECS and director of the Improbable AI Lab. The group is also considering applying HiP to solving real-world long-term tasks in robotics.

Ajay and Agrawal are lead authors of a paper describing the work. They were joined by MIT professors and CSAIL principal investigators Tommi Jaakkola, Joshua Tenenbaum, and Leslie Pack Kaelbling; CSAIL research affiliate and MIT-IBM Watson AI Lab research manager Akash Srivastava; graduate students Seungwook Han and Yilun Du ’19; former postdoc Abhishek Gupta, who is now an assistant professor at the University of Washington; and former graduate student Shuang Li PhD ’23.

The team’s work was supported in part by the National Science Foundation, the U.S. Defense Advanced Research Projects Agency, the U.S. Army Research Office, the U.S. Office of Naval Research Multidisciplinary University Research Initiatives, and the MIT-IBM Watson AI Lab. Their findings were presented at the 2023 Conference on Neural Information Processing Systems (NeurIPS).
