MIT researchers have developed a generative AI approach to long-horizon planning for visual tasks, such as robot navigation, that is roughly twice as effective as some existing techniques.
Their method uses a specialized vision-language model to perceive the scene in an image and simulate the actions needed to achieve a goal. A second model then translates those simulations into a standard planning language and refines the solution.
Finally, the system automatically generates a set of files that can be fed into classical planning software, which computes a plan to achieve the goal. This two-step system produced plans with an average success rate of about 70 percent, outperforming the best baseline methods, which reached only about 30 percent.
Importantly, the system can solve novel problems it has not encountered before, making it well suited to real-world environments where conditions can change at any time.
“Our framework combines the strengths of vision-language models, such as their ability to understand images, with the power of formal planners,” says Yilun Hao, a graduate student in aeronautics and astronautics (AeroAstro) at MIT and lead author of a paper on this technique. “It can take a single image, reason about it through simulation, and then produce a robust, long-horizon plan that could be useful in many real-world applications.”
Hao is joined on the paper by Yongchao Chen, a graduate student at MIT’s Laboratory for Information and Decision Systems (LIDS); Chuchu Fan, an associate professor in AeroAstro and principal investigator at LIDS; and Yang Zhang, a research associate at the MIT-IBM Watson AI Lab. The paper will be presented at the International Conference on Learning Representations.
Tackling visual tasks
Over the past few years, Fan and her colleagues have been exploring how generative AI models can perform intricate reasoning and planning, often using large language models (LLMs) to process text inputs.
But many real-world planning problems, such as robot assembly and autonomous driving, involve visual inputs that LLMs cannot handle on their own. Researchers have sought to extend these techniques into the visual domain using vision-language models (VLMs), powerful AI systems that can process both images and text.
However, VLMs struggle to understand the spatial relationships between objects in a scene and often fail to reason correctly across multiple steps, which makes it difficult to apply them to long-horizon planning.
On the other hand, researchers have developed robust formal planners that can generate effective long-term plans for intricate situations. But these software systems cannot process visual data, and they require specialized expertise to encode a problem in a language the solver can understand.
Fan and her team created an automated planning framework that leverages the best of both approaches. The system, called VLM-driven formal planning (VLMFP), uses two specialized VLMs that work together to turn a visual planning problem into ready-to-use files for formal planning software.
The researchers first carefully trained a smaller model, which they call SimVLM, to specialize in describing the scene in an image in natural language and simulating sequences of actions in that scene. Then a much larger model, which they call GenVLM, uses SimVLM’s description to generate an initial set of files in a formal planning language known as the Planning Domain Definition Language (PDDL).
These files can be fed into a classical PDDL solver, which computes a step-by-step plan for solving the problem. GenVLM compares the solver’s results with SimVLM’s simulations and iteratively refines the PDDL files.
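In rough pseudocode terms, that generate–solve–simulate loop might look like the following toy sketch. All of the function names here (`sim_vlm`, `gen_vlm`, `pddl_solve`, `sim_vlm_check`) are illustrative stand-ins, not the authors' actual API; the stubs only mimic the control flow described above.

```python
# Toy sketch of VLMFP's refinement loop -- stand-in functions only,
# not the authors' implementation.

def sim_vlm(image):
    # Stand-in for SimVLM: describe the scene and goal in natural language.
    return {"scene": "blocks a and b on a table", "goal": "stack a on b"}

def gen_vlm(description, feedback=None):
    # Stand-in for GenVLM: draft PDDL files, revising them when given feedback.
    # The first draft deliberately omits the 'pickup' action to force one repair.
    actions = ["stack"] if feedback is None else ["pickup", "stack"]
    domain = {"actions": actions}
    problem = {"goal": description["goal"]}
    return domain, problem

def pddl_solve(domain, problem):
    # Stand-in for a classical PDDL solver: succeeds only with a complete domain.
    if {"pickup", "stack"} <= set(domain["actions"]):
        return ["pickup a", "stack a b"]
    return None

def sim_vlm_check(description, plan):
    # Stand-in for SimVLM replaying the plan to verify the goal is reached.
    return plan[-1] == "stack a b"

def vlmfp_plan(image, max_iters=5):
    description = sim_vlm(image)
    feedback = None
    for _ in range(max_iters):
        domain, problem = gen_vlm(description, feedback)  # draft PDDL files
        plan = pddl_solve(domain, problem)                # run the solver
        if plan is not None and sim_vlm_check(description, plan):
            return plan                                   # simulation confirms the goal
        feedback = "no plan found" if plan is None else "plan failed in simulation"
    return None
```

In this toy run, the loop converges on the second iteration, once the solver's failure is fed back to the generator.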
“The generator and the simulator work together to achieve the exact same result, which is an action simulation that achieves the goal,” Hao says.
Because GenVLM is a large generative AI model, it saw many examples of PDDL during training and learned how this formal language can be used to solve a wide range of problems. That prior knowledge enables the model to generate accurate PDDL files.
Adaptable approach
VLMFP generates two separate PDDL files. The first is the domain file, which defines the environment, the valid actions, and the rules of the domain. The second is the problem file, which defines the initial state and the goal of a specific problem instance.
“One of the advantages of PDDL is that the domain file is the same for all instances in that environment. This allows our framework to generalize well to unseen instances in the same domain,” explains Hao.
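That split can be illustrated with a toy example (invented for illustration, not taken from the paper): a single domain file describes how a hypothetical grid world works, while each new instance only needs its own problem file.

```python
# Illustrative only: a toy PDDL domain/problem pair, not from the paper.
# The domain file is written once per environment; each new instance needs
# only a fresh problem file with its own initial state and goal.

DOMAIN = """(define (domain gridworld)
  (:predicates (at ?r ?c) (adjacent ?c1 ?c2))
  (:action move
    :parameters (?r ?c1 ?c2)
    :precondition (and (at ?r ?c1) (adjacent ?c1 ?c2))
    :effect (and (at ?r ?c2) (not (at ?r ?c1)))))"""

def make_problem(name, start, goal):
    # Only the initial state and goal vary from instance to instance.
    return f"""(define (problem {name}) (:domain gridworld)
  (:objects robot c1 c2 c3)
  (:init (at robot {start}) (adjacent c1 c2) (adjacent c2 c3))
  (:goal (at robot {goal})))"""

# Two different instances in the same environment reuse the one DOMAIN file.
problem_a = make_problem("instance-a", "c1", "c3")
problem_b = make_problem("instance-b", "c2", "c1")
```

Because the domain file never changes within an environment, a solver given `DOMAIN` plus any new problem file can plan for instances the system has never seen.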
To enable the system to generalize well, the researchers had to carefully design SimVLM’s training data so the model learned to understand each problem and goal rather than memorizing patterns from particular scenarios. In testing, SimVLM correctly described the scene, simulated the actions, and detected whether the goal was achieved in about 85 percent of experiments.
Overall, VLMFP achieved a success rate of about 60 percent across six 2D planning tasks and over 80 percent on two 3D tasks, including multirobot collaboration and robotic assembly. It also generated correct plans for more than 50 percent of scenarios it had never seen before, significantly outperforming baseline methods.
“Our framework can generalize as the rules change in different situations. This gives our system the flexibility to solve many types of visual planning problems,” adds Fan.
In the future, the researchers want to enable VLMFP to handle more complex scenarios and to explore methods for identifying and mitigating hallucinations from the VLMs.
“In the long run, generative AI models could act as agents that use the right tools to solve much more complex problems. But what does it mean to have the right tools, and how should they be used? We still have a long way to go, but this work on visual planning is an important piece of the puzzle,” says Fan.
This work was funded in part by the MIT-IBM Watson AI Lab.
