For roboticists, one challenge towers above all others: generalization, the ability to create machines that can adapt to any environment or condition. Since the 1970s, the field has evolved from writing sophisticated programs to applying deep learning, teaching robots to learn directly from human behavior. But a critical bottleneck remains: data quality. To improve, robots must encounter scenarios that push the limits of their capabilities, operating at the edge of their mastery. This process traditionally requires human oversight, with operators carefully challenging robots to expand their skills. As robots become more sophisticated, this hands-on approach runs into a scaling problem: the demand for high-quality training data far outstrips humans' ability to provide it.
Now, a team of researchers at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) has developed a novel approach to robot training that could significantly accelerate the deployment of adaptable, intelligent machines in real-world environments. The new system, called "LucidSim," leverages recent advances in generative artificial intelligence and physics simulators to create diverse and realistic virtual training environments, helping robots reach expert-level performance on difficult tasks without any real-world data.
LucidSim combines physics simulation with generative artificial intelligence models, solving one of the most persistent challenges in robotics: transferring skills learned in simulation to the real world. “A fundamental challenge in robot learning has long been the ‘simulation-to-reality gap’ — the discrepancy between simulated training environments and the complex, unpredictable real world,” says MIT CSAIL postdoc Ge Yang, principal investigator of LucidSim. “Previous approaches often relied on depth sensors, which simplified the problem but missed key real-world complexities.”
The multipronged system combines several technologies. At its core, LucidSim uses large language models to generate structured descriptions of varied environments. These descriptions are then transformed into images using generative models. To ensure that these images reflect real-world physics, an underlying physics simulator guides the generation process.
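Conceptually, that generation step can be sketched in a few lines of Python. The snippet below is an illustration of the idea only, not the team's code: it assumes a depth-conditioned ControlNet diffusion model from the Hugging Face diffusers library as the generative model, and a depth rendering exported from the physics simulator (the model identifiers and file names are placeholders).

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Depth rendering exported from the physics simulator (placeholder path).
sim_depth_image = Image.open("sim_depth.png")

# A depth-conditioned ControlNet stands in for the generative image model.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# A structured scene description, as would come from the language model.
description = "a cracked concrete stairway outdoors at dusk, wet from recent rain"

# Conditioning on simulator depth keeps the generated scene consistent with the
# geometry the robot's body actually interacts with in simulation.
image = pipe(prompt=description, image=sim_depth_image).images[0]
image.save("training_frame.png")
```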
The birth of an idea: from burritos to breakthroughs
The inspiration for LucidSim came from an unexpected place: conversations outside Beantown Taqueria in Cambridge, Massachusetts. "We wanted to teach vision-enabled robots how to improve themselves using feedback from humans. But then we realized we didn't have a purely vision-based policy to begin with," says Alan Yu, an undergraduate electrical engineering and computer science (EECS) student at MIT and co-author of LucidSim. "We talked about it as we walked down the street, and then we stopped outside the taqueria for about half an hour. That's where we had our moment."
To prepare the data, the team generated realistic images by extracting depth maps, which carry geometric information, and semantic masks, which label different parts of the image, from the simulated scene. But they quickly realized that with such tight control over the composition of the image, the model would produce near-identical images from the same prompt. So they developed a way to source diverse text prompts from ChatGPT.
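As a rough illustration of that prompt-diversification step (assuming the OpenAI Python client; the model name and prompt wording here are illustrative, not taken from the paper):

```python
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY in the environment

def diverse_prompts(scene_type: str, n: int = 20) -> list[str]:
    """Ask the chat model for many varied one-line scene descriptions, so that
    repeated image generations don't collapse onto near-identical outputs."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                f"Write {n} distinct one-sentence descriptions of a {scene_type} "
                "environment for a legged robot. Vary materials, weather, lighting, "
                "and clutter. Return one description per line."
            ),
        }],
    )
    text = response.choices[0].message.content
    return [line.strip() for line in text.splitlines() if line.strip()]

prompts = diverse_prompts("stair climbing", n=50)
```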
This approach, however, produced only a single image at a time. To generate short, coherent videos that serve as tiny "experiences" for the robot, the researchers paired the image generation with another novel technique the team developed, called "Dreams In Motion." The system computes the movement of each pixel between frames, warping a single generated image into a short, multi-frame video. Dreams In Motion does this by taking into account the 3D geometry of the scene and the relative changes in the robot's perspective.
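The per-pixel warping idea can be sketched with standard pinhole-camera geometry: back-project each pixel using the simulator's depth, move it by the relative camera motion between frames, and re-project it into the next view. The code below is a hedged sketch of that computation, not the paper's implementation; the function name and interfaces are illustrative.

```python
import numpy as np

def reproject_flow(depth: np.ndarray, K: np.ndarray, R: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Compute where each pixel of the current frame lands in the next frame.

    depth: (H, W) metric depth rendered by the physics simulator
    K:     (3, 3) camera intrinsics
    R, t:  relative rotation (3, 3) and translation (3,) of the camera between frames
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # (3, H*W)

    # Back-project every pixel to a 3D point in the current camera frame.
    points = np.linalg.inv(K) @ pixels * depth.reshape(1, -1)

    # Move the points into the next camera frame, then project back to pixels.
    moved = K @ (R @ points + t.reshape(3, 1))
    uv_next = (moved[:2] / moved[2:]).T.reshape(H, W, 2)

    # uv_next[row, col] gives the target (u, v) for the pixel at (row, col);
    # resampling the generated image at these coordinates (e.g. with cv2.remap)
    # yields the next frame of the short video.
    return uv_next
```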
"We are outperforming domain randomization, a method developed in 2017 that applies random colors and patterns to objects in the environment and is still considered the go-to approach," Yu says. "While this technique generates diverse data, it lacks realism. LucidSim addresses both the diversity and the realism problems. It's exciting that even without seeing the real world during training, the robot can recognize and navigate obstacles in real environments."
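For contrast, classic domain randomization amounts to re-skinning the simulated scene with random appearance each episode. A toy sketch follows; the object interface is a placeholder for whatever simulator API is in use.

```python
import random

def randomize_scene(objects):
    """Assign every simulated object a random RGB color so a policy trained on
    many such episodes cannot overfit to any particular appearance."""
    for obj in objects:
        color = tuple(random.random() for _ in range(3))
        obj.set_color(color)  # placeholder for the simulator-specific call
```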
The team is particularly excited about applying LucidSim to domains beyond quadruped locomotion and parkour, their primary testbed. One example is mobile manipulation, where a mobile robot is tasked with handling objects in an open area and where color perception is also critical. "Today, these robots still learn from real-world demonstrations," Yang says. "While collecting demonstrations is easy, scaling a real-world robot teleoperation setup to thousands of skills is challenging because each scene must be physically prepared by a human. We hope to make this easier, and thus qualitatively more scalable, by moving data collection to a virtual environment."
Who is the real expert?
The team put LucidSim to the test against an alternative in which an expert teacher demonstrates the skill for the robot to learn from. The results were surprising: expert-trained robots struggled, succeeding only 15 percent of the time, and even quadrupling the amount of expert training data barely moved the needle. But when the robots collected their own training data through LucidSim, the situation changed dramatically: just doubling the size of the dataset raised the success rate to 88 percent. "And feeding our robot more data monotonically improves its performance — ultimately, the learner becomes the expert," Yang says.
“One of the main challenges in translating simulation to reality in robotics is achieving visual realism in simulated environments,” says Shuran Song, an assistant professor of electrical engineering at Stanford University who was not involved in the research. “The LucidSim framework provides an elegant solution by using generative models to create diverse, highly realistic visual data for any simulation. This work could significantly accelerate the deployment of robots trained in virtual environments for real-world tasks.”
From the streets of Cambridge to the cutting-edge of robotics research, LucidSim is paving the way for a new generation of intelligent, flexible machines—ones that learn to navigate our complex world without ever entering it.
Yu and Yang wrote the paper with four other CSAIL collaborators: Ran Choi, an MIT graduate student in mechanical engineering; Yajvan Ravan, an MIT undergraduate in EECS; John Leonard, the Samuel C. Collins Professor of Mechanical and Ocean Engineering in the MIT Department of Mechanical Engineering; and Phillip Isola, an MIT associate professor in EECS. Their work was supported in part by a Packard Fellowship, a Sloan Research Fellowship, the Office of Naval Research, the Singapore Defense Science and Technology Agency, Amazon, MIT Lincoln Laboratory, and the National Science Foundation's Institute for Artificial Intelligence and Fundamental Interactions. The researchers presented their work at the Conference on Robot Learning (CoRL) in early November.