Saturday, March 7, 2026

Using generative artificial intelligence to diversify virtual training grounds for robots


Over the past three years, the use of chatbots like ChatGPT and Claude has skyrocketed because they can help with a wide range of tasks. Whether you're writing a Shakespearean sonnet, debugging code, or looking for an answer to an obscure question, AI systems seem to have you covered. The source of this versatility? Billions, or even trillions, of words of text data from the Internet.

However, this data is not enough to teach a robot to be a helpful assistant in the household or the factory. To understand how to grasp, arrange, and place different configurations of objects across varied environments, robots need demonstrations. You can think of a robot's training data as a collection of instructional videos that walk the system through each motion of a task. Collecting these demonstrations on real robots is time-consuming and not perfectly repeatable, so engineers have instead created training data by generating simulations with AI (which often fail to reflect real-world physics) or by painstakingly building each digital environment from scratch by hand.

Scientists at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) and the Toyota Research Institute may have found a way to create the diverse, realistic training grounds that robots need. Their "steerable scene generation" approach creates digital scenes of environments such as kitchens, living rooms, and restaurants that engineers can use to simulate a wide range of real-world interactions and scenarios. Trained on over 44 million 3D rooms filled with models of objects such as tables and plates, the tool places existing assets in new scenes and then refines each one into a physically accurate, realistic environment.

In one particularly telling experiment, Monte Carlo tree search (MCTS) was used to add the maximum number of objects to a simple restaurant scene. The result packed as many as 34 objects onto a table, including massive stacks of dim sum dishes, after training on scenes that averaged only 17 objects.
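To make the idea concrete, here is a minimal, self-contained sketch of how MCTS can pack objects into a scene while keeping it physically feasible. Everything here is an illustrative assumption, not the paper's actual method: the "scene" is a 1-D table of fixed width, objects are intervals, and the reward is simply the number of objects placed without overlap.

```python
import math
import random

random.seed(0)

# Illustrative stand-ins (not from the paper): a 1-D table and a few object widths.
TABLE_WIDTH = 10.0
OBJECT_WIDTHS = [1.0, 1.5, 2.0]
C = 1.4  # UCB exploration constant

def feasible(scene, start, width):
    """An object fits if it stays on the table and overlaps nothing."""
    if start < 0 or start + width > TABLE_WIDTH:
        return False
    return all(start + width <= s or e <= start for s, e in scene)

def actions(scene):
    """Candidate placements: each width at a coarse grid of start positions."""
    return [(i * 0.5, w)
            for w in OBJECT_WIDTHS
            for i in range(int(TABLE_WIDTH * 2))
            if feasible(scene, i * 0.5, w)]

def rollout(scene):
    """Randomly add objects until nothing else fits; return the final scene."""
    scene = list(scene)
    while True:
        acts = actions(scene)
        if not acts:
            return scene
        s, w = random.choice(acts)
        scene.append((s, s + w))

class Node:
    def __init__(self, scene):
        self.scene = scene    # tuple of (start, end) intervals
        self.children = {}    # action -> Node
        self.visits = 0
        self.value = 0.0

def mcts(iters=200):
    root, best = Node(()), []
    for _ in range(iters):
        node, path = root, [root]
        # Selection: descend via UCB while the node is fully expanded.
        while node.children and all(a in node.children for a in actions(node.scene)):
            parent = node
            node = max(parent.children.values(),
                       key=lambda n: n.value / (n.visits + 1e-9)
                       + C * math.sqrt(math.log(parent.visits + 1) / (n.visits + 1e-9)))
            path.append(node)
        # Expansion: try one untried placement, if any remain.
        untried = [a for a in actions(node.scene) if a not in node.children]
        if untried:
            s, w = random.choice(untried)
            child = Node(node.scene + ((s, s + w),))
            node.children[(s, w)] = child
            path.append(child)
            node = child
        # Simulation and backpropagation; reward = number of placed objects.
        final = rollout(node.scene)
        if len(final) > len(best):
            best = final
        for n in path:
            n.visits += 1
            n.value += len(final)
    return best

best_scene = mcts(200)
```

The search favors placements whose random completions leave room for more objects, which is the same pressure that let the real system exceed the object counts seen in its training scenes.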

Steerable scene generation can also produce diverse training scenarios through reinforcement learning, essentially teaching a diffusion model to meet a goal through trial and error. After training on the initial data, the system goes through a second training stage where you define a reward (basically, a desired outcome paired with a score indicating how close the scene comes to that goal). The model automatically learns to create scenes with higher scores, often producing scenarios quite different from the ones it was trained on.
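The reward-guided second stage can be cartooned with a much simpler generator than a diffusion model. In this hedged sketch, a tiny "generator" holds logits over three hypothetical scene templates; we sample scenes, score them against a target object count, and nudge the logits with a REINFORCE-style update. The templates, reward, and learning rate are all illustrative assumptions.

```python
import math
import random

random.seed(1)

# Hypothetical scene templates mapped to how many objects they contain.
TEMPLATES = {"sparse_table": 5, "typical_table": 17, "packed_table": 30}
names = list(TEMPLATES)
logits = {n: 0.0 for n in names}

def sample():
    """Sample a template in proportion to softmax of the logits."""
    weights = [math.exp(logits[n]) for n in names]
    return random.choices(names, weights=weights)[0]

def reward(name, target=30):
    """Higher score the closer the scene's object count is to the target."""
    return -abs(TEMPLATES[name] - target) / target

def train(steps=2000, lr=0.1):
    baseline = 0.0
    for _ in range(steps):
        n = sample()
        r = reward(n)
        baseline += 0.05 * (r - baseline)  # moving-average baseline
        adv = r - baseline
        total = sum(math.exp(logits[m]) for m in names)
        probs = {m: math.exp(logits[m]) / total for m in names}
        for m in names:
            # Gradient of log pi(n) with respect to logit m.
            grad = (1.0 if m == n else 0.0) - probs[m]
            logits[m] += lr * adv * grad

train()
```

After training, the generator concentrates probability on the high-reward "packed" template, mirroring how the real system drifts toward scenes that score well even when they differ from its training distribution.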

Users can also prompt the system directly with specific visual descriptions (for example, "a kitchen with four apples and a bowl on the table"), and steerable scene generation can carry out those requests with precision. For example, the tool followed users' directions accurately for 98 percent of pantry-shelf scenes and 86 percent of messy breakfast-table scenes. Both scores represent at least a 10 percent improvement over comparable methods such as "MiDiffusion" and "DiffuScene."

The system can also complete specific scenes from prompts or partial layouts (for example, "come up with a different arrangement of the scene using the same objects"). You can ask it, say, to place apples on several plates on a kitchen table, or to put board games and books on a shelf. Essentially, it "fills in the blanks," slotting objects into empty spaces while keeping the rest of the scene intact.
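This fill-in-the-blanks behavior can be sketched as constrained completion: user-pinned objects stay fixed, and new objects are only sampled into the remaining free space. The 1-D table and rejection sampler below are illustrative stand-ins for the real diffusion-based completion; every name and number is an assumption.

```python
import random

random.seed(2)

TABLE_WIDTH = 12.0  # illustrative table size

def overlaps(a, b):
    """Two intervals overlap unless one ends before the other begins."""
    return not (a[1] <= b[0] or b[1] <= a[0])

def complete_scene(fixed, widths, tries=200):
    """Add objects of the given widths without moving the fixed ones."""
    scene = list(fixed)
    for w in widths:
        for _ in range(tries):
            s = random.uniform(0, TABLE_WIDTH - w)
            if all(not overlaps((s, s + w), o) for o in scene):
                scene.append((s, s + w))
                break  # placed this object; move on to the next
    return scene

# E.g., a bowl and a plate the user pinned in place, plus three new objects.
fixed = [(0.0, 2.0), (5.0, 7.0)]
scene = complete_scene(fixed, [1.0, 1.0, 1.5])
```

The pinned intervals come back untouched at the front of the completed scene, while the new objects land only in the gaps, which is the "keep the rest of the scene intact" property described above.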

According to the researchers, the strength of their project lies in its ability to create many scenes that roboticists can actually use. "The key takeaway from our findings is that it's okay if the scenes we trained on don't exactly resemble the ones we actually want," says Nicholas Pfaff. "Using our steering methods, we can move beyond that broad distribution and sample from a 'better' one. In other words, we can generate the diverse, realistic, task-specific scenes we actually want to train our robots in."

These vast scenes became testing grounds where the researchers could record a virtual robot interacting with various objects. For example, the machine carefully placed forks and knives into a cutlery holder and arranged bread onto plates in various 3D settings. Each simulation looked smooth and realistic, resembling the real-world environments that steerable scene generation could one day help train adaptive robots in.

While the system could be an encouraging avenue for generating a wide variety of training data for robots, the researchers say their work is more of a proof-of-concept. In the future, they would like to use generative AI to create completely new objects and scenes, rather than using a fixed library of assets. They also plan to include articulated objects that the robot can open or twist (such as cabinets or jars filled with food) to make the scenes even more interactive.

To make their virtual environments even more realistic, Pfaff and his colleagues may incorporate real-world objects, drawing on a library of objects and scenes pulled from Internet photos and building on their previous work on "Scalable Real2Sim." By expanding the variety and realism of AI-constructed robot testing grounds, the team hopes to build a community of users that creates a wealth of data, which could then serve as a massive dataset for teaching dexterous robots various skills.

“Today, creating realistic scenes for simulation can be quite a challenging endeavor; procedural generation can readily produce a large number of scenes, but they likely won’t be representative of the environments the robot would encounter in the real world. Manually creating bespoke scenes is both time-consuming and expensive,” says Jeremy Binagia, an applied scientist at Amazon Robotics who wasn’t involved in the paper. “Steerable scene generation offers a better approach: train a generative model on a large set of pre-existing scenes and adapt it (using a strategy such as reinforcement learning) to specific downstream applications. In contrast to previous works that leverage an off-the-shelf vision-language model or focus just on arranging objects in a 2D grid, this approach guarantees physical feasibility and considers full 3D translation and rotation, enabling the generation of much more interesting scenes.”

“Steerable scene generation with post-training and inference-time search provides a novel and efficient framework for automating scene generation at scale,” says Rick Cory SM ’08, PhD ’10, a roboticist at the Toyota Research Institute who was also not involved in the paper. “Moreover, it can generate ‘never-before-seen’ scenes that are deemed important for downstream tasks. In the future, combining this framework with vast internet data could unlock an important milestone towards efficiently training robots for deployment in the real world.”

Pfaff wrote the paper with senior author Russ Tedrake, the Toyota Professor of Electrical Engineering and Computer Science, Aeronautics and Astronautics, and Mechanical Engineering at MIT; a senior vice president of large behavior models at the Toyota Research Institute; and a CSAIL principal investigator. Other authors include Toyota Research Institute robotics researcher Hongkai Dai SM ’12, PhD ’16; team lead and senior research scientist Sergey Zakharov; and Carnegie Mellon University PhD student Shun Iwase. Their work was supported, in part, by Amazon and the Toyota Research Institute. The researchers presented their work at the Conference on Robot Learning (CoRL) in September.
