Tuesday, March 10, 2026

Meta’s DreamGym platform trains AI agents in a simulated world to reduce the cost of reinforcement learning

Researchers from Meta, the University of Chicago, and the University of California, Berkeley, have developed a new framework that addresses the high costs, infrastructure complexity, and unreliable feedback associated with using reinforcement learning (RL) to train large language model (LLM) agents. The framework, DreamGym, simulates an RL environment to train agents on complex applications. As training progresses, the framework dynamically adjusts task difficulty, ensuring that the agent gradually learns to solve increasingly hard problems as it improves.

The research team’s experiments show that DreamGym significantly improves RL training in both fully synthetic settings and in scenarios where the model must transfer simulated learning to the real world. In environments where RL is possible but expensive, it matches the performance of popular RL algorithms using only synthetic interactions, significantly reducing the costs of data collection and environment interaction.

This approach can be critical for enterprises, enabling them to train agents on their own applications while avoiding the complexity of configuring and maintaining live RL environments.

The challenge of training LLM agents

Reinforcement learning is a key technique for training LLMs to handle complex tasks in agentic settings such as web navigation, tool use, and robotics. It enables models to learn from direct interaction and experience, going beyond the static datasets used in pre-training.

However, RL for agent training remains difficult. Real-world applications often involve long sequences of actions with sparse rewards, meaning the agent receives a positive signal only after completing a long, correct sequence of actions.

Collecting sufficiently diverse and validated data is also expensive and often requires experts to verify tasks and annotate outcomes. The infrastructure required to run live environments for large-scale RL training can be prohibitively complex and costly. Not to mention, interacting with live systems carries risks, as inappropriate actions (such as deleting a file) can cause irreversible damage.

“These limitations make building universal and scalable systems for training agents with RL an open and urgent challenge,” the researchers write.

DreamGym tackles these challenges directly by delivering comparable performance entirely in simulation, eliminating the infrastructure burden that keeps most enterprises from adopting RL, and giving teams a practical path to train agents without touching expensive or risky live environments.

How DreamGym works

The researchers describe DreamGym as “a unified and scalable RL platform that synthesizes diverse experience data in an online manner to enable efficient and effective LLM agent training.” It is built around three core elements that work together to create a controlled and effective training loop.

The first component is a “reasoning-based experience model” that translates the dynamics of the target environment into text space. This model serves as an application environment simulator. Instead of interacting with a costly real environment, the agent interacts with this model, which generates consistent state transitions and feedback based on the agent’s actions.

The researchers argue that training agents does not require perfectly realistic environments, but rather data that is “sufficiently diverse, informative, and causally grounded.” For example, for an online shopping task, the model synthesizes clean lists of items on a page rather than processing raw HTML. This abstraction makes training the experience model very efficient and requires only a small amount of public data.
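To make this abstraction idea concrete, the sketch below shows an experience model that emits a clean, text-space observation instead of raw page markup. All names and the hard-coded transition are illustrative assumptions, not DreamGym's actual API; in the real system an LLM would generate these states.

```python
from dataclasses import dataclass, field

@dataclass
class AbstractState:
    """A clean, text-space observation (item names and prices)
    standing in for a raw HTML page."""
    description: str
    items: list = field(default_factory=list)

def synthesize_shopping_state(query: str) -> AbstractState:
    # Hypothetical stand-in: a reasoning LLM would normally generate
    # this synthetic page; here we hard-code a plausible result.
    items = [f"{query} option {i} - ${10 + 5 * i}" for i in range(1, 4)]
    return AbstractState(
        description=f"Search results for '{query}'",
        items=items,
    )

state = synthesize_shopping_state("running shoes")
print(state.items[0])  # "running shoes option 1 - $15"
```

The agent then reasons over `state.items` directly, which is far cheaper than parsing and rendering real web pages.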

The second component is the “experience replay buffer,” which acts as a dynamic memory. At the start of training, the buffer is seeded with offline data to provide the necessary context, and it is continuously updated with new synthetic trajectories generated during training. This buffer helps ground the experience model’s predictions, ensuring that synthetic experiences remain diverse and factually consistent.
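A minimal sketch of such a buffer is shown below: it is seeded with offline trajectories, grows with new synthetic rollouts, and is sampled to condition the experience model's next predictions. The class and its methods are illustrative assumptions, not DreamGym's implementation.

```python
import random
from collections import deque

class ExperienceReplayBuffer:
    """Seeded with offline trajectories, then continuously updated
    with synthetic rollouts (illustrative sketch)."""
    def __init__(self, offline_trajectories, capacity=10_000):
        # Bounded deque: oldest trajectories fall out once full.
        self.buffer = deque(offline_trajectories, maxlen=capacity)

    def add(self, trajectory):
        self.buffer.append(trajectory)

    def sample(self, k):
        # Sampled trajectories would condition the experience model's
        # state/reward predictions, keeping them diverse and grounded.
        return random.sample(list(self.buffer), min(k, len(self.buffer)))

buf = ExperienceReplayBuffer(offline_trajectories=[{"task": "t0", "reward": 1}])
buf.add({"task": "t1", "reward": 0})   # new synthetic trajectory
batch = buf.sample(2)
print(len(batch))  # 2
```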

The third component, the “curriculum task generator,” works with the experience model to adaptively create new tasks of increasing difficulty. It identifies tasks where the agent’s performance varies (signaling that they are hard but solvable) and generates variations of them to push the agent’s capabilities.
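The selection logic the paragraph describes can be sketched as a simple heuristic: pick tasks whose recent success rate is neither near zero nor near one, then generate variants of them. The thresholds and the mutation step are hypothetical placeholders for what would be LLM-driven generation in the real framework.

```python
def select_frontier_tasks(task_success_history, low=0.2, high=0.8):
    """Pick tasks that are difficult but solvable: recent success
    rate between `low` and `high` (illustrative heuristic)."""
    frontier = []
    for task, outcomes in task_success_history.items():
        rate = sum(outcomes) / len(outcomes)
        if low <= rate <= high:
            frontier.append(task)
    return frontier

def mutate_task(task):
    # Placeholder for LLM-driven variation; the real generator would
    # produce a genuinely harder version of the frontier task.
    return task + " (harder variant)"

history = {
    "find cheapest shoes": [1, 0, 1, 0],     # mixed results -> frontier
    "open homepage": [1, 1, 1, 1],           # already solved -> skip
    "book round-trip flight": [0, 0, 0, 0],  # too hard for now -> skip
}
new_tasks = [mutate_task(t) for t in select_frontier_tasks(history)]
print(new_tasks)  # ["find cheapest shoes (harder variant)"]
```

Tasks the agent always solves or always fails carry little training signal, so only the mixed-outcome "frontier" is expanded.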

Together, these components create a closed-loop system for scalable agent training. “By unifying interaction, memory, and adaptive task generation online, DreamGym addresses persistent challenges that limit RL for LLM agent training: prohibitive costs, scarcity of diverse tasks, unstable reward signals, and high infrastructure requirements,” the researchers say.
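The closed loop described above can be sketched end to end with stub components: roll out the agent against the text-space experience model, store the synthetic trajectory, and update the policy. Every class here is an illustrative stand-in (the real agent would be an LLM policy updated with PPO/GRPO-style steps), not DreamGym's code.

```python
class StubAgent:
    def act(self, state):           # stand-in for an LLM policy
        return f"click({state})"
    def update(self, trajectory):   # stand-in for a PPO/GRPO update
        pass

class StubExperienceModel:
    def step(self, state, action):  # text-space transition + reward
        return f"{state}->next", 1.0

def rollout(agent, env_model, task, horizon=3):
    state, traj = task, []
    for _ in range(horizon):
        action = agent.act(state)
        state, reward = env_model.step(state, action)
        traj.append((state, action, reward))
    return traj

buffer, agent, env = [], StubAgent(), StubExperienceModel()
for task in ["search shoes", "compare prices"]:   # curriculum output
    traj = rollout(agent, env, task)
    buffer.append(traj)   # replay buffer grows with synthetic data
    agent.update(traj)    # policy improves from synthetic rewards
print(len(buffer))  # 2
```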

DreamGym in action

The researchers evaluated DreamGym on several agent benchmarks, including WebShop (e-commerce), ALFWorld (embodied control), and WebArena (realistic web interaction). They used Llama 3 and Qwen 2.5 models as agent backbones and compared DreamGym against several standard training strategies. These included offline methods such as supervised fine-tuning (SFT) and direct preference optimization (DPO), as well as online RL algorithms such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), which improve agents through interaction with a live environment.

DreamGym showed its greatest advantage in environments such as WebArena, where establishing a large-scale RL infrastructure is difficult. Agents trained entirely in DreamGym achieved success rates over 30% higher than baseline methods, which struggled with sparse rewards and limited exploration in the real environment. The researchers said this shows that DreamGym makes RL training “feasible in domains that were previously infeasible due to inherent task and engineering constraints.”

In environments where RL is supported but costly, agents trained with DreamGym performed on par with agents trained with GRPO and PPO, but without any costly interactions with the external environment. The team also introduced a sim-to-real approach, DreamGym-S2R, in which the agent is first trained in the synthetic environment and then fine-tuned on a small amount of real-world data. This strategy yielded over 40% performance improvement compared to training from scratch in the real environment, while using less than 10% of the real-world data. This provides a scalable warm start for training general-purpose agents.
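The two-phase schedule behind this sim-to-real recipe can be sketched as below: a long, cheap synthetic phase followed by a short fine-tune on scarce real trajectories. The function, the 9:1 batch split, and the learning rates are all illustrative assumptions, not values from the paper.

```python
def sim_to_real(agent_update, synthetic_batches, real_batches):
    """Sketch of a DreamGym-S2R-style schedule: warm-start on
    synthetic data, then fine-tune on a small real-world slice."""
    steps = 0
    for batch in synthetic_batches:    # cheap and abundant
        agent_update(batch, lr=1e-5)
        steps += 1
    for batch in real_batches:         # expensive, <10% of the total
        agent_update(batch, lr=5e-6)   # smaller lr preserves the warm start
        steps += 1
    return steps

log = []
total = sim_to_real(lambda b, lr: log.append((b, lr)),
                    synthetic_batches=range(9),   # 90% synthetic
                    real_batches=range(1))        # 10% real
print(total)  # 10
```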

Finally, the framework demonstrated strong generalization. An agent trained on tasks in one domain (e.g., WebShop) can successfully transfer the acquired skills to another (e.g., WebArena). The researchers attribute this to DreamGym agents learning in an “abstract meta-representation space, which enables the agent to learn domain-independent behavioral priors rather than memorizing task-specific patterns.”

While the DreamGym project is still in its early stages, it demonstrates that simulated environments can provide substantial benefits for agent training. In practice, an enterprise could collect a small number of trajectories and task descriptions for the workflows it wants to automate, then use that small seed to bootstrap the DreamGym framework for scalable and efficient agent training.
