Thursday, March 12, 2026

A new memory framework creates AI agents that can cope with the unpredictability of the real world


Researchers from the University of Illinois at Urbana-Champaign and Google Cloud AI Research have developed a framework that enables large language model (LLM) agents to organize their experiences into a memory bank, helping them better cope with complex tasks over time.

The framework, called ReasoningBank, distills “generalizable reasoning strategies” from the agent’s successful and failed attempts to solve problems. The agent then draws on this memory at inference time to avoid repeating past mistakes and to make better decisions when faced with new problems. The researchers show that when combined with test-time scaling techniques, where an agent makes multiple attempts at a problem, ReasoningBank significantly improves the efficiency and effectiveness of LLM agents.

Their findings show that ReasoningBank consistently outperforms classic memory mechanisms on web browsing and software engineering benchmarks, offering a practical path toward more adaptive and reliable AI agents for enterprise applications.

The LLM Agent Memory Challenge

As LLM agents are deployed in long-running applications, they encounter a constant stream of tasks. A key limitation of current LLM agents is that they do not learn from accumulated experience. By treating each task in isolation, they inevitably repeat past mistakes, discard valuable insights from related problems, and fail to develop skills that would expand their capabilities over time.

The solution to this limitation is to provide agents with some type of memory. Previous efforts to provide agents with memory have focused on storing past interactions for reuse by organizing information in various forms, from plain text to structured graphs. However, these approaches often fail. Many of them utilize raw interaction logs or only store examples of successful tasks. This means that they are unable to extract transferable patterns of higher-level reasoning and, most importantly, they do not extract and utilize valuable information from the agent’s failures. As the researchers note in their paper, “existing memory designs often limit themselves to passively storing records rather than providing practical and generalizable guidance for future decisions.”

How ReasoningBank works

ReasoningBank is a memory framework designed to overcome these limitations. Its core idea is to distill useful strategies and reasoning hints from past experiences into structured memory items that can be stored and reused.

According to Jun Yan, a researcher at Google and co-author of the paper, this represents a fundamental change in the way agents operate. “Traditional agents operate statically – each task is processed separately,” Yan explained. “ReasoningBank changes this by transforming every task experience (successful or unsuccessful) into a structured, reusable reasoning memory. As a result, the agent doesn’t start from scratch with every customer; it recalls and adapts proven strategies from similar past cases.”

The framework takes both successful and failed experiences and turns them into a set of actionable strategies and preventive lessons. The agent judges success and failure with an LLM-as-a-judge, avoiding the need for human labeling.

Yan provides a practical example of this process in action. An agent tasked with finding Sony headphones may fail because its broad query returns over 4,000 irrelevant products. “ReasoningBank will first try to find out why this approach failed,” Yan said. “It will then extract strategies such as ‘search term optimization’ and ‘product narrowing using category filtering.’ These strategies will be extremely useful for successfully completing similar tasks in the future.”

The process runs in a closed loop. When an agent faces a new task, it uses embedding-based search to retrieve relevant memories from ReasoningBank to guide its actions. These memories are inserted into the agent’s system prompt, providing context for its decision-making. Once a task is completed, the framework creates new memory items that extract lessons from both successes and failures. This new knowledge is then analyzed, distilled, and merged back into ReasoningBank, allowing the agent to continually evolve and improve its capabilities.
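The retrieve-act-distill loop described above can be sketched in a few lines of Python. This is an illustrative mock, not the paper's implementation: the `MemoryItem` fields, the `ReasoningBankSketch` class, and the word-overlap retrieval are stand-ins of my own (the real system uses embedding similarity for retrieval and an LLM to distill strategies from trajectories).

```python
from dataclasses import dataclass

@dataclass
class MemoryItem:
    title: str        # short label for the strategy or pitfall
    description: str  # what kind of task it came from
    content: str      # the distilled lesson itself

class ReasoningBankSketch:
    """Toy sketch of the retrieve -> act -> distill memory loop."""

    def __init__(self):
        self.items: list[MemoryItem] = []

    def retrieve(self, query: str, k: int = 2) -> list[MemoryItem]:
        # Stand-in for embedding search: rank items by word overlap with the query.
        def score(item: MemoryItem) -> int:
            q = set(query.lower().split())
            d = set(item.description.lower().split())
            return len(q & d)
        return sorted(self.items, key=score, reverse=True)[:k]

    def distill(self, task: str, trajectory: str, success: bool) -> MemoryItem:
        # In the paper an LLM judges the outcome and extracts a strategy;
        # here we simply record a labeled lesson.
        kind = "strategy" if success else "pitfall"
        item = MemoryItem(
            title=f"{kind}: {task}",
            description=task,
            content=f"Lesson from {'success' if success else 'failure'}: {trajectory}",
        )
        self.items.append(item)
        return item

# One pass through the loop: distill a lesson, then retrieve it for a similar task.
bank = ReasoningBankSketch()
bank.distill("find Sony headphones", "narrow the query with category filters", success=True)
hits = bank.retrieve("find Sony headphones on the store")
```

In a real agent, the retrieved items would be formatted into the system prompt before the agent acts, closing the loop the paragraph describes.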

Boosting memory by scaling

The researchers also discovered a powerful synergy between memory and test-time scaling. Classic test-time scaling involves generating multiple independent answers to the same question, but the researchers argue this “vanilla form is suboptimal because it does not take advantage of the inherent contrast signal that arises from redundant exploration of the same problem.”

To address this, they propose memory-aware test-time scaling (MaTTS), which integrates scaling with ReasoningBank. MaTTS comes in two forms. With parallel scaling, the system generates multiple trajectories for the same query and then compares and contrasts them to identify consistent reasoning patterns. With sequential scaling, the agent iteratively refines its reasoning within a single attempt, with intermediate notes and self-corrections also serving as valuable memory signals.
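Parallel MaTTS can be illustrated with a toy sketch: run several rollouts of the same query, treat the majority answer as the self-consistent one, and keep the agreeing and disagreeing trajectories as contrastive signals for memory distillation. Everything here (`parallel_matts`, `toy_solver`) is a hypothetical illustration under those assumptions, not the paper's code.

```python
from collections import Counter

def parallel_matts(solve, query, k=3):
    """Run k independent rollouts, pick the majority answer, and split
    rollouts into agreeing/disagreeing sets to use as contrast signals."""
    rollouts = [solve(query, seed=i) for i in range(k)]
    answers = [r["answer"] for r in rollouts]
    best, _ = Counter(answers).most_common(1)[0]
    agree = [r for r in rollouts if r["answer"] == best]
    disagree = [r for r in rollouts if r["answer"] != best]
    return best, agree, disagree

def toy_solver(query, seed):
    # Hypothetical stand-in for an LLM agent rollout: deterministic toy
    # that returns a different answer on one seed, simulating disagreement.
    answer = "filtered-search" if seed != 1 else "broad-search"
    return {"answer": answer, "trace": f"rollout-{seed} for {query!r}"}

best, agree, disagree = parallel_matts(toy_solver, "find Sony headphones")
```

The disagreeing rollouts are exactly the “contrast signal” the researchers describe: comparing them against the majority trajectories is what lets the agent distill higher-quality memories than independent sampling alone.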

This creates a virtuous circle: the existing memory in ReasoningBank steers the agent towards more promising solutions, while the diverse experiences generated by scaling enable the agent to create higher quality memories for storage in ReasoningBank.

“This positive feedback loop positions memory-based experience scaling as a new dimension of scaling for agents,” the researchers write.

ReasoningBank in action

The researchers tested their framework on the WebArena (web browsing) and SWE-Bench-Verified (software engineering) benchmarks, using models such as Google’s Gemini 2.5 Pro and Anthropic’s Claude 3.7 Sonnet. They compared ReasoningBank against baselines including memory-free agents and agents using trajectory-based or workflow-based memory.

The results show that ReasoningBank consistently outperforms these baselines across datasets and backbone models. On WebArena, it improved the overall success rate by as much as 8.3 percentage points compared to the memory-free agent. It also generalizes better to harder, cross-domain tasks while reducing the number of interaction steps needed to complete them. When combined with MaTTS, both parallel and sequential scaling further improved performance, consistently outperforming standard test-time scaling.

This gain in efficiency has a direct impact on operating costs. Yan points to a case where a memory-free agent took eight trial-and-error steps to find the right product filter on a website. “These trial-and-error costs could be avoided by leveraging the right insights from ReasoningBank,” he noted. “In this case, we save almost twice as much in operational costs,” which also improves the user experience by resolving issues faster.

For enterprises, ReasoningBank can help build cost-effective agents that learn from experience and adapt over time across complex workflows and domains such as software development, customer service, and data analysis. As the paper concludes: “Our findings suggest a practical path toward building adaptive and lifelong learning agents.”

Yan said their findings point toward truly compositional intelligence. For example, a coding agent might learn distinct skills, such as API integration and database management, from separate tasks. “Over time, these modular skills… become elements that an agent can flexibly combine to solve more complex tasks,” he said, suggesting a future in which agents autonomously compose their knowledge to manage entire workflows with minimal human supervision.
