Tuesday, March 10, 2026

Alibaba’s AgentEvolver boosts agent model performance by ~30% with synthetic, auto-generated tasks


Researchers at Alibaba’s Tongyi Lab have developed a new framework for self-evolving agents that create their own training data by exploring their application environments. The framework, AgentEvolver, leverages the knowledge and reasoning capabilities of large language models for autonomous learning, eliminating the high cost and manual effort typically required to collect task-specific datasets.

Experiments show that compared to traditional reinforcement learning (RL) frameworks, AgentEvolver explores its environment more effectively, makes better use of data, and adapts faster to application environments. For the enterprise, this matters because it lowers the barrier to training agents on custom applications, making powerful, tailored AI assistants accessible to a broader range of organizations.

High costs of training AI agents

Reinforcement learning has become a major paradigm for training LLMs to act as agents that can interact with digital environments and learn from feedback. However, the development of RL agents faces fundamental challenges. First, collecting the necessary training datasets is often prohibitively costly and requires significant manual labor to create task examples, especially in novel or proprietary software environments where off-the-shelf datasets are not available.

Second, the RL techniques commonly used with LLMs require the model to undergo a huge amount of trial and error to learn effectively. This process is computationally expensive and inefficient. As a result, training capable LLM agents through RL remains labor-intensive and costly, limiting their deployment in custom enterprise settings.

How AgentEvolver works

The main idea of AgentEvolver is to give models more autonomy over their own learning process. The researchers describe it as a “self-evolving agent system” that aims to “achieve autonomous and efficient evolution of capabilities through interaction with the environment.” It uses the reasoning ability of LLMs to create self-learning loops, enabling the agent to continuously improve by directly interacting with the target environment, without the need for pre-defined tasks or reward functions.

“We envision an agent system in which the LLM actively guides exploration, task generation, and performance refinement,” the researchers wrote in their paper.

The process of self-evolution is driven by three basic mechanisms that work together.

The first is self-questioning, where the agent explores its environment to discover the limits of its functions and identify useful states. It’s like a new user clicking around an app to see what’s possible. Based on this exploration, the agent generates its own diverse set of tasks that are consistent with the user’s overall preferences. This reduces the need for manually created datasets and allows the agent and its tasks to evolve together, gradually enabling it to handle more complex challenges.
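To make the idea concrete, here is a minimal, hypothetical sketch of a self-questioning loop. The function names (`explore_environment`, `self_question`) and the stubbed `llm` call are our own illustrations, not AgentEvolver’s actual API; in practice the stub would be a real chat-completion call.

```python
# Illustrative sketch of a self-questioning loop (not the paper's code).
# `llm` is a stand-in for any chat-completion call; it is stubbed here
# so the example runs without an API key or a real environment.

def llm(prompt: str) -> str:
    """Stub LLM: in practice, call a real model (e.g., a Qwen2.5 variant)."""
    return "Task: list all invoices created this month using search_invoices"

def explore_environment(tools: list[str], steps: int = 3) -> list[str]:
    """Probe the environment to discover what the available tools can do."""
    observations = []
    for tool in tools[:steps]:
        observations.append(f"Tool '{tool}' is callable")
    return observations

def self_question(tools: list[str], user_preference: str) -> list[str]:
    """Generate candidate training tasks from exploration, with no human labels."""
    observations = explore_environment(tools)
    prompt = (
        f"Environment observations: {observations}\n"
        f"User preference: {user_preference}\n"
        "Propose a concrete task an agent could practice here."
    )
    return [llm(prompt)]

tasks = self_question(["search_invoices", "create_report"], "finance workflows")
print(tasks[0])
```

The key design point is that tasks emerge from observed capabilities of the environment rather than from a hand-labeled dataset.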

According to Yunpeng Zhai, a researcher at Alibaba and co-author of the paper who spoke to VentureBeat, the self-questioning mechanism effectively changes the model from “data consumer to data producer,” dramatically reducing the time and cost required to deploy an agent in a proprietary environment.

The second mechanism is self-navigating, which improves exploration efficiency by reusing and generalizing past experiences. AgentEvolver learns from both successful and unsuccessful trials and uses them to guide future actions. For example, if an agent tries to call an API function that does not exist in the application, it records this as an experience and learns to verify that a function exists before attempting to call it in the future.

The third mechanism, self-attributing, increases learning efficiency by providing more detailed feedback. Instead of just a final signal of success or failure (a common practice in RL that can result in sparse rewards), this mechanism uses an LLM to evaluate the contribution of each individual action in a multi-step task. It retrospectively determines whether each step had a positive or negative impact on the final result, giving the agent fine-grained feedback that accelerates learning.

This is crucial in regulated industries, where how an agent solves a problem is as important as the outcome. “Rather than rewarding a student only for the final answer, we also evaluate the clarity and correctness of each step of their reasoning,” Zhai explained. This improves transparency and encourages the agent to adopt more robust and auditable problem-solving patterns.
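The per-step credit assignment described above can be sketched as follows. The `judge_step` function stands in for the LLM judge (here it is a trivial keyword heuristic so the example is runnable); in the actual system, an LLM would assess each action’s contribution.

```python
# Sketch of LLM-based per-step credit assignment ("self-attributing"):
# instead of one sparse end-of-episode reward, every action in a
# trajectory receives its own signed judgment. The judge is stubbed.

def judge_step(step: str, outcome: str) -> int:
    """Stand-in for an LLM judge: did this step help reach the outcome?
    A real judge would reason over the step and outcome text."""
    return 1 if "correct" in step else -1

def attribute(trajectory: list[str], outcome: str) -> list[int]:
    """Return a dense, per-step reward signal for the whole trajectory."""
    return [judge_step(step, outcome) for step in trajectory]

trajectory = [
    "looked up the correct API endpoint",
    "passed a malformed date parameter",
    "retried with the correct format",
]
rewards = attribute(trajectory, outcome="task succeeded")
print(rewards)  # one signed reward per step instead of a single final score
```

A dense signal like this tells the agent *which* step to fix, rather than forcing it to infer the culprit from a single end-of-task score.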

“By shifting the training initiative from human-designed pipelines to LLM-led self-improvement, AgentEvolver establishes a new paradigm that paves the way towards scalable, cost-effective, and continuously improving intelligent systems,” the researchers state.

The team also developed a practical, end-to-end training framework that integrates these three mechanisms. A key part of this framework is the context manager, a component that controls the agent’s memory and interaction history. Although today’s benchmarks test a constrained number of tools, real-world enterprise environments can contain thousands of APIs.
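As a rough illustration of what such a component does, here is a hypothetical sketch of a context manager that keeps the agent’s interaction history within a token budget. The class name, the word-count token approximation, and the trimming policy are all our assumptions, not details from the paper.

```python
# Hypothetical sketch of a context manager that bounds the agent's
# interaction history. Tokens are approximated as whitespace-separated
# words purely for illustration.

class ContextManager:
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.history: list[str] = []

    def add(self, message: str) -> None:
        """Append a new interaction record to the full history."""
        self.history.append(message)

    def window(self) -> list[str]:
        """Return the most recent messages that fit within the budget,
        dropping the oldest entries first."""
        kept, used = [], 0
        for msg in reversed(self.history):
            cost = len(msg.split())
            if used + cost > self.max_tokens:
                break
            kept.append(msg)
            used += cost
        return list(reversed(kept))

ctx = ContextManager(max_tokens=8)
for m in ["call search_invoices", "got 3 results back", "call create_report now"]:
    ctx.add(m)
print(ctx.window())  # oldest message dropped to stay within the budget
```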

Zhai admits this is a major challenge in the field, but notes that AgentEvolver was designed with extensibility in mind. “Searching across extremely large tool spaces will always present computational challenges, but the AgentEvolver architecture provides a clear path toward scalable tool use in enterprise environments,” he said.

A more effective agent training path

To measure the effectiveness of their framework, the researchers tested it on AppWorld and BFCL v3, two benchmarks that require agents to perform long, multi-step tasks using external tools. They used models from Alibaba’s Qwen2.5 family (7B and 14B parameters) and compared their performance against a baseline model trained with GRPO, a popular RL technique used to develop reasoning models such as DeepSeek-R1.

The results showed that integrating all three mechanisms in AgentEvolver led to significant performance gains: the average score improved by 29.4% for the 7B model and by 27.8% for the 14B model compared to the baseline. The framework consistently improved the models’ reasoning and task-completion capabilities across both benchmarks. The largest contribution came from the self-questioning module, which autonomously generates a variety of training tasks and directly addresses the data-scarcity problem.

Experiments also demonstrated that AgentEvolver can efficiently synthesize large amounts of high-quality training data. The tasks generated by the self-questioning module proved diverse enough that good training results could be achieved with only a small amount of data.

For enterprises, this provides the ability to create agents for custom applications and internal workflows while minimizing the need to manually annotate data. By providing high-level goals and allowing the agent to generate its own training experiences, organizations can develop custom AI assistants in a simpler and cheaper way.

“This combination of algorithmic design and engineering pragmatism positions AgentEvolver as both a research tool and a reusable foundation for creating adaptive, tool-using agents,” the researchers conclude.

Looking ahead, the ultimate goal is much bigger. “A truly ‘one model’ system that can be deployed in any software environment and master it within a day is certainly the holy grail of agentic AI,” Zhai said. “We see AgentEvolver as an essential step in this direction.” While this future still requires breakthroughs in model reasoning and infrastructure, self-evolving approaches are paving the way.
