2025 was supposed to be the year of “AI agents,” according to Nvidia CEO Jensen Huang and others in the AI industry. And in many respects, leading AI model providers such as OpenAI, Google, and even Chinese competitors such as Alibaba have released refined AI models or applications designed to focus on a narrow set of tasks, such as web searches and report writing.
However, one gigantic obstacle remains on the path to highly competent and reliable AI agents: getting them to keep performing a task when it involves many steps. Third-party benchmarks show that even the most powerful AI models suffer more failures the more steps they take to complete a task and the longer they spend on it, especially once execution stretches past several hours.
Now a new academic framework called EAGLET proposes a practical and effective method for improving long-horizon task performance in LLM-based agents – without the need for manual data labeling or retraining.
Developed by researchers from Tsinghua University, Peking University, DeepLang AI, and the University of Illinois Urbana-Champaign, EAGLET offers a “global planner” that can be integrated into existing agent workflows to reduce hallucinations and improve task efficiency.
EAGLET is a fine-tuned language model that interprets task instructions – typically provided as prompts by the user or the agent’s operating environment – and generates a high-level plan for the executor agent (which runs on its own LLM). It does not intervene during execution, but the up-front guidance helps reduce planning errors and improve task completion rates.
Solving the planning problem for long-horizon agents
Many LLM-based agents struggle with long-horizon tasks because they rely on step-by-step reactive reasoning. This approach often leads to trial-and-error behavior, hallucinated plans, and inefficient trajectories.
EAGLET addresses this limitation by introducing a global planning module that works alongside the executor agent.
Instead of combining planning and action generation in a single model, EAGLET separates them, enabling more consistent task-level strategy.
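The separation described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names and the canned plan and actions are hypothetical stand-ins for real LLM calls.

```python
# Illustrative sketch of EAGLET-style plan-executor separation.
# All names and return values here are hypothetical stand-ins;
# a real system would back both functions with LLM calls.

def global_planner(task_instruction: str) -> str:
    """Stand-in for the fine-tuned planner LLM: turns a task
    instruction into one high-level plan before execution starts."""
    return ("1. Locate the target object. "
            "2. Move it to the requested receptacle. "
            "3. Confirm task completion.")

def executor_step(task_instruction: str, plan: str, history: list) -> str:
    """Stand-in for the executor LLM: picks the next action given the
    instruction, the up-front global plan, and the action history."""
    # The plan is injected once, up front; the planner never
    # intervenes mid-episode. This prompt would go to the executor LLM.
    prompt = f"Task: {task_instruction}\nPlan: {plan}\nHistory: {history}"
    return f"action_{len(history) + 1}"  # placeholder action choice

# One episode: plan once, then act step by step.
task = "Put a clean mug on the coffee table."
plan = global_planner(task)
history = []
for _ in range(3):
    history.append(executor_step(task, plan, history))

print(history)  # → ['action_1', 'action_2', 'action_3']
```

The design choice worth noting is that the executor loop is unchanged apart from one extra prompt field, which is what makes the planner pluggable.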
Two-step training process without human annotations
The EAGLET planner is trained in a two-stage process that requires no human-written plans or annotations.
The first step involves generating synthetic plans using high-performance LLMs such as GPT-5 and DeepSeek-V3.1-Think.
These plans are then filtered using a novel strategy called homologous consensus filtering, which keeps only those that improve task performance for both expert and novice executor agents.
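A hedged sketch of that filtering idea, as described here: keep a synthetic plan only if it lifts task performance for both a stronger and a weaker executor relative to their no-plan baselines. The scoring functions and data below are toy assumptions, not the paper's actual pipeline.

```python
# Toy sketch of homologous-consensus-style filtering: a candidate plan
# survives only if it improves BOTH executor tiers over their no-plan
# baselines. Scorers and plan records are illustrative assumptions.

def filter_plans(plans, expert_score, novice_score):
    """Return plans that raise performance for both executor tiers."""
    expert_base = expert_score(None)  # score with no plan
    novice_base = novice_score(None)
    return [p for p in plans
            if expert_score(p) > expert_base
            and novice_score(p) > novice_base]

# Toy scorers: pretend each plan's measured gains are stored on it.
expert = lambda plan: 0.7 if plan is None else plan["expert_gain"]
novice = lambda plan: 0.3 if plan is None else plan["novice_gain"]

candidates = [
    {"text": "plan A", "expert_gain": 0.9, "novice_gain": 0.5},  # helps both
    {"text": "plan B", "expert_gain": 0.8, "novice_gain": 0.2},  # expert only
]
kept = filter_plans(candidates, expert, novice)
print([p["text"] for p in kept])  # → ['plan A']
```

Requiring consensus across executors of different ability is what screens out plans that only work for models that would have succeeded anyway.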
In the second stage, a rule-based reinforcement learning process further refines the planner, using a specially designed reward function to evaluate how well each plan helps multiple executors succeed.
Introducing the Executor Capability Gain Reward (ECGR)
One of EAGLET’s key innovations is the Executor Capability Gain Reward (ECGR).
This reward measures a generated plan’s value by whether it helps both high- and low-capability agents complete tasks more efficiently and in fewer steps.
It also includes a decay factor that favors shorter, more efficient task trajectories. This approach avoids over-rewarding plans that are only useful to already-capable agents and promotes more generalizable planning guidance.
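The two properties just described can be illustrated with a toy reward. To be clear, this is not the paper's exact formula: it is a sketch that assumes the reward averages a success signal across executors of different ability and discounts it geometrically by trajectory length.

```python
# Illustrative take on an ECGR-style reward (NOT the paper's formula):
# reward a plan for helping multiple executors succeed, and decay the
# reward of longer trajectories so shorter solutions score higher.

def ecgr_sketch(outcomes, gamma=0.9):
    """outcomes: list of (succeeded, steps) pairs, one per executor
    (e.g. an expert and a novice) run with the candidate plan.
    Returns the average length-decayed success reward."""
    rewards = [(gamma ** steps if succeeded else 0.0)
               for succeeded, steps in outcomes]
    return sum(rewards) / len(rewards)

# A plan that lets both executors finish, in fewer steps, scores higher
# than one where runs are longer and the novice executor fails.
short_runs = [(True, 5), (True, 7)]    # both succeed quickly
long_runs = [(True, 12), (False, 20)]  # slower, and the novice fails
print(ecgr_sketch(short_runs) > ecgr_sketch(long_runs))  # → True
```

The averaging term is what pushes the planner toward plans that generalize across executor ability levels, while the decay term encodes the preference for shorter trajectories.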
Compatible with existing agents and models
The EAGLET planner is designed to be modular and plug-and-play, meaning it can be incorporated into existing agent pipelines without retraining the executor agents.
In evaluations, the planner improved performance across various executor models, including GPT-4.1, GPT-5, Llama-3.1, and Qwen2.5.
It also proved effective regardless of prompting strategy, working well with standard ReAct-style prompts as well as approaches like Reflexion.
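Since the planner is prompt-level and model-agnostic, integration can be as small as prepending the global plan to an existing ReAct-style prompt. The template and helper below are hypothetical; they only demonstrate that the executor loop itself need not change.

```python
# Minimal sketch of plug-and-play integration: the only change to an
# existing ReAct-style setup is an optional plan section prepended to
# the prompt. Template wording and function names are assumptions.

REACT_TEMPLATE = """{plan_section}Task: {task}

Thought: reason about the next step.
Action: the tool call to make.
Observation: the environment's response."""

def build_react_prompt(task, global_plan=None):
    """Build a ReAct-style prompt, optionally guided by a global plan."""
    plan_section = f"Global plan:\n{global_plan}\n\n" if global_plan else ""
    return REACT_TEMPLATE.format(plan_section=plan_section, task=task)

# Without a planner: the executor sees only the task.
baseline = build_react_prompt("Find a cheap red kettle and buy it.")

# With an EAGLET-style planner: same loop, plan injected up front.
planned = build_react_prompt(
    "Find a cheap red kettle and buy it.",
    global_plan="1. Search for red kettles. 2. Sort by price. 3. Buy one.",
)
print(planned.startswith("Global plan:"))  # → True
```

Because the plan arrives as plain prompt text, the same approach composes with other prompting strategies (such as Reflexion-style retries) without touching the executor model.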
State-of-the-art benchmark performance
EAGLET was tested on three commonly used benchmarks for long-horizon agent tasks: ScienceWorld, which simulates science experiments in a text-based laboratory environment; ALFWorld, which tasks agents with performing household tasks using natural language in a simulated home setting; and WebShop, which assesses goal-oriented behavior in a realistic online shopping interface.
In all three cases, EAGLET-equipped executor agents outperformed their planner-free counterparts and other planning baselines, including MPO and KnowAgent.
In experiments with the open-source Llama-3.1-8B-Instruct model, EAGLET improved average performance from 39.5 to 59.4, an increase of +19.9 points across all tasks.
On unseen ScienceWorld scenarios, performance rose from 42.2 to 61.6.
On seen ALFWorld scenarios, EAGLET improved results from 22.9 to 54.3, a performance gain of more than 2.3x.
Even greater increases were observed for more competent models.
For example, GPT-4.1 improved from 75.5 to 82.2 with EAGLET, and GPT-5 rose from 84.5 to 88.1 despite already performing well.
In some tests, the performance gain was as much as +11.8 points, such as when combining EAGLET with the ETO execution method on ALFWorld’s unseen tasks.
Compared to other planning baselines such as MPO, EAGLET consistently delivered higher task completion rates. For example, on ALFWorld’s unseen tasks with GPT-4.1, MPO achieved 79.1 while EAGLET achieved 83.6, an advantage of +4.5 points.
Additionally, the paper reports that agents using EAGLET complete tasks in fewer steps on average. With GPT-4.1 as executor, the average number of steps dropped from 13.0 (without planning) to 11.1 (with EAGLET). For GPT-5, it dropped from 11.4 to 9.4, supporting the claim of more efficient execution.
Increased efficiency in training and execution
Compared to RL-based methods such as GiGPO, which can require hundreds of training iterations, EAGLET achieved better or comparable results with approximately one-eighth of the training effort.
This efficiency carries over to execution: agents using EAGLET typically needed fewer steps to complete tasks, which translates into lower inference time and compute costs in production scenarios.
No public code – yet
As of the version submitted to arXiv, the authors have not yet published an open-source implementation of EAGLET. It’s unclear if or when the code will be released, under what license, or how it will be maintained, which could limit the framework’s short-term usefulness for enterprise deployment.
VentureBeat has reached out to the authors to clarify these points and will update this article when we hear back.
Questions remain about enterprise implementation
Although the planner is described as plug-and-play, it is unclear whether EAGLET can be easily integrated with popular enterprise agent platforms such as LangChain or AutoGen, or whether it requires a custom stack to support plan-executor separation.
Similarly, the training setup uses multiple executor agents, which may be complex to reproduce in enterprise environments with limited model access. VentureBeat asked the researchers whether the homologous consensus filtering method could be adapted for teams that only have access to a single executor model or limited computing resources.
The EAGLET authors report success with models of various types and sizes, but it is not yet known at what minimum model scale the planner remains practical. For example, can enterprise teams effectively run the planner on open models under 10B parameters in latency-sensitive environments? Additionally, the framework may offer industry-specific value in areas such as customer service or IT automation, but it remains to be seen how easily the planner can be adapted to such domains.
Real-time vs. pre-generated planning
Another open question is how best to deploy EAGLET in practice. Should the planner run in real time alongside executors in the loop, or is it better to run it offline to pre-generate global plans for known task types? Each approach has implications for latency, cost, and operational complexity. VentureBeat posed this question to the authors and will report any insights that emerge.
Strategic trade-offs for enterprise teams
For technical leaders at mid-sized and large enterprises, EAGLET provides a compelling proof of concept for improving the reliability and performance of LLM agents. However, without public tooling and implementation guidelines, the framework still presents a build-or-wait decision. Companies must weigh the potential gains in task efficiency and performance against the costs of replicating or approximating the training process in-house.
Potential use cases in enterprise settings
For enterprises developing agent-based AI systems – especially in environments requiring multi-step planning, such as IT automation, customer service, or online interactions – EAGLET offers a template for adding planning without retraining. Its ability to work with both open- and closed-source models, along with its efficient training method, could make it an attractive starting point for teams looking to improve agent performance with minimal overhead.
