Sunday, March 8, 2026

Google’s new framework helps AI agents spend their compute and tool budgets more wisely


In a new paper studying tool use in agents built on large language models (LLMs), researchers from Google and the University of California, Santa Barbara have developed a platform that enables agents to use tools and computational budgets more efficiently. The researchers introduce two new techniques: a simple budget tracker and a more comprehensive framework called budget-aware test-time scaling. These techniques make agents explicitly aware of their remaining reasoning and tool-use allowance.

Because AI agents rely on tool calls to act in the real world, test-time scaling is less about smarter models and more about controlling costs and latency.

For enterprise leaders and developers, budget-conscious scaling techniques provide a practical path to deploying effective AI agents without incurring unpredictable costs or diminishing returns on compute spending.

The challenge of scaling tool use

Conventional test-time scaling focuses on letting models “think” for longer. For agent-based tasks such as web browsing, however, the number of tool calls directly determines the depth and breadth of exploration.

This carries significant operational costs for businesses. “Tool calls such as web browsing consume more tokens, increase context length, and introduce additional latency,” Zifeng Wang and Tengxiao Liu, co-authors of the paper, told VentureBeat. “Tool calls themselves add additional API costs.”

The researchers found that simply giving agents more resources at test time does not guarantee better performance. “For deep research tasks, if an agent has no sense of budget, it often explores blindly,” Wang and Liu explained. “It finds one slightly related lead, then makes 10 or 20 tool calls digging deeper into it, only to realize that the entire path was a dead end.”

Resource optimization with Budget Tracker

To assess how they could optimize budgets for tool use, the researchers first tried a lightweight approach called “Budget Tracker.” This module acts as a plug-in that gives the agent a constant signal about resource availability, allowing it to use tools sparingly.

The team hypothesized that “providing clear budget signals allows the model to internalize resource constraints and adapt strategy without requiring additional training.”

Budget Tracker works purely at the prompt level, which makes it easy to implement. (The paper provides the full prompts used in Budget Tracker, making it straightforward to reproduce.)

In Google’s implementation, the tracker provides prompt guidelines describing budget regimes and corresponding recommendations for tool use. At each step of the reasoning process, Budget Tracker explicitly informs the agent of its resource usage and remaining budget, allowing the agent to ground subsequent reasoning steps in the updated resource status.
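A prompt-level tracker of this kind can be sketched in a few lines. The class below is a minimal illustration, not the paper’s code: the tool names, budget caps, and the `status_message` string are all hypothetical stand-ins for the budget prompts the paper describes.

```python
from dataclasses import dataclass, field

@dataclass
class BudgetTracker:
    """Prompt-level budget tracker: counts tool calls and renders a
    status message the agent sees before every reasoning step.
    Caps and tool names are illustrative assumptions."""
    max_search_calls: int = 30
    max_browse_calls: int = 30
    used: dict = field(default_factory=lambda: {"search": 0, "browse": 0})

    def record(self, tool: str) -> None:
        # Called after each tool invocation to update usage.
        self.used[tool] += 1

    def remaining(self, tool: str) -> int:
        cap = self.max_search_calls if tool == "search" else self.max_browse_calls
        return max(cap - self.used[tool], 0)

    def status_message(self) -> str:
        # Appended to the agent's context each step, so the model can
        # condition its next action on the remaining budget.
        return (
            f"[Budget] search: {self.used['search']} used, "
            f"{self.remaining('search')} left | "
            f"browse: {self.used['browse']} used, "
            f"{self.remaining('browse')} left."
        )

tracker = BudgetTracker()
tracker.record("search")
print(tracker.status_message())
```

Because the tracker only rewrites the prompt, it can wrap any existing agent loop without retraining, which is what makes it a plug-in.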

To test this, the researchers experimented with two paradigms: sequential scaling, in which the model iteratively refines its output, and parallel scaling, in which multiple independent runs are performed and aggregated. They conducted experiments on search agents equipped with search and browsing tools using a ReAct-style loop. ReAct (Reason + Act) is a popular method in which the model alternates between internal thinking and external actions. To track the true cost-performance scaling trend, they developed a unified cost metric that jointly accounts for internal token consumption and interaction with external tools.
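Such a unified cost metric can be approximated as a weighted sum of token spend and per-call tool fees. The sketch below is only an illustration of the idea; the prices and the example run are placeholder assumptions, not the paper’s actual rates or results.

```python
def unified_cost(input_tokens: int, output_tokens: int,
                 search_calls: int, browse_calls: int,
                 price_in: float = 1.25e-6, price_out: float = 1.0e-5,
                 price_search: float = 0.01, price_browse: float = 0.005) -> float:
    """Joint cost (in dollars) of internal token consumption and
    external tool use. All prices are illustrative placeholders."""
    token_cost = input_tokens * price_in + output_tokens * price_out
    tool_cost = search_calls * price_search + browse_calls * price_browse
    return token_cost + tool_cost

# One hypothetical run: 100k input tokens, 5k output tokens,
# 10 search calls, and 4 browse calls.
print(f"${unified_cost(100_000, 5_000, 10, 4):.3f}")  # prints "$0.295"
```

Measuring both axes with one number is what lets the authors compare sequential and parallel scaling on a single cost-versus-accuracy curve.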

They tested Budget Tracker on three information-seeking QA datasets requiring external search, including BrowseComp and HLE-Search, using models such as Gemini 2.5 Pro, Gemini 2.5 Flash, and Claude Sonnet 4. Experiments show that this simple plugin improves performance under various budget constraints.

“Adding Budget Tracker achieves comparable accuracy, reducing search calls by 40.4%, browsing calls by 19.9%, and reducing total cost… by 31.3%,” the authors told VentureBeat. Moreover, Budget Tracker continued to scale as the budget increased, while vanilla ReAct plateaued after reaching a certain threshold.

BATS: A Comprehensive Framework for Budget-Aware Scaling

To push resource-aware tool use further, the researchers introduced Budget-Aware Test-Time Scaling (BATS), a framework designed to maximize agent performance within any given budget. BATS maintains a continuous signal about remaining resources and uses it to dynamically adjust the agent’s behavior while formulating a response.

BATS uses multiple modules to coordinate agent activities. The planning module gradually adjusts efforts to the current budget, while the verification module decides whether to “drill down” on a promising lead or “pivot” to alternative paths based on resource availability.

Given an information-seeking question and a starting budget, BATS begins by using the planning module to formulate a structured action plan and decide which tools to use. When tools are invoked, their responses are folded into the reasoning sequence to supply the context with fresh evidence. Once the agent proposes a candidate answer, the verification module checks it and decides whether to continue the current sequence or start a new trial with the remaining budget.

The iterative process ends when budget resources are exhausted, at which point the LLM-as-judge selects the best answer among all verified answers. Throughout execution, the budget tracker continuously updates both resource utilization and remaining budget in each iteration.
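The overall loop described above can be sketched as follows. This is a toy illustration of the control flow only: `plan`, `run_react_trial`, `verify`, and `judge` are hypothetical stubs standing in for the paper’s LLM-backed planning, verification, and judging modules, and the integer budget is a stand-in for the unified cost metric.

```python
def plan(question: str, remaining: int) -> int:
    # Toy planner: plan depth shrinks as the remaining budget shrinks.
    return max(1, remaining // 4)

def run_react_trial(question: str, depth: int, remaining: int):
    # Stand-in for a ReAct-style trial; returns (candidate, cost consumed).
    cost = min(remaining, depth)
    return f"answer (depth {depth})", cost

def verify(question: str, answer: str) -> bool:
    # Stand-in for the LLM-backed verification module.
    return True

def judge(question: str, answers: list) -> str:
    # Stand-in for the LLM-as-judge picking among verified answers.
    return answers[-1] if answers else "no answer"

def bats(question: str, total_budget: int) -> str:
    """Budget-aware loop: plan, run a trial, verify, and repeat until
    the budget is exhausted; the judge then picks the final answer."""
    spent, verified = 0, []
    while spent < total_budget:
        remaining = total_budget - spent
        depth = plan(question, remaining)
        answer, cost = run_react_trial(question, depth, remaining)
        spent += cost  # budget tracker: usage updated every iteration
        if verify(question, answer):
            verified.append(answer)
    return judge(question, verified)

print(bats("toy question", 10))
```

Note how the planner receives the remaining budget on every iteration: early trials get deep plans, while late trials are forced to be shallow, which is the budget-aware behavior the framework is after.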

The researchers tested BATS on the BrowseComp, BrowseComp-ZH, and HLE-Search benchmarks against baselines including standard ReAct and various trained agents. Their experiments show that BATS achieves higher performance with fewer tool calls and lower overall cost than competing methods. Using Gemini 2.5 Pro as the backbone model, BATS achieved 24.6% accuracy on BrowseComp versus 12.6% for standard ReAct, and 27.0% on HLE-Search versus 20.5% for ReAct.

BATS not only improves efficiency under budget constraints but also delivers a better cost-performance ratio. For example, on the BrowseComp dataset, BATS achieved higher accuracy at a cost of roughly 23 cents, compared to baseline parallel scaling, which needed over 50 cents to reach a similar result.

According to the authors, this efficiency makes previously cost-prohibitive workflows viable. “This unlocks a range of long-horizon, data-intensive enterprise applications… such as complex codebase maintenance, due diligence investigations, competitive landscape research, compliance audits, and multi-stage document analysis,” they said.

As enterprises look to deploy agents that manage their own resources, the ability to balance accuracy and cost will become a key design requirement.

“We believe that the link between reasoning and economics will become inextricable,” Wang and Liu said. “In the future, [models] must justify the value.”
