Thursday, March 12, 2026

Open-source computer-use agents rival OpenAI and Anthropic models





A new framework from researchers at the University of Hong Kong (HKU) and collaborating institutions provides an open-source foundation for building robust AI agents that can operate computers. The framework, called OpenCUA, includes the tools, data and recipes for scaling computer-use agents (CUAs).

Models trained with this framework perform strongly on CUA benchmarks, outperforming existing open-source models and competing closely with closed agents from leading AI labs such as OpenAI and Anthropic.

The challenge of building computer-use agents

Computer-use agents are designed to perform tasks on a computer autonomously, from navigating websites to operating complex software. They can also help automate workflows in the enterprise. However, the most capable CUA systems are proprietary, with key details about their training data, architectures and development processes kept private.

“As the lack of transparency limits technical progress and raises safety concerns, the research community needs truly open CUA frameworks to study their capabilities, limitations and risks,” the researchers write in their paper.



At the same time, open-source efforts face their own set of obstacles. There has been no scalable infrastructure for collecting the diverse, large-scale data needed to train these agents. Existing open-source datasets for graphical user interfaces (GUIs) contain limited data, and many research projects provide insufficient detail about their methods, making their work hard to reproduce.

According to the paper, “These limitations collectively hinder progress on general-purpose CUAs and limit meaningful exploration of their scalability, generalization and potential learning approaches.”

Introducing OpenCUA

OpenCUA framework (Source: XLANG Lab at HKU)

OpenCUA is an open-source framework designed to address these challenges by scaling both data collection and the models themselves. At its core is AgentNet, a tool for recording human demonstrations of computer tasks across different operating systems.

The tool streamlines data collection by running in the background on an annotator’s personal computer, capturing screen video, mouse and keyboard inputs, and the underlying accessibility tree, which provides structured information about on-screen elements. This raw data is then processed into “state-action trajectories” that pair a screenshot of the computer (the state) with the corresponding user action (a click, a key press, etc.). Annotators can then review, edit and submit these demonstrations.
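The state-action trajectory idea can be sketched as a simple record type. This is a minimal illustration only; the actual AgentNet schema is not described in the article, so every field name here is an assumption.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Action:
    kind: str        # e.g. "click", "key_press", "scroll" (assumed vocabulary)
    target: str      # on-screen element, described via the accessibility tree
    value: str = ""  # typed text or key name, if any

@dataclass
class Step:
    screenshot_path: str  # the "state": a screenshot of the screen
    a11y_tree: dict       # structured info about on-screen elements
    action: Action        # the user action taken in that state

@dataclass
class Trajectory:
    task: str
    os: str               # "Windows", "macOS" or "Ubuntu"
    steps: List[Step] = field(default_factory=list)

# One recorded step of a hypothetical demonstration
demo = Trajectory(task="Open a file in a text editor", os="Ubuntu")
demo.steps.append(Step("frame_000.png",
                       {"role": "menu", "name": "File"},
                       Action("click", "File menu")))
print(len(demo.steps))  # 1
```

Pairing each screenshot with the action taken in it is what turns a raw screen recording into supervised training examples.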


Using this tool, the researchers collected the AgentNet dataset, which contains more than 22,600 task demonstrations across Windows, macOS and Ubuntu, covering over 200 applications and websites. “This dataset authentically captures the complexity of human behaviors and environmental dynamics from users’ personal computing environments,” the paper notes.

Recognizing that screen-recording tools raise significant data-privacy concerns for enterprises, the researchers designed AgentNet with security in mind. Xinyuan Wang, co-author of the paper and a PhD student at HKU, explained that they implemented a multi-layer privacy-protection framework. “First, annotators themselves can fully observe the generated data … before deciding whether to submit it,” he told VentureBeat. The data then undergoes manual verification for privacy issues and automated scanning by a large model to detect any remaining sensitive content before release. “This layered process ensures enterprise-grade robustness for environments that handle sensitive customer or financial data,” Wang added.

To speed up evaluation, the team also developed AgentNetBench, an offline benchmark that provides multiple correct actions for each step, offering a more efficient way to measure an agent’s performance.
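The offline-benchmark idea of accepting any one of several valid actions per step can be sketched as follows. This is a hypothetical scoring scheme for illustration; the real AgentNetBench metric may differ.

```python
def step_score(predicted: str, correct_actions: set) -> float:
    """Full credit if the predicted action matches any acceptable
    ground-truth action for that step (e.g. menu click vs. shortcut)."""
    return 1.0 if predicted in correct_actions else 0.0

def trajectory_score(predictions, ground_truth) -> float:
    """Average per-step accuracy over a whole task demonstration."""
    scores = [step_score(p, g) for p, g in zip(predictions, ground_truth)]
    return sum(scores) / len(scores)

# Step 2 accepts either clicking "Open" or pressing Ctrl+O
gt = [{"click:File"}, {"click:Open", "key:Ctrl+O"}]
print(trajectory_score(["click:File", "key:Ctrl+O"], gt))  # 1.0
```

Scoring against a set of valid actions per step avoids penalizing an agent for choosing a different but equally correct path, which is why an offline benchmark like this can be both cheap and fair.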

A new recipe for training agents

The OpenCUA framework introduces a novel pipeline for processing data and training computer-use agents. The first step converts raw human demonstrations into clean state-action pairs suitable for training vision-language models (VLMs). However, the researchers found that simply training models on these pairs yields limited performance gains, even with large amounts of data.

OpenCUA chain-of-thought pipeline (Source: XLANG Lab at HKU)

The key insight was to augment these trajectories with chain-of-thought (CoT) reasoning. This process generates a detailed “inner monologue” for each action, covering planning, memory and reflection. The structured reasoning is organized in three levels: a high-level observation of the screen, reflective thoughts that analyze the situation and plan the next steps, and finally a concise, executable action. This approach helps the agent develop a deeper understanding of its tasks.

“We believe that natural language reasoning is crucial for generalizable computer-use foundation models, helping CUAs internalize cognitive capabilities,” the researchers write.
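The three-level reasoning structure described above can be sketched as a single training record. This is illustrative only; the paper’s exact CoT format and serialization are assumptions here.

```python
from dataclasses import dataclass

@dataclass
class ReasonedStep:
    observation: str  # high-level description of the current screen
    thought: str      # reflection: analyze the situation, plan next steps
    action: str       # concise, executable action

step = ReasonedStep(
    observation="The AWS console shows the EC2 dashboard.",
    thought="To launch an instance, I first need to open the launch wizard.",
    action="click('Launch instance')",
)

def to_training_text(s: ReasonedStep) -> str:
    """Serialize the three levels into one supervised target for the VLM
    (hypothetical format)."""
    return f"Observation: {s.observation}\nThought: {s.thought}\nAction: {s.action}"

print(to_training_text(step))
```

Training on the full observation-thought-action text, rather than on the bare action alone, is what injects the “inner monologue” into the model.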

This data-synthesis pipeline is a general framework that companies can adapt to train agents on their own proprietary internal tools. According to Wang, a company can record demonstrations of its proprietary workflows and use the same “reflector” and “generator” pipeline to create the necessary training data. “This allows them to bootstrap a high-performing agent tailored to their internal tools without the need to manually write out the reasoning,” he explained.
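The generator/reflector loop described above might look like the following. In OpenCUA these components are model-based; the stand-in functions, names and string formats here are all hypothetical.

```python
def generate_reasoning(state: str, action: str) -> str:
    # Stand-in for a VLM "generator" that drafts an inner monologue
    # explaining why this action follows from this state.
    return f"The screen shows {state}; to make progress, perform {action}."

def reflect(reasoning: str, action: str) -> bool:
    # Stand-in for a "reflector" that keeps only reasoning consistent
    # with the action actually recorded in the demonstration.
    return action in reasoning

def synthesize(demo):
    """Turn recorded (state, action) pairs into CoT training examples."""
    examples = []
    for state, action in demo:
        reasoning = generate_reasoning(state, action)
        if reflect(reasoning, action):  # discard inconsistent drafts
            examples.append({"thought": reasoning, "action": action})
    return examples

data = synthesize([("the login page", "click('Sign in')")])
print(len(data))  # 1
```

The point of the reflector pass is quality control: generated reasoning is only kept when it is consistent with the action the human actually took, so the synthetic monologues stay grounded in the recorded demonstrations.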

Putting OpenCUA to the test

The researchers used the OpenCUA framework to train a series of open-source VLMs, including Qwen and Kimi-VL variants, ranging from 3 billion to 32 billion parameters. The models were evaluated on a suite of online and offline benchmarks that test their ability to perform tasks and understand GUIs.

The 32-billion-parameter model, OpenCUA-32B, established a new state-of-the-art success rate among open-source models on the OSWorld benchmark. It also outperformed OpenAI’s GPT-4o-based CUA and significantly closed the performance gap with Anthropic’s leading proprietary models.

OpenCUA shows large improvements over base models (left) while competing with leading CUAs (right) (Source: XLANG Lab at HKU)

For enterprise developers and product leaders, the research offers several key takeaways. The OpenCUA method is broadly applicable, improving performance across models of different architectures (both dense and mixture-of-experts) and sizes. The trained agents also show strong generalization, performing well across diverse tasks and operating systems.

According to Wang, the framework is particularly well suited to automating repetitive, labor-intensive enterprise workflows. “For example, in the AgentNet dataset, we already capture several demonstrations of launching EC2 instances on Amazon AWS and configuring annotation parameters on MTurk,” he told VentureBeat. “These tasks involve many sequential steps but follow repeatable patterns.”

However, Wang noted that bridging the gap to live deployment requires solving key challenges around safety and reliability. “The biggest challenge in real-world deployment is safety and reliability: the agent must avoid mistakes that could accidentally alter system settings or cause harmful side effects beyond the intended task,” he said.

The researchers have released the code, dataset and weights for their models.

As open-source agents built on frameworks like OpenCUA become more capable, they could fundamentally change how knowledge workers relate to their computers. Wang envisions a future in which proficiency with complex software matters less than the ability to clearly articulate goals to an AI agent.

He described two primary modes of working: “offline automation, where the agent uses its broader software knowledge to carry a task from end to end,” and “online collaboration, where the agent reacts in real time and works alongside a human, like a colleague.” In essence, humans will supply the strategic “what” while increasingly sophisticated AI agents handle the operational “how.”
