In the coming years, agents are expected to take over more and more chores on behalf of people, including the use of computers and smartphones. For now, however, they are too error-prone to be of much use.
A new agent called S2, created by the startup Simular AI, combines frontier models with models that specialize in using computers. The agent achieves state-of-the-art performance on tasks such as using apps and manipulating files, and suggests that turning to different models in different situations can help agents advance.
“Computer-using agents are different from large language models and different from coding,” says Ang Li, cofounder and CEO of Simular. “It’s a different kind of problem.”
In Simular’s approach, a powerful general-purpose AI model, such as OpenAI’s GPT-4o or Anthropic’s Claude 3.7, is used to reason about how best to perform the task at hand, while smaller open source models step in for tasks such as interpreting web pages.
Li, who was a researcher at Google DeepMind before founding Simular in 2023, explains that large language models are good at planning but not as good at recognizing the elements of a graphical user interface.
S2 is designed to learn from experience through an external memory module, which records user feedback and uses those recordings to improve future actions.
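The division of labor described above can be illustrated with a rough sketch. This is not Simular's code: the names `ExperienceMemory`, `Agent`, `stub_planner`, and `stub_grounder` are hypothetical, and the two stub functions merely stand in for a general-purpose planning model and a smaller interface-reading model. It only shows one way a planner, a specialist, and a feedback memory might fit together.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ExperienceMemory:
    """Hypothetical external memory: stores user feedback per task."""
    notes: dict[str, list[str]] = field(default_factory=dict)

    def record(self, task: str, feedback: str) -> None:
        self.notes.setdefault(task, []).append(feedback)

    def recall(self, task: str) -> list[str]:
        return list(self.notes.get(task, []))


@dataclass
class Agent:
    # Generalist model (assumed): task plus past feedback -> list of steps.
    planner: Callable[[str, list[str]], list[str]]
    # Specialist model (assumed): step plus screen -> concrete UI action.
    grounder: Callable[[str, str], str]
    memory: ExperienceMemory = field(default_factory=ExperienceMemory)

    def run(self, task: str, screen: str) -> list[str]:
        hints = self.memory.recall(task)   # reuse earlier feedback
        steps = self.planner(task, hints)  # high-level reasoning
        return [self.grounder(step, screen) for step in steps]


def stub_planner(task: str, hints: list[str]) -> list[str]:
    # Stand-in for a large model: fixed plan, plus one step per hint.
    steps = [f"open page for: {task}", "click search"]
    return steps + [f"apply feedback: {hint}" for hint in hints]


def stub_grounder(step: str, screen: str) -> str:
    # Stand-in for a small GUI model: ties a step to the current screen.
    return f"{step} @ {screen}"


agent = Agent(planner=stub_planner, grounder=stub_grounder)
first_run = agent.run("find deals", "home")
agent.memory.record("find deals", "sort by price")
second_run = agent.run("find deals", "home")
```

In this toy version, the second run includes an extra action derived from the recorded feedback, which is the general idea of learning from experience; a real system would presumably store and retrieve far richer records.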
On particularly complex tasks, S2 performs better than any other model on OSWorld, a benchmark that measures an agent’s ability to use a computer operating system.
For example, S2 can complete 34.5 percent of tasks that involve 50 steps, beating OpenAI’s Operator, which can complete 32 percent. Similarly, S2 scores 50 percent on AndroidWorld, a benchmark for smartphone-using agents, while the next best agent scores 46 percent.
Victor Zhong, a computer scientist at the University of Waterloo in Canada and one of the creators of OSWorld, believes that future large AI models may incorporate training data that helps them understand the visual world and make sense of graphical user interfaces.
“This will help agents navigate GUIs with much higher precision,” says Zhong. “I think that in the meantime, before such fundamental breakthroughs, state-of-the-art systems will resemble Simular in that they combine multiple models to patch the limitations of individual models.”
To prepare for this column, I used Simular to handle bookings and scour Amazon for deals, and it seemed better than some of the open source agents I tried last year, including AutoGen and vimGPT.
But it seems that even the smartest AI agents are still troubled by edge cases and sometimes exhibit strange behavior. In one case, when I asked S2 to help find contact information for the researchers behind OSWorld, the agent got stuck in a loop, hopping between the project page and the login for OSWorld’s Discord.
The OSWorld benchmark shows why agents remain more hype than reality. While people can complete 72 percent of OSWorld’s tasks, agents are thwarted 38 percent of the time on complex tasks. That said, when the benchmark was introduced in April 2024, the best agent could complete only 12 percent of the tasks.