AI agent demos may seem mind-blowing, but getting the technology to work reliably in real life, without annoying or costly errors, can be a challenge. Current models can answer questions and converse with near-human skill, and they form the basis of chatbots such as OpenAI’s ChatGPT and Google’s Gemini. They can also perform tasks on a computer when given a simple command, either by accessing the screen and input devices such as a keyboard and trackpad, or through low-level software interfaces.
Anthropic claims that Claude outperforms other AI agents in several key tests, including SWE-bench, which measures an agent’s software development skills, and OSWorld, which measures an agent’s ability to use a computer’s operating system. The claims have not yet been independently verified. According to Anthropic, Claude completes OSWorld tasks correctly 14.9 percent of the time. That’s significantly lower than humans, who generally achieve about 75 percent success, but significantly higher than current top agents, including OpenAI’s GPT-4, which succeed about 7.7 percent of the time.
Anthropic says several companies are already testing an agent-based version of Claude. These include Canva, which uses it to automate design and editing tasks, and Replit, which uses the model for coding tasks. Other early adopters include The Browser Company, Asana, and Notion.
Ofir Press, a postdoctoral researcher at Princeton University who helped develop SWE-bench, says agentic AI typically lacks the ability to plan far ahead and often has difficulty recovering from errors. “To demonstrate their usefulness, we need to perform well on difficult and realistic tests,” he says, such as reliably planning a wide range of trips for a user and booking all the necessary tickets.
Kaplan notes that Claude can already troubleshoot some errors surprisingly well. For example, when it hit a terminal error while trying to start a web server, the model knew how to change its command to fix it. It also worked out that it had to enable pop-ups when it reached a dead end while browsing the web.
Many technology companies are currently racing to develop AI agents, chasing market share and prominence. In fact, it may not be long before many users have agents at their fingertips. Microsoft, which has pumped more than $13 billion into OpenAI, says it is testing agents that can use Windows computers. Amazon, which has invested heavily in Anthropic, is exploring how agents could recommend and ultimately purchase goods for its customers.
Sonya Huang, a partner at the venture firm Sequoia who focuses on artificial intelligence companies, says that despite all the hype around AI agents, most companies are actually just rebranding their AI tools. In an interview with WIRED ahead of Anthropic’s announcement, she says the technology currently works best when applied to narrow fields, such as coding work. “You have to choose problem areas where, if the model fails, there is no problem,” she says. “These are the problem spaces where truly home-grown agent companies will emerge.”
A key challenge with agent-based AI is that errors can be much more problematic than a garbled chatbot response. Anthropic has placed certain restrictions on what Claude can do, such as limiting its ability to use a person’s credit card to purchase things.
If mistakes can be avoided well enough, says Princeton University’s Press, users can learn to see artificial intelligence, and computers, in entirely new ways. “I’m very excited about this new era,” he says.
