Even the best artificial intelligence agents are rather hopeless at working independently online, according to an experiment that challenges the idea of AI replacing office workers en masse.
The Remote Labor Index, a new benchmark developed by researchers at the data annotation company Scale AI and the nonprofit Center for AI Safety (CAIS), measures the ability of frontier AI models to automate economically valuable work.
Researchers provided several leading AI agents with a series of simulated freelance jobs and found that even the best could complete less than 3 percent of the work, earning $1,810 of a possible $143,991. The researchers looked at several tools and found Manus from the Chinese startup of the same name to be the most effective, followed by Grok from xAI, Claude from Anthropic, ChatGPT from OpenAI and Gemini from Google.
“I hope this will give a much more accurate idea of what’s happening with AI capabilities,” says Dan Hendrycks, director of CAIS. He adds that while some agents have improved significantly over the past year, that doesn’t mean progress will continue at the same rate.
Spectacular advances in artificial intelligence have led to speculation that AI will soon surpass human intelligence and replace vast numbers of workers. In March, Dario Amodei, CEO of Anthropic, suggested that 90 percent of coding work would be automated within a few months.
Previous waves of AI have inspired erroneous predictions about job displacement, such as the imminent replacement of radiologists by AI algorithms.
The researchers sourced a range of freelance tasks through verified Upwork workers. The tasks cover a wide range of work, including graphic design, video editing, game development, and administrative work such as data scraping. The researchers paired a description of each task with the set of files needed to complete the job and an example of a finished project produced by a human.
Hendrycks says that while AI models have improved at coding, math, and logical reasoning in recent years, they still struggle to use different tools and to perform complicated tasks that involve multiple steps. “They do not have long-term memory and cannot continually learn from experience. They cannot acquire skills on the job like humans,” he says.
The analysis offers a counterpoint to GDPval, a benchmark that OpenAI introduced in September and that is also intended to measure economically valuable work. According to GDPval, frontier AI models like GPT-5 approach human capabilities on 220 tasks across a variety of office occupations. OpenAI did not provide comment.
