The big artificial intelligence companies promised us that 2025 would be the “year of AI agents.” It turned out to be a year of talking about AI agents and kicking the can of transformation down the road to 2026, and maybe beyond. But what if the answer to the question “When will our lives be fully automated by generative AI robots that perform our tasks for us and basically rule the world?” is, like that New Yorker cartoon, “Or maybe never?”
That was essentially the message of a paper published without much fanfare a few months ago, right in the middle of the overhyped year of “agentic AI.” Titled “Hallucination Stations: On Some Basic Limitations of Transformer-Based Language Models,” it sets out to demonstrate mathematically that “LLMs are unable to perform computational and agentic tasks above a certain complexity.” Though the math is beyond me, the authors, a former SAP chief technology officer who studied artificial intelligence under one of the field’s founders, John McCarthy, and his teenage prodigy son, punctured the vision of an agentic paradise with the confidence of mathematics. They argue that even reasoning models, which go beyond the LLM’s pure word-prediction process, won’t solve the problem.
“There’s no way you can rely on them,” the dad, Vishal Sikka, tells me. After a career that, beyond SAP, included stints as CEO of Infosys and as a board member of Oracle, he now heads an AI services startup called Vianai. “So we should forget about AI agents running nuclear plants?” I ask. “Exactly,” he says. You might be able to get one to file some paperwork to save you time, but you’ll have to put up with some mistakes.
The artificial intelligence industry begs to differ. For one thing, coding has been a massive success story for agentic AI, which has gained momentum over the past year. Just this week in Davos, Google’s Nobel Prize-winning AI chief, Demis Hassabis, reported breakthroughs in minimizing hallucinations, and hyperscalers and startups alike are pushing the agent narrative. Now they have some support: a startup called Harmonic is reporting a breakthrough in AI coding that is likewise grounded in mathematics and leads reliability benchmarks.
Harmonic, co-founded by Robinhood CEO Vlad Tenev and Tudor Achim, a Stanford-educated mathematician, says that a recent improvement to its product, called Aristotle (no hubris!), shows there are ways to ensure the trustworthiness of AI systems. “Are we doomed to a world where artificial intelligence just generates errors and humans can’t actually check it? That would be a crazy world,” says Achim. Harmonic’s solution involves applying formal mathematical reasoning to check LLM outputs. Specifically, it encodes the output in the Lean programming language, which is known for its ability to verify code. To be sure, Harmonic’s ambitions have so far been narrow: its core mission has been the pursuit of “mathematical superintelligence,” and coding is in some sense an organic extension. Things like history essays, which cannot be verified mathematically, are beyond its reach. For now.
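To get a feel for why Lean is attractive here, consider a toy sketch (my illustration, not Harmonic’s actual pipeline): in Lean, a statement is accepted only if an accompanying proof passes the mechanical checker, so a hallucinated “fact” simply fails to compile.

```lean
-- Toy illustration of machine-checked claims in Lean 4
-- (not Harmonic's pipeline; just the general idea).

-- A concrete computation Lean verifies by evaluation:
example : 2 + 2 = 4 := rfl

-- A general claim, proved via a standard-library lemma;
-- Lean's kernel checks the proof before accepting the theorem:
theorem add_comm_checked (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- A false claim has no proof; uncommenting this line
-- makes the file fail to compile:
-- example : 2 + 2 = 5 := rfl
```

The appeal of this design is that correctness is enforced by the proof checker rather than asserted by the model: an LLM can propose code or a proof, but Lean accepts only what it can verify.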
Nevertheless, Achim does not seem to think that agent reliability is as massive a problem as some critics believe. “I would say that most models at this stage have the level of pure intelligence required to make decisions when booking an itinerary,” he says.
Maybe both sides are right, and even on the same side. Certainly everyone agrees that hallucinations will continue to be a vexing reality. In a paper published last September, OpenAI researchers wrote: “Despite significant progress, hallucinations continue to plague the field and are still present in recent models.” They backed up this unhappy claim by asking three models, including ChatGPT, for the title of the lead author’s dissertation. All three invented false titles, and all got the year of publication wrong. In a blog post about the paper, OpenAI grimly stated that “accuracy will never reach 100 percent” in AI models.
