For more than a decade, conversational AI has promised human-like assistants that can do more than chat. Yet even as large language models (LLMs) like ChatGPT, Gemini, and Claude learn to reason, explain, and code, one critical category of interaction remains largely unsolved: reliably completing tasks for humans outside the chat.
Even the best AI models score only around the 30th percentile on a hard external benchmark designed to evaluate AI agents on a variety of browser-based tasks, well below the reliability most enterprises and users require. Task-specific tests such as τ-bench airline, which measures how reliably AI agents find and book flights on behalf of a user, don't show much higher pass rates either: just 56% for the top agents and models (Claude 3.7 Sonnet), meaning the agent fails almost half the time.
New York-based Augmented Intelligence (AUI) Inc. co-founders Ohad Elhelo and Ori Cohen believe a solution has finally arrived: one that boosts AI agent reliability to a level where most enterprises can trust agents to do as instructed, reliably.
The company's new foundation model, called Apollo-1, which remains in early testing but is approaching its general release, is built on a principle it calls stateful neuro-symbolic reasoning.
It is a hybrid architecture advocated even by LLM skeptics like Gary Marcus, designed to ensure consistent, policy-compliant results across every customer interaction.
“Conversational AI is essentially two halves,” Elhelo said in a recent interview with VentureBeat. “The first half, open dialogue, is handled beautifully by LLMs. They are designed for creative or exploratory use cases. The second half is task-oriented dialogue, where there is always a specific purpose behind the conversation. That half has been left unsolved, because it requires certainty.”
AUI defines that certainty as the difference between an agent that “probably” performs a task and one that “almost always” performs it.
On τ-bench airline, for example, Apollo-1 operates at a staggering 92.5% pass rate, leaving all current competitors far behind, according to benchmarks shared with VentureBeat and posted on AUI's website.
Elhelo gave simple examples: a bank that must enforce ID verification on refunds over $200, or an airline that must always offer a business class upgrade before booking economy.
“It’s not a preference,” he said. “These are requirements. And no purely generative approach can provide this kind of behavioral certainty.”
AUI's work on improving reliability was previously covered by the subscription outlet The Information, but has not yet received widespread coverage in publicly available media.
From pattern matching to predictable action
The team claims that transformer models, by design, cannot meet this bar. Large language models generate probable text, not guaranteed behavior. “When you tell an LLM to always offer insurance before payment, it usually complies,” Elhelo said. “Configure Apollo-1 with that rule and it happens every time.”
This distinction, he said, comes from the architecture itself. Transformers predict the next token in a sequence. Apollo-1, by contrast, predicts the next action in a conversation, acting on what AUI calls an explicit symbolic state.
Cohen explained the idea in more technical terms. “Neuro-symbolic means we combine the two dominant paradigms,” he said. “The symbolic layer provides structure: it knows what an intent, an entity, and a parameter are, while the neural layer provides linguistic fluency. The neuro-symbolic reasoner sits between them. It's a different kind of brain for dialogue.”
Where transformers treat each output as text generation, Apollo-1 runs a closed reasoning loop: an encoder translates natural language into symbolic state, a state machine maintains that state, a decision engine determines the next action, a scheduler executes it, and a decoder converts the result back into language. “The process is iterative,” Cohen said. “We loop until the task is done. That way you get determinism instead of probability.”
A foundation model for performing tasks
Unlike conventional chatbots or custom automation systems, Apollo-1 is designed to serve as a foundation model for task-oriented dialogue: a single, domain-agnostic system that can be configured for banking, travel, retail, or insurance through what AUI calls a system signature.
“The system signature is not a configuration file,” Elhelo said. “It’s a behavioral contract. You define exactly how your agent must behave in situations of interest, and Apollo-1 ensures those behaviors are carried out.”
Organizations use the signature to encode symbolic state (intents, parameters, and rules) as well as the boundaries of state-dependent tools and policies.
For example, a food delivery app might enforce “if an allergy is mentioned, always inform the restaurant,” while a telecommunications provider might define “after three failed payment attempts, suspend service.” In both cases, the behavior is enforced deterministically, not statistically.
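The two rules above can be pictured as a toy rule engine. This is a minimal sketch of the "behavioral contract" idea under stated assumptions: the `Rule` structure and function names are hypothetical, since AUI has not published its system-signature format.

```python
# Toy illustration of deterministic rule enforcement.
# The Rule/mandatory_actions names are hypothetical, not AUI's API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    condition: Callable[[dict], bool]  # predicate over the dialogue state
    action: str                        # action that MUST fire when true

RULES = [
    # "If an allergy is mentioned, always inform the restaurant."
    Rule(lambda s: s.get("allergy_mentioned", False), "inform_restaurant"),
    # "After three failed payment attempts, suspend service."
    Rule(lambda s: s.get("failed_payments", 0) >= 3, "suspend_service"),
]

def mandatory_actions(state: dict) -> list[str]:
    """Every matching rule fires, every time: no sampling involved."""
    return [r.action for r in RULES if r.condition(state)]

print(mandatory_actions({"allergy_mentioned": True}))
# -> ['inform_restaurant']
print(mandatory_actions({"failed_payments": 3, "allergy_mentioned": True}))
# -> ['inform_restaurant', 'suspend_service']
```

The contrast with a purely generative model is that nothing here is a probability: a rule whose condition holds always produces its action, which is the behavioral guarantee the article describes.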
Eight years in the making
AUI's path to Apollo-1 began in 2017, when the team started encoding millions of real, task-oriented conversations handled by its 60,000-person agent workforce.
That work produced a symbolic language capable of separating procedural knowledge (steps, constraints, and flows) from declarative knowledge such as entities and attributes.
“The insight was that task-oriented dialogue has universal procedural patterns,” Elhelo said. “Food delivery, claims processing and order management all have similar structures. Once you model it explicitly, you can compute it deterministically.”
From there, the company built a neuro-symbolic reasoner: a system that uses the symbolic state to decide what happens next, rather than guessing via token prediction.
Benchmarks suggest that architecture makes a measurable difference.
In AUI's own evaluations, Apollo-1 achieved 90 percent task completion on the τ-bench airline benchmark, compared to 60 percent for Claude 4.
It completed 83 percent of live booking chats on Google Flights versus 22 percent for Gemini 2.5 Flash, and 91 percent of retail scenarios on Amazon versus 17 percent for Rufus.
“These are not incremental improvements,” Cohen said. “They are order-of-magnitude differences in reliability.”
A complement, not a competitor
AUI does not position Apollo-1 as a replacement for large language models, but as a necessary complement. In Elhelo's words: “Transformers optimize for creative plausibility. Apollo-1 optimizes for behavioral certainty. Together they cover the full spectrum of conversational AI.”
The model is already running in narrow pilots with undisclosed Fortune 500 companies in a variety of sectors, including finance, travel and retail.
AUI also confirmed a strategic partnership with Google and plans general availability in November 2025, when it will open its APIs, publish full documentation, and add voice and video capabilities. Interested prospects and partners can register for updates through a form on the AUI website.
Until then, the company is keeping details under wraps. When asked what comes next, Elhelo smiled. “Let's say we're preparing an announcement,” he said. “Soon.”
Towards conversations that work
For all its technical sophistication, the Apollo-1 pitch is simple: build AI that companies can trust to act, not just talk. “We are on a mission to democratize access to artificial intelligence that works,” Cohen said at the end of the interview.
Whether Apollo-1 becomes the new standard for task-oriented dialogue remains to be seen. But if AUI's architecture works as promised, the long-standing divide between chatbots that sound human and agents that reliably do human work may finally begin to close.
