The biggest headline in AI right now isn't model size or multimodality: it's efficiency. During VentureBeat's recent AI Impact stop in New York, Val Bercovici, director of AI at WEKA, joined Matt Marshall, CEO of VentureBeat, to discuss what it really takes to scale AI in the face of rising latency, cloud dependence, and runaway costs.
These forces, Bercovici argued, are pushing AI toward its own version of surge pricing. Uber became famous for introducing surge pricing, bringing real-time market rates to ride-sharing for the first time. Now, Bercovici argued, AI is heading toward the same economic calculus, especially in inference, as the focus shifts to profitability.
"We don't have real market rates today; we have subsidized rates. This was necessary to enable the many innovations that have taken place, but sooner or later, given the trillions of dollars in capital expenditure we're talking about now and finite operating budgets for energy, there will be real market rates: perhaps next year, and certainly by 2027," he said. "When that happens, it will fundamentally change the industry and drive an even deeper and greater focus on efficiency."
The economics of the token explosion
“The first rule is that this is an industry where more is more. More tokens equals exponentially more business value,” Bercovici said.
But so far, no one has figured out how to make it sustainable. The classic business triad of cost, quality, and speed translates in AI into cost, latency, and accuracy, particularly for output tokens. And accuracy is non-negotiable. That applies not only to consumer interactions with agents like ChatGPT, but also to high-stakes use cases such as drug discovery, and to enterprise workflows in heavily regulated industries like financial services and healthcare.
"This is non-negotiable," Bercovici said. "To get high inference accuracy, you need a large number of tokens, especially when you add security, guardrail models, and quality models into the mix. The trade-off comes in latency and cost. That's where you have some flexibility. If you can tolerate high latency, and sometimes you can for consumer use cases, you can get lower costs with free tiers and low-cost plus tiers."
However, latency is a critical bottleneck for AI agents. "Agents don't really operate individually right now. It's either a swarm of agents or no agents at all," Bercovici noted.
In a swarm, groups of agents work in parallel to achieve a larger goal. The orchestrator agent, the most capable model, takes center stage: it determines the subtasks and the key requirements, such as architecture choice, cloud or on-premises execution, performance constraints, and security considerations. The swarm then executes the subtasks, effectively acting as many concurrent inference users in parallel sessions. Finally, evaluator models assess whether the overall task has been completed successfully.
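A minimal sketch of that orchestrator, worker, and evaluator loop might look like the following. The model names and the llm_call helper are hypothetical placeholders, not WEKA's or any particular framework's API; they only illustrate the structure Bercovici describes.

```python
from concurrent.futures import ThreadPoolExecutor

def llm_call(model: str, prompt: str) -> str:
    """Hypothetical stand-in for a single inference call to some model endpoint."""
    return f"[{model} response to: {prompt[:40]}...]"

def run_swarm(task: str, max_turns: int = 10) -> str:
    # 1. The orchestrator (the most capable model) decomposes the task into subtasks
    #    and fixes the key requirements: architecture, cloud vs. on-prem, latency, security.
    plan = llm_call("orchestrator-model", f"Break this task into subtasks: {task}")
    subtasks = [line for line in plan.splitlines() if line.strip()]
    results: list[str] = []

    for _ in range(max_turns):
        # 2. Worker agents execute subtasks in parallel, behaving like many
        #    concurrent inference users in separate sessions.
        with ThreadPoolExecutor() as pool:
            results = list(pool.map(
                lambda st: llm_call("worker-model", f"Solve: {st}"), subtasks))

        # 3. An evaluator model judges whether the combined output completes the task.
        verdict = llm_call("evaluator-model",
                           f"Task: {task}\nResults: {results}\nDone? Answer YES or NO with feedback.")
        if verdict.strip().upper().startswith("YES"):
            break

        # Otherwise, feed the evaluator's feedback back in for another turn.
        subtasks = [f"{st}\nFeedback: {verdict}" for st in subtasks]

    return "\n".join(results)
```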
“These swarms go through what are called multiple turns, hundreds if not thousands of prompts and responses, until the swarm comes together to respond,” Bercovici said.
"And if latency compounds across those thousands of turns, it becomes unsustainable. So latency is really, really important. And that means you usually have to pay a high price today, which is subsidized, and that's what will have to come down over time."
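A rough back-of-envelope calculation shows why per-turn latency compounds so painfully across a multi-turn swarm run. The numbers below are illustrative assumptions, not figures from the talk:

```python
# Illustrative only: end-to-end time grows linearly with turn count,
# so every millisecond shaved per turn pays off across the whole workflow.
turns = 2_000               # prompts and responses in one multi-turn swarm run
latency_per_turn_s = 1.5    # seconds per inference call at a given price tier

total_time_s = turns * latency_per_turn_s
print(f"{total_time_s / 3600:.2f} hours per completed task")  # -> 0.83 hours
```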
Reinforcement learning as a new paradigm
Bercovici explained that until May of this year, agents were not that effective. Then context windows became large enough, and GPUs plentiful enough, to support agents that could perform advanced tasks such as writing reliable software. It is now estimated that, in some cases, 90% of software is generated by coding agents. Bercovici noted that now that agents have essentially come of age, reinforcement learning is an emerging topic among data scientists at leading labs such as OpenAI, Anthropic, and Google's Gemini team, who see it as a key path forward for AI innovation.
"The current AI season is reinforcement learning. It combines multiple elements of training and inference into one unified workflow," Bercovici said. "This latest and greatest scaling law points toward the mythical achievement we are all trying to reach: AGI, artificial general intelligence," he added. "What's fascinating to me is that you have to apply all the best practices in model training, as well as all the best practices in model inference, to be able to iterate through thousands of reinforcement learning loops and advance the entire field."
The path to AI profitability
Bercovici said there is no single answer when it comes to building the infrastructure foundations that will make AI viable, because it is still an emerging field; there is no cookie-cutter approach. Going on-premises may be the right choice for some, especially frontier model builders, while cloud or hybrid deployments may be a better path for organizations that want to innovate in a nimble, responsive way. Whichever path they choose initially, organizations will need to adapt their AI infrastructure strategy as their business needs evolve.
“Unit economics are crucial here,” Bercovici said. “We’re definitely in a boom phase, or even a bubble, you might say in some cases, because the basic economics of AI are subsidized. But that doesn’t mean that if tokens become more expensive, you’re going to stop using them. You’re just going to start being very specific about how you use them.”
Leaders should focus less on the cost of individual tokens and more on the economics at the transaction level, where efficiency and impact become apparent, Bercovici concluded.
Bercovici said the critical question that enterprises and AI companies should be asking themselves is: "What is my real unit economic cost?"
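Framed at the transaction level, that question reduces to tokens consumed across all turns times the market rate per token, measured against the value each completed task returns. The sketch below is purely illustrative; the turn counts and prices are assumptions, not figures from the talk:

```python
# Illustrative unit-economics sketch: cost of one agentic transaction.
turns = 500                      # prompts and responses per completed task
tokens_per_turn = 4_000          # input + output tokens per turn
price_per_million_tokens = 3.00  # assumed blended rate in dollars per 1M tokens

cost_per_transaction = turns * tokens_per_turn * price_per_million_tokens / 1_000_000
print(f"${cost_per_transaction:.2f} per completed task")  # -> $6.00
```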
Seen from this perspective, the future isn't about limiting AI activity; it's about doing it smarter and more efficiently at scale.
