Saturday, March 14, 2026

The inference trap: How cloud providers eat AI margins


This article is part of VentureBeat’s special issue, “The Real Cost of AI: Performance, efficiency and ROI at scale.” Read more from this special issue.

AI has become the holy grail of modern companies. Whether it’s customer service or something as niche as pipeline maintenance, organizations in every domain are now implementing AI technologies – from foundation models to proprietary ones – to make things more efficient. The goal is simple: automate tasks to deliver outcomes more efficiently while saving money and resources at the same time.

However, as these projects move from the pilot to the production stage, teams hit an obstacle they hadn’t planned for: cloud costs eroding their margins. The sticker shock is so bad that what once felt like the fastest path to innovation and competitive advantage becomes an unsustainable budget black hole – in the blink of an eye.

This prompts CIOs to rethink everything – from model architecture to deployment models – to regain control over the financial and operational aspects. Sometimes, they even shelve projects entirely and start over from scratch.

But here’s the thing: while the cloud can drive costs to untenable levels, it is not the villain. You just have to understand which type of vehicle (AI infrastructure) to choose for which road (the workload).

The cloud story – and where it works

The cloud is a lot like public transport (the metro and buses). You hop on with a simple rental model, and it instantly gives you all the resources – from GPU instances to fast scaling across various geographies – to take you to your destination, all with minimal work and setup.

The quick, easy access via a service model ensures a seamless start, paving the way to get the project off the ground and run rapid experimentation without the huge upfront capital expense of acquiring specialized GPUs.

Most early-stage startups find this model lucrative, as they need fast turnaround more than anything else, especially while they are still validating the model and determining product-market fit.

“You make an account, click a few buttons and gain access to servers. If you need a different GPU size, you shut down and restart the instance with the new specifications, which takes minutes. If you want to run two experiments at once, you initialize two separate instances. In the early stages, the focus is on validating ideas quickly. Using the built-in scaling and experimentation frameworks that most cloud platforms provide helps reduce the time between milestones,” Sarin, who leads the voice AI product at Speechmatics, told VentureBeat.

The cost of “ease”

While the cloud makes perfect sense for early-stage use, the infrastructure math turns grim once a project moves from testing and validation to real-world volumes. The scale of the workloads makes the bills brutal – so much so that costs can increase by over 1,000% overnight.

This is especially true in the case of inference, which not only has to run 24/7 to ensure uptime but also scale with customer demand.

In most cases, Sarin explains, inference demand spikes when other customers are also requesting GPU access, increasing the competition for resources. In such cases, teams either keep reserved capacity to make sure they get what they need – leading to idle GPU time during off-peak hours – or suffer from latency, degrading the user experience.
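To see why reserved capacity stings, consider a back-of-the-envelope model: the lower your off-peak utilization, the more each useful GPU-hour effectively costs. This is a minimal sketch; all rates and utilization figures are assumptions, not quoted prices.

```python
# Back-of-the-envelope model of the reserved-capacity tradeoff described
# above: pay for GPUs that sit idle off-peak, or risk latency on shared
# on-demand capacity. All rates and utilization figures are assumptions.

RESERVED_RATE = 2.50   # assumed $/GPU-hour with a long-term reservation
ON_DEMAND_RATE = 4.00  # assumed $/GPU-hour on demand, when available

def effective_cost(hourly_rate: float, utilization: float) -> float:
    """Cost per *utilized* GPU-hour when paying for a 24/7 reservation."""
    return hourly_rate / utilization

for util in (0.35, 0.60, 0.90):
    print(f"{util:.0%} utilization -> "
          f"${effective_cost(RESERVED_RATE, util):.2f} per useful GPU-hour")

# A reservation only beats the on-demand rate above this utilization level:
print(f"Break-even utilization: {RESERVED_RATE / ON_DEMAND_RATE:.0%}")
```

With these assumed rates, a reservation that sits at 35% utilization effectively costs over $7 per useful GPU-hour, far above the on-demand sticker price.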

Christian Khoury, CEO of AI compliance platform EasyAudit AI, described inference as the new “cloud tax,” telling VentureBeat that he has seen companies go from $5,000 to $50,000 per month, purely from inference traffic.

It is also worth noting that inference workloads involving LLMs, with token-based pricing, can drive the steepest cost increases. This is because these models are non-deterministic and can generate different outputs when handling long-running tasks (involving large context windows). With continuous updates, it becomes very difficult to forecast or control LLM inference costs.
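A toy calculation makes that unpredictability concrete. In the sketch below, all token prices are placeholders rather than any vendor’s actual rates; the point is that the same request volume produces a very different monthly bill once responses drift toward long, large-context generations.

```python
# Why token-priced LLM inference is hard to forecast: the same request
# volume can yield very different bills depending on output length.
# Prices are placeholders, not any vendor's actual rates.

PRICE_IN = 3.00 / 1_000_000    # assumed $ per input token
PRICE_OUT = 15.00 / 1_000_000  # assumed $ per output token (usually pricier)

def monthly_cost(requests_per_day: int, in_tokens: int,
                 out_tokens: int, days: int = 30) -> float:
    per_request = in_tokens * PRICE_IN + out_tokens * PRICE_OUT
    return requests_per_day * per_request * days

# Same 50,000 requests/day; only the token profile changes.
print(f"Short answers:       ${monthly_cost(50_000, 1_000, 150):,.0f}/month")
print(f"Long, large-context: ${monthly_cost(50_000, 8_000, 1_200):,.0f}/month")
```

Under these assumptions, the bill jumps roughly eightfold – from about $7,875 to $63,000 a month – without a single extra request.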

Training of these models, for its part, tends to be “bursty” (occurring in clusters), which leaves some room for capacity planning. However, even in these cases, especially as growing competition forces frequent retraining, enterprises can end up with massive bills from idle GPU time caused by overprovisioning.

“Training runs on cloud platforms are expensive, and frequent retraining during fast iteration cycles can escalate costs quickly. Long training runs require access to large machines, and most cloud providers guarantee that access only if you reserve capacity for a year or more. If your training run lasts only a few weeks, you still pay for the rest of the year,” Sarin explained.

And it doesn’t stop there. Cloud lock-in is very real. Suppose you have made a long-term reservation and bought credits from a provider. In that case, you are locked into their ecosystem and must use whatever they have on offer, even when other providers have moved to newer, better infrastructure. And, finally, when you do get the chance to move, you may have to pay massive egress fees.

“It’s not just the compute cost. You get… unpredictable autoscaling, and insane egress fees if you move data between regions or vendors. One team was paying more to move data than to train their models,” Sarin emphasized.

So, what’s the workaround?

Given the constant infrastructure demand of scaled AI inference and the bursty nature of training, enterprises are moving to splitting their workloads: shifting inference to colocation or on-prem stacks, while leaving training in the cloud with spot instances.

This isn’t just theory – it’s a growing movement among engineering leaders trying to get AI into production without burning through their runway.

“We’ve helped teams shift to colocation for inference using dedicated GPU servers that they control. It’s not sexy, but it cuts monthly infra spend by 60-80%,” Khoury added. “Hybrid isn’t just cheaper – it’s smarter.”

In one case, he said, a SaaS company reduced its monthly AI infrastructure bill from roughly $42,000 to just $9,000 by moving inference workloads off the cloud. The switch paid for itself in under two weeks.

Another team, requiring consistent sub-50ms responses for an AI customer support tool, found that cloud-based inference latency was insufficient. Shifting inference closer to users via colocation not only solved the performance bottleneck but also significantly cut the cost.

The setup typically works like this: inference, which is always-on and latency-sensitive, runs on dedicated GPUs either on-premises or in a nearby data center (a colocation facility). Meanwhile, training, which is compute-intensive but sporadic, stays in the cloud, where powerful clusters can be spun up on demand, run for a few hours or days, and shut down.
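In code, the placement rule is almost trivially simple. The following sketch captures the routing logic described above; the endpoint names are invented for illustration, not any real platform’s API.

```python
# A minimal sketch of the hybrid split: always-on, latency-sensitive
# inference goes to dedicated/colo capacity; bursty training goes to
# on-demand cloud clusters. Endpoint names are invented for illustration.

from dataclasses import dataclass

@dataclass
class Workload:
    name: str
    kind: str                      # "inference" or "training"
    latency_sensitive: bool = False

def place(workload: Workload) -> str:
    if workload.kind == "inference" and workload.latency_sensitive:
        return "colo-gpu-pool"       # dedicated servers, predictable cost
    if workload.kind == "training":
        return "cloud-spot-cluster"  # spin up, run for hours/days, tear down
    return "cloud-on-demand"         # everything else stays flexible

print(place(Workload("support-bot", "inference", latency_sensitive=True)))
print(place(Workload("weekly-retrain", "training")))
```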

As a ballpark estimate, renting from hyperscale cloud providers can cost three to four times more per GPU-hour than working with smaller providers, with the difference even more significant compared with on-prem infrastructure.

The other big bonus? Predictability.

With on-prem or colocation stacks, teams also have full control over the amount of resources they want to provision or add for the expected baseline of inference workloads. This brings predictability to infrastructure costs – and eliminates surprise bills. It also cuts the aggressive engineering effort otherwise needed to tune scaling and keep cloud infrastructure costs within reason.

Hybrid setups also help reduce latency for time-sensitive AI applications and enable better compliance, especially for teams operating in highly regulated industries like finance, healthcare and education, where data residency and governance are non-negotiable.

Hybrid complexity is real – but rarely a dealbreaker

As has always been the case, the transition to a hybrid setup comes with its own ops tax. Setting up your own hardware or renting a colocation facility takes time, and managing GPUs outside the cloud requires a different kind of engineering muscle.

However, leaders argue that the complexity is often overstated and is usually manageable in-house or through external support, unless one is operating at an extreme scale.

“Our calculations show that an on-prem GPU server costs about the same as six to nine months of renting the equivalent instance from AWS, Azure or Google Cloud, even with a one-year reserved rate. Since the hardware typically lasts at least three years, and often more than five, it becomes cost-positive within the first nine months. Some hardware vendors also offer operational pricing models for capital infrastructure, so you can avoid the upfront payment if cash flow is a concern,” Sarin explained.
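That break-even claim is easy to sanity-check. The sketch below plugs in assumed figures – a hypothetical $7,000/month cloud instance, a server priced at the mid-point of the six-to-nine-month estimate, plus assumed power and colo overhead – and lands in the same range.

```python
# Sanity-checking the break-even claim with assumed numbers: a server
# priced at ~7.5 months of cloud rental, a conservative 3-year lifespan,
# and assumed power/colo overhead. None of these are quoted figures.

CLOUD_MONTHLY = 7_000              # assumed $/month, equivalent reserved instance
SERVER_COST = 7.5 * CLOUD_MONTHLY  # mid-point of the 6-9 month estimate
LIFESPAN_MONTHS = 36               # conservative 3-year hardware life
OPEX_MONTHLY = 800                 # assumed power, rack space and maintenance

on_prem_total = SERVER_COST + OPEX_MONTHLY * LIFESPAN_MONTHS
cloud_total = CLOUD_MONTHLY * LIFESPAN_MONTHS
break_even = SERVER_COST / (CLOUD_MONTHLY - OPEX_MONTHLY)

print(f"3-year on-prem total: ${on_prem_total:,.0f} vs cloud: ${cloud_total:,.0f}")
print(f"Break-even after ~{break_even:.1f} months")
```

With these assumptions, on-prem comes to about $81,300 over three years against $252,000 in the cloud, breaking even after roughly 8.5 months – squarely inside the quoted window.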

Prioritize according to need

For any company, whether a startup or an enterprise, the key to success when architecting AI infrastructure lies in designing it around the specific needs of each workload.

If you’re unsure of the load of different AI workloads, start with the cloud and keep a close eye on the associated costs by tagging every resource with the team responsible. You can share these cost reports with all managers and dig deep into what they are using and its impact on resources. This data will then give you clarity and help pave the way for driving efficiencies.
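As a minimal illustration of that tagging discipline, the sketch below aggregates billing line items by owning team. The records here are made up; in practice they would come from your cloud provider’s billing export.

```python
# Aggregate tagged billing line items by owning team so spend is visible
# before deciding what to move off the cloud. Records are made up here;
# in practice they come from the provider's billing export.

from collections import defaultdict

billing_records = [
    {"resource": "gpu-inference-1", "team": "support-ai", "usd": 14_200},
    {"resource": "gpu-training-1",  "team": "research",   "usd": 9_800},
    {"resource": "egress-us-eu",    "team": "support-ai", "usd": 3_100},
]

spend_by_team: dict[str, float] = defaultdict(float)
for record in billing_records:
    spend_by_team[record["team"]] += record["usd"]

for team, usd in sorted(spend_by_team.items(), key=lambda kv: -kv[1]):
    print(f"{team}: ${usd:,.0f}/month")
```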

That said, remember that this is not about ditching the cloud entirely; it is about optimizing your use of it to maximize efficiency.

“Cloud is still great for experimentation and bursty training. But if inference is your core workload, get off the treadmill. Hybrid isn’t just cheaper… It’s smarter,” Khoury added. “Treat cloud like a prototype, not a permanent home. Run the math. Talk to your engineers. The cloud will never tell you when it’s the wrong tool. But your AWS bill will.”
