Wednesday, March 11, 2026

Google Cloud takes on CoreWeave and AWS with managed Slurm for enterprise-scale AI training


Some businesses may be best served by tailoring giant existing models to their needs, but many companies plan to build their own models, a project that requires access to GPUs.

With a new service, Vertex AI Training, Google Cloud wants to play a greater role in the creation of enterprise models. The service gives enterprises looking to train their own models access to a managed Slurm environment, data science tooling, and everything else needed to train models at scale.

With this new service, Google Cloud hopes to lure more enterprises away from other providers and encourage businesses to build more customized AI models.

While Google Cloud has always offered the ability to customize its Gemini models, the new service lets customers bring their own models or customize any open-source model that Google Cloud hosts.

Vertex AI Training puts Google Cloud in direct competition with companies such as CoreWeave and Lambda Labs, as well as its cloud rivals AWS and Microsoft Azure.

Jaime de Guerre, senior director of product management at Google Cloud, told VentureBeat that the company has heard from organizations of all sizes that they need a way to better optimize compute in a more reliable environment.

“We are seeing a growing number of companies building or adapting generative AI models to launch product offerings built around those models or to support their business in some way,” de Guerre said. “This includes AI startups, technology companies, sovereign organizations building a model for a specific region, culture or language, as well as some large enterprises that may incorporate it into internal processes.”

De Guerre noted that while technically anyone can use the service, Google’s focus is on companies planning to train models at scale, rather than on simple fine-tuning or LoRA implementations. Vertex AI Training will focus on long-running training jobs involving hundreds or even thousands of chips. Pricing will depend on the amount of compute the enterprise needs.

“Vertex AI Training is not about adding more information to the context or using RAG; it is about training a model where you can start with completely random weights,” he said.
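As a toy illustration of the distinction de Guerre draws, the sketch below (illustrative Python, not Vertex AI code; all function names are hypothetical) contrasts pretraining from random weights with fine-tuning an existing model and with RAG, which leaves the weights untouched:

```python
import numpy as np

# Toy sketch of three ways to "customize" a model. The "model" here is
# just a single weight matrix, which is enough to show the difference.

rng = np.random.default_rng(42)

def pretrain_from_scratch(dim):
    """Full pretraining: weights begin as completely random values."""
    return rng.normal(0.0, 0.02, size=(dim, dim))

def fine_tune(pretrained, grad, lr=0.1):
    """Fine-tuning: start from existing weights and nudge them slightly."""
    return pretrained - lr * grad

def rag_answer(query, documents):
    """RAG: the model is unchanged; retrieved text is added to the prompt."""
    retrieved = [d for d in documents if query.lower() in d.lower()]
    return {"prompt_context": retrieved, "weights_modified": False}
```

Vertex AI Training targets the first case, where nothing is reused and the full cost of training must be paid.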

Model customization is becoming increasingly popular

Enterprises are seeing the value of building custom models that go beyond fine-tuning an LLM or using retrieval-augmented generation (RAG). Custom models know more detailed information about the company and provide answers specific to the organization. Companies like Arcee.ai have started offering model customization to their customers. Adobe recently announced a new service that lets businesses retrain Firefly for their specific needs. Organizations like FICO, which builds small language models specific to the financial industry, often purchase GPUs for training at significant cost.

Google Cloud says Vertex AI Training is distinguished by access to a larger set of chips, training monitoring and management services, and expertise gained from training Gemini models.

Some of Vertex AI Training’s early customers include AI Singapore, a consortium of Singaporean research institutes and startups that built the 27-billion-parameter SEA-LION v4, and Salesforce’s AI research team.

Companies often have to choose between adapting an already-built LLM and building their own model. Creating an LLM from scratch is usually out of reach for smaller companies, or simply doesn’t make sense for some use cases. But for organizations where a fully custom or ground-up model does make sense, the challenge is gaining access to the GPUs needed to run the training.

Training models can be costly

Model training, de Guerre said, can be complex and costly, especially when organizations are competing with several others for GPU capacity.

Hyperscalers like AWS, Microsoft and, of course, Google claim that their massive data centers, with rack upon rack of high-end chips, offer the most value to enterprises. Customers not only get access to costly GPUs; cloud providers also often offer full-stack services to help enterprises move to production.

Services like CoreWeave rose to prominence by offering on-demand access to Nvidia H100s, giving customers flexibility in compute when building models or applications. This also gave rise to a business model in which companies that own GPUs rent out server space.

De Guerre said Vertex AI Training isn’t just about offering raw compute for model training. When a company rents a bare GPU server, it also has to bring its own training software and manage uptime and failures itself.

“It’s a managed Slurm environment that will help you schedule all your jobs and automatically restart failed jobs,” de Guerre said. “So if a training job slows down or stops due to a hardware failure, training will resume very quickly from the automatic checkpoints we take as part of checkpoint management, with very little downtime.”

He added that this delivers higher throughput and more efficient training on large-scale compute clusters.
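The behavior de Guerre describes maps onto standard Slurm features. Below is a minimal sketch of such a job script, assuming a generic Slurm cluster; the node counts, paths, and `train.py` entry point are hypothetical, and the exact mechanics on Vertex AI will differ:

```shell
#!/bin/bash
# Hypothetical Slurm job script illustrating automatic requeue on
# failure plus resume from the latest checkpoint. Not Vertex AI code.
#SBATCH --job-name=pretrain
#SBATCH --nodes=64
#SBATCH --gpus-per-node=8
#SBATCH --requeue              # requeue the job if a node fails
#SBATCH --time=7-00:00:00

CKPT_DIR=/shared/checkpoints/pretrain
mkdir -p "$CKPT_DIR"

# Resume from the newest checkpoint if one exists; otherwise the
# trainer starts from randomly initialized weights.
LATEST=$(ls -t "$CKPT_DIR"/*.pt 2>/dev/null | head -n 1)
RESUME_ARGS=()
if [ -n "$LATEST" ]; then
    RESUME_ARGS=(--resume "$LATEST")
fi

srun python train.py \
    --config config.yaml \
    --checkpoint-dir "$CKPT_DIR" \
    "${RESUME_ARGS[@]}"
```

A managed offering handles the pieces this script leaves implicit: writing the checkpoints on a schedule, detecting the hardware failure, and draining and replacing the bad node before the requeued job restarts.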

Services like Vertex AI Training can make it easier for enterprises to build niche models or completely adapt existing models. However, just because an option exists doesn’t mean it’s right for every business.
