Friday, March 20, 2026

Google Cloud Run uses Nvidia GPUs for serverless AI inference



There are various costs associated with running AI, and one of the most fundamental is providing the GPU computing power necessary for inference.

Until now, organizations that need to provide AI inference have had to run long-lived instances in the cloud or provision hardware on-premises. Today, Google Cloud announced a new approach that could change the landscape of AI application deployment: its Cloud Run serverless offering now integrates Nvidia L4 GPUs, enabling organizations to run serverless AI inference.

The promise of serverless is that a service runs only when needed, and users pay only for what is used. This contrasts with a typical cloud instance, which runs as a persistent service and is always available. With serverless, in this case, the inference GPU spins up only when needed.

Serverless inference can be run with Nvidia NIM, as well as other frameworks such as vLLM, PyTorch and Ollama. The addition of Nvidia L4 GPUs is currently in preview.

“As customers increasingly adopt AI, they are looking to run AI workloads like inference on platforms they are familiar with and run on,” Sagar Randive, Product Manager, Google Cloud Serverless, told VentureBeat. “Cloud Run users prefer the platform’s performance and agility and have been asking Google to add GPU support.”

Bringing AI to the Serverless World

Cloud Run, Google’s fully managed serverless platform, is popular among developers for its ability to simplify container deployment and management. However, the growing demands of AI workloads, especially those requiring real-time processing, have underscored the need for more robust compute resources.

Integrating GPU support opens up a wide range of use cases for Cloud Run developers, including:

  • Real-time inference using lightweight, open models such as Gemma 2B/7B or Llama 3 (8B), enabling responsive custom chatbots and instant document summarization tools.
  • Delivering customized, fine-tuned generative AI models, including brand-specific image generation applications, that can scale based on demand.
  • Accelerating compute-intensive services such as image recognition, video transcoding and 3D rendering, with the ability to scale to zero when not in use.
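As a rough sketch of what a GPU-backed deployment could look like during the preview, an invocation along the lines of the following attaches an L4 GPU to a Cloud Run service. The service name, image, region and resource values here are illustrative placeholders, and exact flags and minimum CPU/memory requirements may differ from what Google ultimately requires:

```shell
# Illustrative only: deploy a container to Cloud Run with one Nvidia L4 GPU.
# GPU support is in preview, so the beta command track is used here.
gcloud beta run deploy my-inference-service \
  --image us-docker.pkg.dev/my-project/my-repo/my-inference-image \
  --region us-central1 \
  --gpu 1 \
  --gpu-type nvidia-l4 \
  --cpu 4 \
  --memory 16Gi \
  --no-cpu-throttling
```

With scale-to-zero behavior, a service deployed this way incurs GPU charges only while it is actively handling requests.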

Serverless performance can scale to meet AI inference needs

A common concern with serverless is performance. After all, if a service isn’t always running, there is often a performance hit when bringing it up from a cold start.

Google Cloud aims to allay any concerns about performance, citing impressive metrics for the new GPU-enabled Cloud Run instances. According to Google, cold start times range from 11 to 35 seconds for various models, including Gemma 2B, Gemma 2 9B, Llama 2 7B/13B and Llama 3.1 8B, showing how responsive the platform can be.

Each Cloud Run instance can be equipped with a single Nvidia L4 GPU with up to 24GB of VRAM, providing a solid level of resources for many common AI inference tasks. Google Cloud also aims to be model agnostic in terms of what it can run, although it hedges its bets a bit.

“We don’t restrict any LLMs, users can run any models they want,” Randive said. “However, for best performance, it is recommended to run models under 13B parameters.”

Will it be cheaper to run serverless AI inference?

The key advantage of serverless technology is better hardware utilization, which should also translate into lower costs.

The question of whether it is actually cheaper for organizations to implement AI inference in a serverless or long-lived server-based model is a somewhat complicated one.

“It depends on the application and expected traffic pattern,” Randive said. “We’ll be updating our pricing calculator to reflect the new GPU pricing for Cloud Run, at which point customers will be able to compare their total operating costs across platforms.”
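The traffic-pattern trade-off can be sketched with a few lines of arithmetic: serverless billing accrues only while requests are being served, while a dedicated GPU instance bills around the clock. The hourly rates below are hypothetical placeholders, not actual Cloud Run or Compute Engine pricing:

```python
# Hedged sketch: break-even point between serverless GPU billing
# (pay only while handling requests) and an always-on GPU instance.
# All rates are made-up placeholders, not real Google Cloud prices.

def monthly_cost_serverless(busy_hours: float, rate_per_hour: float) -> float:
    """Cost when billed only for hours spent actively serving requests."""
    return busy_hours * rate_per_hour

def monthly_cost_dedicated(rate_per_hour: float, hours_in_month: float = 730.0) -> float:
    """Cost of a GPU instance that runs continuously all month."""
    return rate_per_hour * hours_in_month

def break_even_utilization(serverless_rate: float, dedicated_rate: float) -> float:
    """Fraction of the month busy above which an always-on instance is cheaper."""
    return dedicated_rate / serverless_rate

# Example with made-up rates: serverless GPU at $1.50/h, dedicated at $0.60/h.
u = break_even_utilization(1.50, 0.60)  # ≈ 0.4: dedicated wins above ~40% utilization
```

The sketch captures Randive’s point: a spiky, low-utilization workload favors serverless, while a consistently busy service may still be cheaper on a long-lived instance.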
