Friday, March 20, 2026

Google Cloud Run uses Nvidia GPUs for serverless AI inference



There are various costs associated with running AI, and one of the most fundamental is providing the GPU computing power necessary for inference.

Until now, organizations that need to provide AI inference have had to run long-lived instances in the cloud or provision hardware on-premises. Today, Google Cloud announced a new approach that could change the landscape of AI application deployment: its Cloud Run serverless offering now integrates Nvidia L4 GPUs, enabling organizations to run serverless AI inference.

The promise of serverless is that a service runs only when needed, and users pay only for what is used. This contrasts with a typical cloud instance, which runs as a persistent service and is always available. With serverless, in this case, the inference GPU spins up only when needed.

Serverless inference can be run with Nvidia NIM, as well as other frameworks such as vLLM, PyTorch and Ollama. The addition of Nvidia L4 GPUs is currently in preview.

“As customers increasingly adopt AI, they are looking to run AI workloads like inference on platforms they are familiar with and run on,” Sagar Randive, Product Manager, Google Cloud Serverless, told VentureBeat. “Cloud Run users prefer the platform’s performance and agility and have been asking Google to add GPU support.”

Bringing AI to the Serverless World

Cloud Run, Google’s fully managed serverless platform, is popular among developers for its ability to simplify container deployment and management. However, the growing demands of AI workloads, especially those requiring real-time processing, have underscored the need for more robust compute resources.

Integrating GPU support opens up a wide range of use cases for Cloud Run developers, including:

  • Real-time inference using lightweight, open models such as Gemma 2B/7B or Llama 3 (8B), enabling responsive custom chatbots and instant document summarization tools.
  • Delivering customized, fine-tuned generative AI models, including brand-specific image generation applications, that can scale based on demand.
  • Accelerating compute-intensive services such as image recognition, video transcoding and 3D rendering, with the ability to scale to zero when not in use.
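As a rough sketch of what a GPU-backed deployment could look like during the preview, an invocation along the lines of the following attaches an L4 GPU to a Cloud Run service. The service name, image, region and resource values here are illustrative placeholders, and exact flags and minimum CPU/memory requirements may differ from what Google ultimately requires:

```shell
# Illustrative only: deploy a container to Cloud Run with one Nvidia L4 GPU.
# GPU support is in preview, so the beta command track is used here.
gcloud beta run deploy my-inference-service \
  --image us-docker.pkg.dev/my-project/my-repo/my-inference-image \
  --region us-central1 \
  --gpu 1 \
  --gpu-type nvidia-l4 \
  --cpu 4 \
  --memory 16Gi \
  --no-cpu-throttling
```

With scale-to-zero behavior, a service deployed this way incurs GPU charges only while it is actively handling requests.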

Serverless performance can scale to meet AI inference needs

A common concern with serverless is performance. After all, if a service isn’t always running, there is often a performance hit when bringing it up from a cold start.

Google Cloud aims to allay any concerns about performance, citing impressive metrics for the new GPU-enabled Cloud Run instances. According to Google, cold start times range from 11 to 35 seconds for various models, including Gemma 2B, Gemma 2 9B, Llama 2 7B/13B and Llama 3.1 8B, showing how responsive the platform can be.

Each Cloud Run instance can be equipped with a single Nvidia L4 GPU with up to 24GB of VRAM, providing a solid level of resources for many common AI inference tasks. Google Cloud also aims to be model agnostic in terms of what it can run, although it hedges its bets a bit.

“We don’t restrict any LLMs, users can run any models they want,” Randive said. “However, for best performance, it is recommended to run models under 13B parameters.”

Will it be cheaper to run serverless AI inference?

The key advantage of serverless technology is better hardware utilization, which should also translate into lower costs.

The question of whether it is actually cheaper for organizations to implement AI inference in a serverless or long-lived server-based model is a somewhat complicated one.

“It depends on the application and expected traffic pattern,” Randive said. “We’ll be updating our pricing calculator to reflect the new GPU pricing for Cloud Run, at which point customers will be able to compare their total operating costs across platforms.”
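The traffic-pattern trade-off can be sketched with a few lines of arithmetic: serverless billing accrues only while requests are being served, while a dedicated GPU instance bills around the clock. The hourly rates below are hypothetical placeholders, not actual Cloud Run or Compute Engine pricing:

```python
# Hedged sketch: break-even point between serverless GPU billing
# (pay only while handling requests) and an always-on GPU instance.
# All rates are made-up placeholders, not real Google Cloud prices.

def monthly_cost_serverless(busy_hours: float, rate_per_hour: float) -> float:
    """Cost when billed only for hours spent actively serving requests."""
    return busy_hours * rate_per_hour

def monthly_cost_dedicated(rate_per_hour: float, hours_in_month: float = 730.0) -> float:
    """Cost of a GPU instance that runs continuously all month."""
    return rate_per_hour * hours_in_month

def break_even_utilization(serverless_rate: float, dedicated_rate: float) -> float:
    """Fraction of the month busy above which an always-on instance is cheaper."""
    return dedicated_rate / serverless_rate

# Example with made-up rates: serverless GPU at $1.50/h, dedicated at $0.60/h.
u = break_even_utilization(1.50, 0.60)  # ≈ 0.4: dedicated wins above ~40% utilization
```

The sketch captures Randive’s point: a spiky, low-utilization workload favors serverless, while a consistently busy service may still be cheaper on a long-lived instance.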
