Source URL: https://cloud.google.com/blog/products/serverless/cloud-run-gpus-are-now-generally-available/
Source: Cloud Blog
Title: Cloud Run GPUs, now GA, makes running AI workloads easier for everyone
Feedly Summary: Developers love Cloud Run, Google Cloud’s serverless runtime, for its simplicity, flexibility, and scalability. And today, we’re thrilled to announce that NVIDIA GPU support for Cloud Run is now generally available, offering a powerful runtime for a variety of use cases that’s also remarkably cost-efficient.
Now, you can enjoy the following benefits across both GPUs and CPUs:
Pay-per-second billing: You are only charged for the GPU resources you consume, down to the second.
Scale to zero: Cloud Run automatically scales your GPU instances down to zero when no requests are received, eliminating idle costs. This is a game-changer for sporadic or unpredictable workloads.
Rapid startup and scaling: Go from zero to an instance with a GPU and drivers installed in under five seconds, allowing your applications to respond to demand very quickly. For example, when scaling from zero (a cold start), we achieved an impressive Time-to-First-Token of approximately 19 seconds for a gemma3:4b model, including startup time, model loading time, and running the inference.
Full streaming support: Build truly interactive applications with out-of-the box support for HTTP and WebSocket streaming, allowing you to provide LLM responses to your users as they are generated.
Support for GPUs in Cloud Run is a significant milestone, underscoring our leadership in making GPU-accelerated applications simpler, faster, and more cost-effective than ever before.
“Serverless GPU acceleration represents a major advancement in making cutting-edge AI computing more accessible. With seamless access to NVIDIA L4 GPUs, developers can now bring AI applications to production faster and more cost-effectively than ever before.” – Dave Salvator, director of accelerated computing products, NVIDIA
AI inference for everyone
One of the most exciting aspects of this GA release is that Cloud Run GPUs are now available to everyone for NVIDIA L4 GPUs, with no quota request required. This removes a significant barrier to entry, allowing you to immediately tap into GPU acceleration for your Cloud Run services. Simply pass --gpu 1 on the Cloud Run command line, or check the "GPU" checkbox in the console; there is no need to request quota:
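As a minimal sketch of a single-region GPU deployment (the service name is a placeholder, and the sample "hello" container stands in for your own image; check your gcloud version for the exact GPU flags):

```shell
# Deploy a Cloud Run service with one NVIDIA L4 GPU attached.
# No quota request is needed for L4 GPUs as of GA.
gcloud run deploy my-gpu-service \
  --image us-docker.pkg.dev/cloudrun/container/hello \
  --gpu 1 \
  --gpu-type nvidia-l4 \
  --region us-central1
```

The service scales to zero when idle, so you pay only while requests are being served.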
Production-ready
With general availability, Cloud Run with GPU support is now covered by Cloud Run’s Service Level Agreement (SLA), providing you with assurances for reliability and uptime. By default, Cloud Run offers zonal redundancy, helping to ensure enough capacity for your service to be resilient to a zonal outage; this also applies to Cloud Run with GPUs. Alternatively, you can turn off zonal redundancy and benefit from a lower price for best-effort failover of your GPU workloads in case of a zonal outage.
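If the lower price with best-effort failover fits your workload, the redundancy setting can be toggled at deploy time. A sketch, assuming the zonal-redundancy flag in current gcloud releases (verify the flag name against your SDK version):

```shell
# Opt out of GPU zonal redundancy in exchange for a lower price;
# failover to another zone becomes best-effort during a zonal outage.
gcloud run deploy my-gpu-service \
  --image ollama/ollama \
  --gpu 1 \
  --no-gpu-zonal-redundancy \
  --region us-central1
```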
Multi-regional GPUs
To support global applications, Cloud Run GPUs are available in five Google Cloud regions: us-central1 (Iowa, USA), europe-west1 (Belgium), europe-west4 (Netherlands), asia-southeast1 (Singapore), and asia-south1 (Mumbai, India), with more to come.
Cloud Run also simplifies deploying your services across multiple regions. For instance, you can deploy a service across the US, Europe, and Asia with a single command, providing global users with lower latency and higher availability. Here's how to deploy Ollama, one of the easiest ways to run open models, on Cloud Run across three regions:
```shell
gcloud run deploy my-global-service \
  --image ollama/ollama --port 11434 \
  --gpu 1 \
  --regions us-central1,europe-west1,asia-southeast1
```
See it in action: 0 to 100 NVIDIA GPUs in four minutes
You can witness the incredible scalability of Cloud Run with GPUs for yourself with this live demo from Google Cloud Next 25, showcasing how we scaled from 0 to 100 GPUs in just four minutes.
Load testing a Stable Diffusion service running on Cloud Run GPUs to 100 GPU instances in four minutes.
Unlock new use cases with NVIDIA GPUs on Cloud Run jobs
The power of Cloud Run with GPUs isn’t just for real-time inference using request-driven Cloud Run services. We’re also excited to announce the availability of GPUs on Cloud Run jobs, unlocking new use cases, particularly for batch processing and asynchronous tasks:
Model fine-tuning: Easily fine-tune a pre-trained model on specific datasets without having to manage the underlying infrastructure. Spin up a GPU-powered job, process your data, and scale down to zero when it’s complete.
Batch AI inferencing: Run large-scale batch inference tasks efficiently. Whether you’re analyzing images, processing natural language, or generating recommendations, Cloud Run jobs with GPUs can handle the load.
Batch media processing: Transcode videos, generate thumbnails, or perform complex image manipulations at scale.
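The batch use cases above can be sketched with a GPU-backed job. This assumes the jobs flags mirror the service flags, and the fine-tuning image name is purely hypothetical (substitute your own container):

```shell
# Create a Cloud Run job with one L4 GPU, e.g. for fine-tuning or batch inference.
# "my-registry/fine-tune:latest" is a placeholder for your own image.
gcloud run jobs create my-batch-job \
  --image my-registry/fine-tune:latest \
  --gpu 1 \
  --gpu-type nvidia-l4 \
  --region us-central1

# Execute on demand; resources are released when the job completes.
gcloud run jobs execute my-batch-job --region us-central1
```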
What Cloud Run customers are saying
Don’t just take our word for it. Here’s what some early adopters of Cloud Run GPUs are saying:
"Cloud Run helps vivo quickly iterate AI applications and greatly reduces our operation and maintenance costs. The automatically scalable GPU service also greatly improves the efficiency of our AI going overseas.” – Guangchao Li, AI Architect, vivo
"L4 GPUs offer really strong performance at a reasonable cost profile. Combined with the fast auto scaling, we were really able to optimize our costs and saw an 85% reduction in cost. We’ve been very excited about the availability of GPUs on Cloud Run." – John Gill at Next’25, Sr. Software Engineer, Wayfair
"At Midjourney, we have found Cloud Run GPUs to be incredibly valuable for our image processing tasks. Cloud Run has a simple developer experience that lets us focus more on innovation and less on infrastructure management. Cloud Run GPU’s scalability also lets us easily analyze and process millions of images." – Sam Schickler, Data Team Lead, Midjourney
Get started today
Cloud Run with GPU is ready to power your next generation of applications. Dive into the documentation, explore our quickstarts, and review our best practices for optimizing model loading. We can’t wait to see what you build!
AI Summary and Description: Yes
Summary: The announcement details the general availability of NVIDIA GPU support for Google Cloud’s serverless runtime, Cloud Run. This enhancement allows developers to leverage powerful GPU capabilities for various applications while optimizing costs, simplifying deployment, and ensuring scalability.
Detailed Description:
The integration of NVIDIA GPU support in Google Cloud Run represents a transformative advancement for developers in AI, significantly enhancing their ability to deploy and manage applications efficiently.
**Key Points:**
– **Cost Efficiency**:
– Pay-per-second billing allows users to only pay for the GPU resources they consume.
– Automatic scaling to zero when no requests are received, eliminating idle costs for sporadic workloads.
– **Performance and Speed**:
– Rapid startup and scaling: From zero to an instance with a GPU in under 5 seconds.
– Impressive Time-to-First-Token for AI models, showcasing the system’s quick responsiveness.
– **Enhanced Functionality**:
– Full streaming support: Enables interactive applications through HTTP and WebSocket streaming.
– Full availability of NVIDIA L4 GPUs with no quota requests required, lowering the entry barrier for developers.
– **Reliability and Redundancy**:
– Production readiness backed by a Service Level Agreement (SLA) that assures reliability and uptime.
– Options for zonal redundancy or best-effort failover to manage costs during outages.
– **Global Usability**:
– Available across multiple Google Cloud regions including the US, Europe, and Asia.
– One-command deployment simplifies the process of serving global users with lower latency.
– **New Use Cases**:
– Supports batch processing and asynchronous tasks such as model fine-tuning, batch AI inferencing, and media processing.
– **Early Adoption Feedback**:
– Positive testimonials from users like vivo, Wayfair, and Midjourney highlight the operational efficiencies and cost savings achieved using Cloud Run GPUs.
– **Getting Started**:
– Encouragement for developers to explore Cloud Run with GPU features through documentation and quickstarts.
Overall, the introduction of GPU support in Cloud Run not only streamlines the process of deploying AI applications but also enhances their performance capabilities, making advanced AI workloads more accessible and cost-effective for a wider range of developers.