Source URL: https://cloud.google.com/blog/products/ai-machine-learning/serverless-ai-with-gemma-3-on-cloud-run/
Source: Cloud Blog
Title: How to deploy serverless AI with Gemma 3 on Cloud Run
Feedly Summary: Today, we introduced Gemma 3, a family of lightweight, open models built with the cutting-edge technology behind Gemini 2.0. The Gemma 3 family of models has been designed for speed and portability, empowering developers to build sophisticated AI applications at scale. Combined with Cloud Run, deploying serverless workloads with AI models has never been easier.
In this post, we’ll explore the capabilities of Gemma 3 and how you can run it on Cloud Run.
Gemma 3: Power and efficiency for Cloud deployments
Gemma 3 is engineered for exceptional performance with lower memory footprints, making it ideal for cost-effective inference workloads.
Built with the world’s best single-accelerator model: Gemma 3 delivers optimal performance for its size, outperforming Llama-405B, DeepSeek-V3 and o3-mini in preliminary human preference evaluations on LMArena’s leaderboard. This helps you create engaging user experiences with a model that fits on a single GPU or TPU.
Create AI with advanced text and visual reasoning capabilities: Easily build applications that analyze images, text and short videos, opening up possibilities for interactive applications.
Handle complex tasks with a large context window: Gemma 3 offers a 128k-token context window to let your applications process and understand vast amounts of information — even entire novels — enabling more sophisticated AI capabilities.
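To make the 128k-token window concrete, here is a minimal sketch of budgeting long input against that window before sending it for inference. The ~4-characters-per-token ratio and the reserved-token count are rough assumptions for illustration, not the real Gemma tokenizer:

```python
# Rough sketch: split a long document into chunks that each fit
# Gemma 3's 128k-token context window. Token counts are estimated
# with a crude ~4-characters-per-token heuristic (an assumption,
# not the actual Gemma tokenizer).

CONTEXT_WINDOW = 128_000   # Gemma 3 context window, in tokens
CHARS_PER_TOKEN = 4        # rough heuristic for English text

def chunk_for_context(text: str, reserve_tokens: int = 2_000) -> list[str]:
    """Split `text` into pieces that each fit the context window,
    reserving `reserve_tokens` for the prompt template and response."""
    budget_chars = (CONTEXT_WINDOW - reserve_tokens) * CHARS_PER_TOKEN
    return [text[i:i + budget_chars] for i in range(0, len(text), budget_chars)]
```

In practice you would count tokens with the model's real tokenizer, but the same budgeting pattern applies.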
Serverless inference with Gemma 3 and Cloud Run
Gemma 3 is a great fit for inference workloads on Cloud Run using NVIDIA L4 GPUs. Cloud Run is Google Cloud’s fully managed serverless platform, helping developers leverage container runtimes without having to concern themselves with the underlying infrastructure. Models scale to zero when inactive and scale dynamically with demand, which optimizes both cost and performance: you only pay for what you use.
For example, you could host an LLM on one Cloud Run service and a chat agent on another, enabling independent scaling and management. And with GPU acceleration, a Cloud Run service can be ready with the first AI inference results in under 30 seconds, with only 5 seconds to start an instance. This rapid deployment ensures that your applications deliver responsive user experiences. We also reduced the GPU price in Cloud Run down to ~$0.6/hr. And of course, if your service isn’t receiving requests, it will scale down to zero.
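The deployment described above boils down to a single `gcloud run deploy` invocation with GPU flags. The sketch below assembles that command as an argument list; the service name, container image, region, and instance limit are illustrative assumptions, not official defaults:

```python
# Sketch of a `gcloud run deploy` command for a GPU-backed Gemma 3
# service. Image path, region, and max-instances are hypothetical
# placeholders; adjust them for your project.

def gcloud_run_deploy_args(service: str, image: str,
                           region: str = "us-central1") -> list[str]:
    """Build the argv for deploying a Cloud Run service with one L4 GPU."""
    return [
        "gcloud", "beta", "run", "deploy", service,
        "--image", image,
        "--region", region,
        "--gpu", "1",                 # one GPU per instance
        "--gpu-type", "nvidia-l4",    # L4 GPUs, as discussed above
        "--no-cpu-throttling",        # required for GPU workloads
        "--max-instances", "1",       # illustrative cap for a demo
    ]

args = gcloud_run_deploy_args("ollama-gemma", "us-docker.pkg.dev/my-project/demo/ollama-gemma")
```

Passing the list to `subprocess.run(args)` would perform the actual deployment; here it is shown unexecuted.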
Get started today
Cloud Run and Gemma 3 combine to create a powerful, cost-effective, and scalable solution for deploying advanced AI applications. Gemma 3 is supported by a variety of tools and frameworks, such as Hugging Face Transformers, Ollama, and vLLM.
To get started, visit this guide, which shows you how to build a service with Gemma 3 on Cloud Run using Ollama.
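Once an Ollama-backed service is deployed, clients talk to it over Ollama’s HTTP API. The sketch below builds a request against the `/api/generate` endpoint; the service URL is a hypothetical placeholder for your own Cloud Run deployment, and the `gemma3:4b` model tag is an assumption about which Gemma 3 size you pulled into Ollama:

```python
import json
import urllib.request

# Hypothetical Cloud Run service URL; substitute your deployment's URL.
SERVICE_URL = "https://ollama-gemma-example-uc.a.run.app"

def build_generate_request(prompt: str,
                           model: str = "gemma3:4b") -> urllib.request.Request:
    """Build a POST request for Ollama's /api/generate endpoint."""
    body = json.dumps({
        "model": model,     # Ollama model tag (assumed pulled on the server)
        "prompt": prompt,
        "stream": False,    # return one complete response instead of chunks
    }).encode("utf-8")
    return urllib.request.Request(
        f"{SERVICE_URL}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending the request with `urllib.request.urlopen(...)` (plus an identity token if the service requires authentication) returns the model’s completion as JSON.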
AI Summary and Description: Yes
Summary: The introduction of Gemma 3, a family of lightweight AI models optimized for performance and portability, signifies a notable advancement in AI application development, particularly when deployed via Google’s Cloud Run. The integration of serverless architecture enhances efficiency and responsiveness, making it easier for developers to manage and scale AI workloads.
Detailed Description:
Gemma 3 represents a significant step forward in the deployment and efficiency of AI models, particularly tailored for cloud environments. Key points regarding Gemma 3 include:
– **Performance and Portability**:
– Built on the cutting-edge technology of Gemini 2.0, Gemma 3 offers lightweight model architecture, making it suitable for a variety of AI applications that require rapid deployment and low infrastructure overhead.
– Designed for efficient inference workloads, Gemma 3 has a lower memory footprint yet is optimized for performance, making it economically viable for developers seeking to minimize costs.
– **Comparative Advantage**:
– In performance evaluations, Gemma 3 has shown superior capabilities compared to established models like Llama-405B and DeepSeek-V3. This is crucial for developers looking for out-of-the-box solutions for high-performance AI tasks.
– **Advanced AI Functionality**:
– Gemma 3 facilitates the creation of applications that leverage both text and visual reasoning, allowing developers to build interactive applications that can analyze diverse data formats like images, text, and video.
– The model supports a substantial 128k-token context window, enabling applications to comprehend extensive datasets, further enhancing their sophistication and utility.
– **Serverless Deployment with Cloud Run**:
– Gemma 3 is compatible with Google Cloud’s Cloud Run platform, a fully managed serverless solution. This allows developers to deploy models without the complexity of managing underlying infrastructures.
– Features include the ability for models to scale down to zero when not in use, ensuring cost efficiency, as users are billed only for actual usage.
– The rapid response times of under 30 seconds for AI inference results significantly enhance user experience, critical for applications requiring immediacy.
– **Pricing and Accessibility**:
– The competitive pricing structure for GPU usage on Cloud Run, now around ~$0.6/hr, lowers the barrier for entry for developers wanting to experiment or deploy using AI models.
– **Integration and Support**:
– Gemma 3 is supported by popular machine learning libraries and frameworks such as Hugging Face Transformers, Ollama, and vLLM, which eases integration into existing workflows.
Overall, Gemma 3’s combination of lightweight architecture, advanced capabilities, and cloud deployment aligns well with the growing demand for scalable and cost-effective AI solutions, making it a significant development for practitioners in AI, cloud computing, and infrastructure security.