Source URL: https://cloud.google.com/blog/products/ai-machine-learning/serverless-ai-with-gemma-3-on-cloud-run/
Source: Cloud Blog
Title: How to deploy serverless AI with Gemma 3 on Cloud Run
Feedly Summary: Today, we introduced Gemma 3, a family of lightweight, open models built with the cutting-edge technology behind Gemini 2.0. The Gemma 3 family of models has been designed for speed and portability, empowering developers to build sophisticated AI applications at scale. Combined with Cloud Run, deploying serverless workloads with AI models has never been easier.
In this post, we’ll explore the capabilities of Gemma 3 and how you can run it on Cloud Run.
Gemma 3: Power and efficiency for Cloud deployments
Gemma 3 is engineered for exceptional performance with lower memory footprints, making it ideal for cost-effective inference workloads.
Built with the world’s best single-accelerator model: Gemma 3 delivers optimal performance for its size, outperforming Llama-405B, DeepSeek-V3 and o3-mini in preliminary human preference evaluations on LMArena’s leaderboard. This helps you create engaging user experiences with a model that fits on a single GPU or TPU.
Create AI with advanced text and visual reasoning capabilities: Easily build applications that analyze images, text and short videos, opening up possibilities for interactive applications.
Handle complex tasks with a large context window: Gemma 3 offers a 128k-token context window to let your applications process and understand vast amounts of information — even entire novels — enabling more sophisticated AI capabilities.
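To make the 128k-token window concrete, here is a minimal sketch of budgeting long input against that window before sending it for inference. The ~4-characters-per-token ratio and the reserved-token count are rough assumptions for illustration, not the real Gemma tokenizer:

```python
# Rough sketch: split a long document into chunks that each fit
# Gemma 3's 128k-token context window. Token counts are estimated
# with a crude ~4-characters-per-token heuristic (an assumption,
# not the actual Gemma tokenizer).

CONTEXT_WINDOW = 128_000   # Gemma 3 context window, in tokens
CHARS_PER_TOKEN = 4        # rough heuristic for English text

def chunk_for_context(text: str, reserve_tokens: int = 2_000) -> list[str]:
    """Split `text` into pieces that each fit the context window,
    reserving `reserve_tokens` for the prompt template and response."""
    budget_chars = (CONTEXT_WINDOW - reserve_tokens) * CHARS_PER_TOKEN
    return [text[i:i + budget_chars] for i in range(0, len(text), budget_chars)]
```

In practice you would count tokens with the model's real tokenizer, but the same budgeting pattern applies.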
Serverless inference with Gemma 3 and Cloud Run
Gemma 3 is a great fit for inference workloads on Cloud Run using NVIDIA L4 GPUs. Cloud Run is Google Cloud’s fully managed serverless platform, helping developers leverage container runtimes without having to concern themselves with the underlying infrastructure. Models scale to zero when inactive and scale dynamically with demand, which optimizes both cost and performance: you only pay for what you use.
For example, you could host an LLM on one Cloud Run service and a chat agent on another, enabling independent scaling and management. And with GPU acceleration, a Cloud Run service can be ready with the first AI inference results in under 30 seconds, with only 5 seconds to start an instance. This rapid deployment ensures that your applications deliver responsive user experiences. We also reduced the GPU price in Cloud Run down to ~$0.6/hr. And of course, if your service isn’t receiving requests, it will scale down to zero.
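The deployment described above boils down to a single `gcloud run deploy` invocation with GPU flags. The sketch below assembles that command as an argument list; the service name, container image, region, and instance limit are illustrative assumptions, not official defaults:

```python
# Sketch of a `gcloud run deploy` command for a GPU-backed Gemma 3
# service. Image path, region, and max-instances are hypothetical
# placeholders; adjust them for your project.

def gcloud_run_deploy_args(service: str, image: str,
                           region: str = "us-central1") -> list[str]:
    """Build the argv for deploying a Cloud Run service with one L4 GPU."""
    return [
        "gcloud", "beta", "run", "deploy", service,
        "--image", image,
        "--region", region,
        "--gpu", "1",                 # one GPU per instance
        "--gpu-type", "nvidia-l4",    # L4 GPUs, as discussed above
        "--no-cpu-throttling",        # required for GPU workloads
        "--max-instances", "1",       # illustrative cap for a demo
    ]

args = gcloud_run_deploy_args("ollama-gemma", "us-docker.pkg.dev/my-project/demo/ollama-gemma")
```

Passing the list to `subprocess.run(args)` would perform the actual deployment; here it is shown unexecuted.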
Get started today
Cloud Run and Gemma 3 combine to create a powerful, cost-effective, and scalable solution for deploying advanced AI applications. Gemma 3 is supported by a variety of tools and frameworks, such as Hugging Face Transformers, Ollama, and vLLM.
To get started, visit this guide, which shows you how to build a service with Gemma 3 on Cloud Run using Ollama.
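Once an Ollama-backed service is deployed, clients talk to it over Ollama’s HTTP API. The sketch below builds a request against the `/api/generate` endpoint; the service URL is a hypothetical placeholder for your own Cloud Run deployment, and the `gemma3:4b` model tag is an assumption about which Gemma 3 size you pulled into Ollama:

```python
import json
import urllib.request

# Hypothetical Cloud Run service URL; substitute your deployment's URL.
SERVICE_URL = "https://ollama-gemma-example-uc.a.run.app"

def build_generate_request(prompt: str,
                           model: str = "gemma3:4b") -> urllib.request.Request:
    """Build a POST request for Ollama's /api/generate endpoint."""
    body = json.dumps({
        "model": model,     # Ollama model tag (assumed pulled on the server)
        "prompt": prompt,
        "stream": False,    # return one complete response instead of chunks
    }).encode("utf-8")
    return urllib.request.Request(
        f"{SERVICE_URL}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending the request with `urllib.request.urlopen(...)` (plus an identity token if the service requires authentication) returns the model’s completion as JSON.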
AI Summary and Description: Yes
Summary: The introduction of Gemma 3, a family of lightweight AI models optimized for performance and portability, signifies a notable advancement in AI application development, particularly when deployed via Google’s Cloud Run. The integration of serverless architecture enhances efficiency and responsiveness, making it easier for developers to manage and scale AI workloads.
Detailed Description:
Gemma 3 represents a significant step forward in the deployment and efficiency of AI models, particularly tailored for cloud environments. Key points regarding Gemma 3 include:
– **Performance and Portability**:
– Built on the cutting-edge technology of Gemini 2.0, Gemma 3 offers lightweight model architecture, making it suitable for a variety of AI applications that require rapid deployment and low infrastructure overhead.
– Designed for efficient inference workloads, Gemma 3 has a lower memory footprint yet is optimized for performance, making it economically viable for developers seeking to minimize costs.
– **Comparative Advantage**:
– In performance evaluations, Gemma 3 has shown superior capabilities compared to established models like Llama-405B and DeepSeek-V3. This is crucial for developers looking for out-of-the-box solutions for high-performance AI tasks.
– **Advanced AI Functionality**:
– Gemma 3 facilitates the creation of applications that leverage both text and visual reasoning, allowing developers to build interactive applications that can analyze diverse data formats like images, text, and video.
– The model supports a substantial 128k-token context window, enabling applications to comprehend extensive datasets, further enhancing their sophistication and utility.
– **Serverless Deployment with Cloud Run**:
– Gemma 3 is compatible with Google Cloud’s Cloud Run platform, a fully managed serverless solution. This allows developers to deploy models without the complexity of managing underlying infrastructures.
– Features include the ability for models to scale down to zero when not in use, ensuring cost efficiency, as users are billed only for actual usage.
– The rapid response times of under 30 seconds for AI inference results significantly enhance user experience, critical for applications requiring immediacy.
– **Pricing and Accessibility**:
– The competitive pricing structure for GPU usage on Cloud Run, now around ~$0.6/hr, lowers the barrier for entry for developers wanting to experiment or deploy using AI models.
– **Integration and Support**:
– Gemma 3 is supported by popular machine learning libraries and frameworks such as Hugging Face Transformers, Ollama, and vLLM, which eases integration into existing workflows.
Overall, Gemma 3’s combination of lightweight architecture, advanced capabilities, and cloud deployment aligns well with the growing demand for scalable and cost-effective AI solutions, making it a significant development for practitioners in AI, cloud computing, and infrastructure security.