Cloud Blog: Unlock Inference-as-a-Service with Cloud Run and Vertex AI

Source URL: https://cloud.google.com/blog/products/ai-machine-learning/improve-your-gen-ai-app-velocity-with-inference-as-a-service/
Source: Cloud Blog
Title: Unlock Inference-as-a-Service with Cloud Run and Vertex AI

Feedly Summary: It’s no secret that large language models (LLMs) and generative AI have become a key part of the application landscape. But most foundational LLMs are consumed as a service, meaning they’re hosted and served by a third party and accessed via APIs. Ultimately, this reliance on external APIs creates bottlenecks for developers.
There are many proven ways to host applications. Until recently, the same couldn’t be said of the LLMs those applications depend on. To improve velocity, developers can consider an approach called Inference-as-a-Service. Let’s explore how this approach can drive your LLM-powered applications.
What is Inference-as-a-Service?
When it comes to the cloud, everything is a service. For example, rather than buying physical servers to host your applications and databases, you consume compute and storage from a cloud provider as a metered service. The key word here is “metered”: as an end user, you pay only for the compute time and storage you use. Phrases such as “Software-as-a-Service”, “Platform-as-a-Service”, and “Functions-as-a-Service” have been in the cloud glossary for over a decade.
With “Inference-as-a-Service”, an enterprise application interfaces with a machine learning model (in this case, an LLM) with low operational overhead. This means you can run the code that talks to the LLM without focusing on infrastructure.
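To make the pattern concrete, here is a minimal sketch of what consuming inference as a metered service looks like from application code. The endpoint URL, request shape, and auth header are hypothetical placeholders rather than any specific provider’s API.

```python
# Minimal sketch of the Inference-as-a-Service pattern: application code sends
# a prompt to a hosted model endpoint and consumes the response. The endpoint,
# payload shape, and auth scheme here are hypothetical placeholders.
import os
import requests

ENDPOINT = os.environ["INFERENCE_ENDPOINT"]   # hosted LLM endpoint (placeholder)
TOKEN = os.environ["INFERENCE_API_TOKEN"]     # credential issued by the provider

def generate(prompt: str) -> str:
    """Send a prompt to the hosted model and return the generated text."""
    resp = requests.post(
        ENDPOINT,
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"prompt": prompt, "max_output_tokens": 256},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["text"]
```

The application owns only this call; provisioning, scaling, and metering of the model stay with the provider.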
Why Cloud Run for Inference-as-a-Service
Cloud Run is Google Cloud’s serverless container platform. In short, it lets developers run container workloads without having to concern themselves with the underlying infrastructure. Historically, serverless has centered on functions; Cloud Run extends that model to any container. This is why Cloud Run is a good fit for driving your LLM-powered applications: you only pay while the service is running.
There are many ways to use Cloud Run to run inference with LLMs. Today, we’ll explore how to host open LLMs on Cloud Run with GPUs.
First, get familiar with Vertex AI. Vertex AI is Google Cloud’s all-in-one AI/ML platform that offers the primitives required for an enterprise to train and serve ML models. In Vertex AI, you can access Model Garden, which offers over 160 foundation models including first-party models (Gemini), third-party, and open source models. 
To run inference with Vertex AI, first enable the Gemini API. You can use Vertex AI’s standard or express mode for inference. Then, by adding the right Google Cloud credentials to your application, you can deploy the application as a container on Cloud Run and it will seamlessly run inference against Vertex AI. You can try this yourself with this GitHub sample.
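As a rough sketch of the application side (not the linked GitHub sample itself), the code can be as small as the following, assuming the Vertex AI Python SDK (google-cloud-aiplatform) and a Cloud Run service account with Vertex AI access; the model name and region are illustrative.

```python
# Sketch: calling Gemini through Vertex AI from code running on Cloud Run.
# On Cloud Run, Application Default Credentials come from the service's
# service account, so no key file is needed. Model name and region are
# illustrative; check Model Garden for what is available to your project.
import os
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(
    project=os.environ["GOOGLE_CLOUD_PROJECT"],
    location=os.environ.get("GOOGLE_CLOUD_REGION", "us-central1"),
)

model = GenerativeModel("gemini-1.5-flash")

def answer(prompt: str) -> str:
    """Run a single inference call against the managed Vertex AI endpoint."""
    response = model.generate_content(prompt)
    return response.text
```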
While Vertex AI provides managed inference endpoints, Google Cloud also offers a new level of flexibility with GPUs for Cloud Run. This fundamentally shifts the inference paradigm. Why? Because instead of relying solely on Vertex AI’s infrastructure, you can now containerize your LLM (or other models) and deploy them directly to Cloud Run.
This means you’re not just building a serverless layer around an LLM, but you’re hosting the LLM itself on a serverless architecture. Models scale to zero when inactive, and scale dynamically with demand, optimizing costs and performance. For example, you could host an LLM on one Cloud Run service and a chat agent on another, enabling independent scaling and management. And with GPU acceleration, a Cloud Run service can be ready for inference in under 30 seconds.
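What the serving container exposes depends on the inference server you package (vLLM, Ollama, and so on), so the sketch below only shows the calling side under those assumptions: the chat service fetches an identity token for the GPU-backed Cloud Run service and posts the prompt to it. The /generate path and JSON fields are placeholders for whatever your chosen server actually exposes.

```python
# Sketch: one Cloud Run service (the chat agent) calling another Cloud Run
# service that hosts an open LLM on a GPU behind an inference server.
# The request path and JSON shape depend on which server you deploy,
# so treat them as placeholders.
import os
import requests
from google.auth.transport.requests import Request
from google.oauth2 import id_token

LLM_SERVICE_URL = os.environ["LLM_SERVICE_URL"]  # URL of the GPU-backed Cloud Run service

def call_open_llm(prompt: str) -> str:
    # Cloud Run services that require authentication accept an identity token
    # for the calling service account as a Bearer credential.
    token = id_token.fetch_id_token(Request(), LLM_SERVICE_URL)
    response = requests.post(
        f"{LLM_SERVICE_URL}/generate",  # placeholder path; varies by inference server
        headers={"Authorization": f"Bearer {token}"},
        json={"prompt": prompt, "max_tokens": 256},
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["text"]
```

Because each model lives in its own service, the chat agent and the LLM scale independently, and the GPU-backed service can scale to zero when nobody is chatting.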

Tailor your LLM with RAG
Beyond hosting and scaling LLMs, you’ll often need to tailor their responses to specific domains or datasets. This is where Retrieval-Augmented Generation (RAG) comes into play: a core technique for extending your LLM experience, and one that’s quickly becoming the standard for contextual customization.
Think of it this way: LLMs are trained on broad datasets, but your applications need to leverage your data. RAG uses a vector database, like AlloyDB, to store embeddings of your private data. When your application queries an LLM, RAG retrieves relevant embeddings, providing the LLM with the necessary context to generate highly specific and accurate responses.
Inference-as-a-Service comes into play in a few ways here. In this architecture, Cloud Run handles the core inference logic, orchestrating interactions between Vertex AI and AlloyDB. Specifically, it serves as the bridge for both fetching data from AlloyDB and passing queries to Vertex AI, effectively managing the entire RAG data flow.
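A sketch of that flow, assuming vertexai.init() has been called as in the earlier snippet, an AlloyDB "documents" table with a pgvector "embedding" column, and an illustrative embedding model name:

```python
# Sketch of the RAG data flow the Cloud Run service orchestrates:
# 1) embed the user question with Vertex AI, 2) retrieve similar chunks from
# AlloyDB (pgvector), 3) send question + retrieved context to the LLM.
# Table/column names and the embedding model are illustrative assumptions;
# the DSN is assumed to reach AlloyDB (e.g. via the AlloyDB Auth Proxy).
import os
import psycopg2
from vertexai.language_models import TextEmbeddingModel
from vertexai.generative_models import GenerativeModel

embedder = TextEmbeddingModel.from_pretrained("text-embedding-004")
llm = GenerativeModel("gemini-1.5-flash")

def rag_answer(question: str) -> str:
    # 1. Embed the question.
    query_vec = embedder.get_embeddings([question])[0].values

    # 2. Nearest-neighbor search in AlloyDB using the pgvector "<=>" operator.
    with psycopg2.connect(os.environ["ALLOYDB_DSN"]) as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT content FROM documents ORDER BY embedding <=> %s::vector LIMIT 5",
            ("[" + ",".join(str(v) for v in query_vec) + "]",),
        )
        context = "\n".join(row[0] for row in cur.fetchall())

    # 3. Ground the generation in the retrieved context.
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return llm.generate_content(prompt).text
```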

Let’s take an example
Consider a chatbot architecture that uses Cloud Run to host the chatbot. A developer writes the application using common chatbot tools such as Streamlit and LangChain. It can then run inference against LLMs hosted in Vertex AI Model Garden (or against another Cloud Run service) and store embeddings in AlloyDB. This gives you a customizable gen AI chatbot, all on a serverless runtime.
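A minimal sketch of such a chatbot, assuming the streamlit and langchain-google-vertexai packages and the same Vertex AI setup as above (this is not the referenced sample’s exact code):

```python
# Sketch of the chatbot front end: Streamlit for the UI, LangChain's Vertex AI
# chat model for inference. The model name is illustrative; the RAG retrieval
# step from the previous sketch could be slotted in before the invoke() call.
import streamlit as st
from langchain_google_vertexai import ChatVertexAI

llm = ChatVertexAI(model_name="gemini-1.5-flash")

st.title("Serverless gen AI chatbot")

if "history" not in st.session_state:
    st.session_state.history = []

# Replay the conversation so far (Streamlit reruns the script on each turn).
for role, text in st.session_state.history:
    with st.chat_message(role):
        st.markdown(text)

if prompt := st.chat_input("Ask me something"):
    st.session_state.history.append(("user", prompt))
    with st.chat_message("user"):
        st.markdown(prompt)

    reply = llm.invoke(prompt).content  # single-turn call; add history/RAG as needed
    st.session_state.history.append(("assistant", reply))
    with st.chat_message("assistant"):
        st.markdown(reply)
```

Packaged in a container, this front end deploys to Cloud Run like any other service and scales to zero between conversations.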

Get started 
To get started, visit this codelab, which shows you how to build a generative AI Python application using Cloud Run. If you want to test out Cloud Run with GPUs, try out this codelab.

AI Summary and Description: Yes

**Summary:** The text discusses the emerging approach of Inference-as-a-Service in the context of large language models (LLMs) and generative AI, featuring Google Cloud’s Cloud Run and Vertex AI. It emphasizes how this model can enhance application performance by reducing operational overhead and enabling scalability.

**Detailed Description:** The text explores the concept of Inference-as-a-Service and its relevance to developers working with large language models (LLMs). Here’s a breakdown of the major points:

– **Rise of LLMs and Generative AI:**
– LLMs are increasingly integral to applications.
– Most foundational LLMs are accessed through third-party APIs, creating potential bottlenecks.

– **Inference-as-a-Service Overview:**
– Similar to other cloud services (like SaaS and PaaS), Inference-as-a-Service simplifies the interaction between applications and machine learning models.
– It allows developers to focus on application code rather than on infrastructure management.

– **Cloud Run Advantages:**
– Google Cloud’s serverless platform, Cloud Run, facilitates hosting applications without handling underlying infrastructure.
– It offers a pay-as-you-go model allowing cost-efficient use of resources.

– **Hosting LLMs on Cloud Run:**
– Developers can run LLM applications on Cloud Run with minimal operational effort; they can use Google Cloud’s Vertex AI for model management.
– GPU support enhances performance and enables rapid scaling, while scale-to-zero lowers costs when the service is idle.

– **Integration of Retrieval-Augmented Generation (RAG):**
– RAG enhances LLM responses by tailoring them to specific datasets using embeddings stored in a vector database.
– This method allows applications to leverage private data for more relevant generative responses.

– **Example Architecture – Chatbot Implementation:**
– The text provides an example architecture where a chatbot application can be built using Cloud Run with LLMs from Vertex AI and embeddings from AlloyDB, demonstrating a practical application of the discussed technologies.

– **Getting Started:**
– Resources like codelabs for building a generative AI application with Python on Cloud Run are suggested, encouraging developers to start implementing these strategies.

Overall, the text is highly relevant to security and compliance professionals as it delves into modern approaches in AI application development, the use of managed cloud infrastructure, and the data privacy considerations that arise when tailoring models with private data. It emphasizes ongoing transformations in how LLMs are hosted and scaled, urging professionals to stay updated with these evolving methodologies.