Cloud Blog: Tutorial: How to use the Gemini Multimodal Live API for QA

Source URL: https://cloud.google.com/blog/topics/developers-practitioners/gemini-live-api-real-time-ai-for-manufacturing/
Source: Cloud Blog
Title: Tutorial: How to use the Gemini Multimodal Live API for QA

Feedly Summary: The Gemini Multimodal Live API is a powerful tool that allows developers to stream data, such as video and audio, to a generative AI model and receive responses in real time. Unlike traditional APIs that require a complete data upload before processing can begin, this "live" or "streaming" capability enables a continuous, two-way conversation with the AI, making it possible to analyze events as they unfold.

This real-time interaction unlocks a new class of applications, transforming AI from a static analysis tool into a dynamic, active participant in live workflows. The ability to process and reason over multiple data types simultaneously (e.g., seeing video, reading text, understanding audio) allows for complex, context-aware tasks that were previously impossible to automate.

One example of where the Live API can work is high-speed manufacturing. This tutorial will show you how to leverage the Gemini API to build an automated quality inspection system that overcomes the common challenges of manual QA.

In this blog, you will learn how to create a system that uses a standard camera feed to:

- Analyze products on a production line in real time.
- Identify products by reading barcodes or QR codes.
- Detect, classify, and measure visual defects simultaneously.
- Generate structured reports for every defect.
- Trigger instant alerts for severe issues.

Prerequisites

- A Google Cloud Platform (GCP) account with billing enabled.
- Familiarity with basic cloud concepts and services like Cloud Run and BigQuery.
- Basic knowledge of Python.
- An enabled Gemini API key.


System architecture
The architecture for this system is designed to be serverless, scalable, and resilient, built on the robust foundation of Google Cloud services. It is composed of two primary microservices running on Cloud Run.
Here is the step-by-step workflow:
1. Data ingestion: A standard IP camera positioned over the assembly line streams video of products passing by. This feed is directed to the primary Cloud Run service.
2. Inspection service (Cloud Run): This containerized application is the brain of the operation.

Gemini Multimodal Live API: The service streams the video data to Gemini. It uses a dynamic prompt that might be pulled from a database, allowing it to apply different inspection criteria for different products on the same line. Gemini processes the stream, executing the multimodal task of reading the product ID and performing the visual inspection in real-time.
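The dynamic prompt lookup could be as simple as the following sketch, where an in-memory dict stands in for the prompt database; the SKU prefixes and inspection criteria are illustrative, not from the original tutorial.

```python
# Minimal sketch of a dynamic prompt lookup. A dict stands in for the prompt
# database mentioned above; SKU prefixes and criteria are illustrative.
DEFAULT_PROMPT = (
    "Inspect the product in this video frame for visible defects and "
    "return your findings as a structured JSON object."
)

PROMPTS_BY_SKU_PREFIX = {
    "SKU-XT": (
        "Identify the product SKU by decoding the QR code, then inspect the "
        "brushed aluminum casing for scratches longer than 2mm, dents, and "
        "discoloration. Return a structured JSON object."
    ),
    "SKU-PL": (
        "Identify the product SKU, then inspect the polymer housing for "
        "cracks, warping, and flash. Return a structured JSON object."
    ),
}

def prompt_for(sku: str) -> str:
    """Return the inspection prompt matching the product's SKU prefix."""
    for prefix, prompt in PROMPTS_BY_SKU_PREFIX.items():
        if sku.startswith(prefix):
            return prompt
    return DEFAULT_PROMPT
```

With this in place, the same camera feed can be inspected against different criteria simply by swapping which prompt is streamed alongside the frames.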

The service then formats the rich output from Gemini—including product ID, defect type, measurements, and location—into a structured JSON object for downstream processing.
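That formatting step might look like the following sketch; the field names and the `build_defect_record` helper are assumptions for illustration and should be aligned with your own downstream schema.

```python
import json
from datetime import datetime, timezone

def build_defect_record(product_sku, line_id, defect_type, location, length_mm):
    """Assemble Gemini's inspection output into the JSON payload sent downstream.

    Field names are illustrative; align them with your BigQuery schema.
    """
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "product_sku": product_sku,
        "line_id": line_id,
        "defect": {
            "type": defect_type,
            "location": location,
            "length_mm": length_mm,
        },
    }

record = build_defect_record("SKU-XT789-BLK", "Line 4", "Scratch", "top casing", 3.4)
payload = json.dumps(record)  # body of the POST to the alerting & logging service
```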

3. Alerting & logging service (Cloud Run): This second Cloud Run service ingests the structured JSON data and handles all reporting, logging, and notification tasks.

Data logging: The service immediately writes the detailed defect record to BigQuery. This creates a historical, queryable database of all quality events, essential for long-term analytics.
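The write to BigQuery can be a straightforward streaming insert. Below is a sketch assuming the google-cloud-bigquery client library; the table ID and column names are hypothetical and must match your own dataset.

```python
def flatten_for_bigquery(record: dict) -> dict:
    """Flatten the nested defect record into one row for the defects table.

    Column names are illustrative; match them to your table schema.
    """
    return {
        "timestamp": record["timestamp"],
        "product_sku": record["product_sku"],
        "line_id": record["line_id"],
        "defect_type": record["defect"]["type"],
        "defect_location": record["defect"]["location"],
        "defect_length_mm": record["defect"]["length_mm"],
    }

def log_defect(record: dict) -> None:
    """Stream one defect row into BigQuery (table ID is a placeholder)."""
    # Requires `pip install google-cloud-bigquery` and application credentials.
    from google.cloud import bigquery
    client = bigquery.Client()
    errors = client.insert_rows_json(
        "my-project.quality.defect_events", [flatten_for_bigquery(record)]
    )
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")
```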

Gemini 2.5 Flash: This model provides an additional layer of reasoning. It can take the raw defect data and summarize it into a concise, human-readable alert message. More advanced logic can be applied, such as correlating recent events to spot a systemic issue.
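The correlation itself can be plain application logic rather than a model call. A minimal sketch, assuming the defect history records logged to BigQuery and an illustrative 10-minute window:

```python
from datetime import datetime, timedelta

def _parse(ts: str) -> datetime:
    # fromisoformat only accepts a trailing "Z" on Python 3.11+, so normalize it.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def repeated_defects(history, defect_type, window_minutes=10):
    """Count recent defects of the given type within the time window.

    `history` is a list of {"timestamp": ISO-8601 str, "defect_type": str}
    dicts, matching the records logged to BigQuery; the window is an assumption.
    """
    if not history:
        return 0
    cutoff = max(_parse(h["timestamp"]) for h in history) - timedelta(minutes=window_minutes)
    return sum(
        1 for h in history
        if h["defect_type"] == defect_type and _parse(h["timestamp"]) >= cutoff
    )
```

A count above a threshold (say, three of the same defect in ten minutes) is what would prompt Gemini Flash to frame the alert as a possible systemic issue rather than an isolated event.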

Secret Manager: All API keys and credentials for notification services are securely stored and managed here, adhering to best practices for security.

Notification APIs (e.g., Gmail API, Google Chat API): Based on the severity and rules processed by Gemini, the service calls the appropriate APIs to dispatch alerts to the right people at the right time.
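The severity-based routing can be a small, explicit function. A sketch with hypothetical thresholds and channel names; the actual dispatch would call the Gmail or Google Chat APIs using credentials read from Secret Manager, which is not shown here.

```python
def alert_channels(severity: float) -> list:
    """Map a severity score (assumed 0-10 scale) to notification channels.

    Thresholds and channel names are illustrative; the real dispatch would
    call the Gmail or Google Chat APIs with secrets from Secret Manager.
    """
    if severity >= 8.0:
        return ["chat", "email"]   # critical: notify the line supervisor on both
    if severity >= 5.0:
        return ["chat"]            # moderate: post to the QA team's Chat space
    return []                      # low severity: log to BigQuery only, no alert
```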

Step-by-step implementation
Step 1: Set up the inspection service.
This service is the core of our system. It’s a containerized application deployed on Cloud Run that receives the video feed and interacts with Gemini. The key is the prompt sent to the Gemini API. It instructs the model to perform multiple tasks on a single video frame. This leverages Gemini’s powerful multimodal capabilities.
Sample prompt:

You are a quality control inspector for high-end electronics. In this video frame:
1. Identify the product SKU by decoding the QR code.
2. Inspect the brushed aluminum casing for any defects, specifically looking for:
   - Scratches longer than 2mm
   - Dents or dings
   - Discoloration or blemishes
3. For each defect found, provide its type, location on the casing, and estimated dimensions in millimeters.
4. Return your findings as a single, structured JSON object.
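On the service side, the model's reply has to be parsed and scored. A minimal sketch: the fence-stripping handles the common case where the model wraps its JSON in a markdown code block, and the defect-type weights are illustrative assumptions, not values from the tutorial.

```python
import json

# Illustrative severity weights per defect type; tune these for your product line.
TYPE_WEIGHTS = {"Scratch": 1.0, "Dent": 2.0, "Discoloration": 0.5}

def parse_inspection(response_text: str) -> dict:
    """Parse the model's JSON reply, tolerating a markdown code fence around it."""
    text = response_text.strip()
    if text.startswith("```"):
        text = text.strip("`")
        if text.startswith("json"):
            text = text[4:]
    return json.loads(text)

def severity(defect: dict) -> float:
    """Weighted severity score from defect type and measured size.

    A simplified stand-in for the weighted scoring described in the text;
    a fuller version would also weight the defect's location.
    """
    weight = TYPE_WEIGHTS.get(defect.get("type"), 1.0)
    size_mm = float(defect.get("length_mm", 0))
    return weight * size_mm
```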

When a defect is found, Gemini doesn't just flag it; it provides quantitative data (e.g., the length of a scratch). This data can be used to calculate a severity score based on weighted parameters like defect size, type, and location, removing human subjectivity from the process.

Step 2: Configure the alerting & logging service.
This second Cloud Run service acts on the data from the inspection service.

Logging: Upon receiving the structured JSON output from the first service, it immediately writes the full record to a BigQuery table. This creates a powerful, queryable history of all quality events.

Intelligent alerting: For high-severity defects, the service uses a model like Gemini Flash to perform an additional reasoning step. It can summarize the technical data into a clear alert or even correlate events to identify trends.

Sample prompt:

Given the following defect data, generate a concise, critical alert message for a line supervisor. Correlate this event with the provided history of recent defects.

Defect Data:
{
  "product_sku": "SKU-XT789-BLK",
  "line_id": "Line 4",
  "machine_id": "M-7",
  "defect_type": "Housing Crack",
  …
}

Recent History:
[
  {"timestamp": "2025-07-15T14:20:11Z", "defect_type": "Housing Crack"},
  {"timestamp": "2025-07-15T14:15:32Z", "defect_type": "Housing Crack"}
]

Generated alert: CRITICAL ALERT: 3rd 'Housing Crack' defect detected on Line 4 in the last 10 minutes. Possible systemic issue with molding machine M-7.

Based on the alert's severity, the service calls the appropriate APIs (e.g., Gmail API, Google Chat API) to send the message to the right stakeholders instantly.

Get started
By leveraging Gemini's multimodal capabilities on a scalable Google Cloud architecture, you can build a powerful quality intelligence platform. Check out these links to get started:

- Read the full Live API capabilities guide for key capabilities and configurations, including Voice Activity Detection and native audio features.
- Read the Tool use guide to learn how to integrate the Live API with tools and function calling.
- Read the Session management guide for managing long-running conversations.

AI Summary and Description: Yes

**Summary:** The text describes the capabilities and architecture of the Gemini Multimodal Live API, emphasizing its design for real-time data streaming to generative AI models. This innovation signifies a shift from traditional APIs, enabling dynamic interactions that can transform applications in sectors like manufacturing, especially with automated quality inspections.

**Detailed Description:**
The Gemini Multimodal Live API facilitates real-time engagement with generative AI models, allowing developers to stream various data types (video, audio, text) and receive immediate analyses and responses. This advancement is crucial for industries such as manufacturing, where real-time decision-making can enhance operational efficiency and product quality.

Key points include:

– **Real-Time Data Streaming:**
– Unlike traditional APIs requiring complete data uploads, the Live API enables continuous two-way interactions.
– This functionality allows for immediate analysis and decision-making based on streaming data.

– **Multimodal Capabilities:**
– The API can simultaneously process multiple data types (e.g., video feeds for visual inspection and QR code decoding), which enhances the complexity and scope of tasks it can handle.

– **Automated Quality Inspection Use Case:**
– A practical application is demonstrated with an automated quality inspection system in manufacturing.
– The system processes video data from a standard IP camera, analyzes products on a production line, identifies them, detects defects, and generates structured reports.

– **System Architecture:**
– Built on a serverless infrastructure utilizing Google Cloud services, specifically two microservices running on Cloud Run:
  1. **Inspection Service:** Streams video to the Gemini API in real time, which performs the multimodal analysis and returns structured results.
2. **Alerting & Logging Service:** Collects data from the inspection service, logs it in BigQuery for analytics, and generates alerts based on severity.

– **Data Structure and Notifications:**
– The API generates structured JSON outputs, which are essential for maintaining detailed records and alerting systems, aiding organizations in tracking quality events over time.
– Secret management for API keys enhances security by following best practices for sensitive data handling.

– **Implementation Steps:**
– Step-by-step guidance on configuring the inspection and alerting services for effective operation and alert generation based on defect detection.

Overall, the Gemini Multimodal Live API represents a significant advancement in AI and cloud computing, supporting dynamic, real-time processing while requiring practitioners to consider the implications for data privacy, compliance, and integrated security controls within their AI-driven applications.