Cloud Blog: Multimodal agents tutorial: How to use Gemini, Langchain, and LangGraph to build agents for object detection

Source URL: https://cloud.google.com/blog/products/ai-machine-learning/build-multimodal-agents-using-gemini-langchain-and-langgraph/
Source: Cloud Blog
Title: Multimodal agents tutorial: How to use Gemini, Langchain, and LangGraph to build agents for object detection

Feedly Summary: Here’s a common scenario when building AI agents that might feel confusing: How can you use the latest Gemini models and open-source frameworks like LangChain and LangGraph to create multimodal agents that can detect objects?
Detecting objects is critically important for use cases from content moderation to multimedia search and retrieval. LangChain provides tools to chain together LLM calls and external data. LangGraph provides a graph structure for building more controlled and complex multi-agent apps.
In this post, we’ll show you which decisions you need to make to combine Gemini, LangChain and LangGraph to build multimodal agents that can identify objects. This will provide a foundation for you to start building enterprise use cases like: 

Content moderation: Advertising policies, movie ratings, brand infringement

Object identification: Using different sources of data to verify if an object exists on a map

Multimedia search and retrieval: Finding files that contain a specific object

First decision: No-code/low-code, or custom agents?
The first decision enterprises have to make is whether to use no-code/low-code options or build custom agents. If you are building a simple agent like a customer service chatbot, you can use Google’s Vertex AI Agent Builder to build it in a few minutes, or start from the pre-built agents available in the Google Agentspace Agent Gallery.
But if your use case requires orchestration of multiple agents and integration with custom tooling, you will have to build custom agents, which leads to the next question.
Second decision: What agentic framework to use?
It’s hard to keep up with so many agentic frameworks releasing new features every week. Top contenders include CrewAI, Autogen, LangGraph and Google’s ADK. Some of them, like ADK and CrewAI, have higher levels of abstraction, while others, like LangGraph, allow a higher degree of control.
That’s why in this post, we center the discussion on building a custom agent using the open-source LangChain and LangGraph as the agentic framework, and Gemini 2.0 Flash as the LLM brain.
Code deep dive
This example code identifies an object in an image, in an audio file, and in a video. In this case we will use a dog as the object to be identified. We have different agents (image analysis agent, audio analysis agent, and a video analysis agent) performing different tasks but all working together towards a common goal, object identification.

Generative AI workflow for object detection

This gen AI workflow entails a user asking the agent to verify if a specific object exists in the provided files. The Orchestrator Agent will call the relevant worker agents (image_agent, audio_agent, and video_agent), passing along the user question and the relevant files. Each worker agent will call its respective tooling to convert the provided file to base64 encoding. The final finding of each agent is then passed back to the Orchestrator Agent, which synthesizes the findings and makes the final determination. This code can be used as a starting-point template whenever you need an agent to reason and make a decision or draw conclusions from different sources.
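The flow described above can be sketched in plain Python. This is a minimal, framework-free illustration and not the post's actual code: it deliberately avoids LangChain and LangGraph, and `fake_model` is a stub standing in for a Gemini call. The names `encode_file`, `make_worker`, and `orchestrate`, the payload shape, and the "any worker says yes" synthesis rule are all assumptions made for illustration.

```python
import base64
import mimetypes
import tempfile
from pathlib import Path
from typing import Callable, Dict

# Hypothetical model interface: takes a prompt and a payload, returns text.
ModelFn = Callable[[str, dict], str]
WorkerFn = Callable[[str, str], str]


def encode_file(path: str) -> dict:
    """Tool step: read a file and return its base64 text plus a guessed MIME type."""
    mime_type, _ = mimetypes.guess_type(path)
    data = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return {"mime_type": mime_type or "application/octet-stream", "data": data}


def make_worker(modality: str, call_model: ModelFn) -> WorkerFn:
    """Build a worker agent that encodes its file and asks the model about it."""
    def worker(question: str, path: str) -> str:
        payload = encode_file(path)
        return call_model(f"[{modality}] {question}", payload)
    return worker


def orchestrate(question: str, files: Dict[str, str],
                workers: Dict[str, WorkerFn]) -> dict:
    """Call the worker for each provided file, then synthesize the findings."""
    findings = {name: workers[name](question, path) for name, path in files.items()}
    # Assumed synthesis rule: the object is present if any worker answers "yes".
    found = any(f.strip().lower().startswith("yes") for f in findings.values())
    return {"findings": findings, "object_found": found}


def fake_model(prompt: str, payload: dict) -> str:
    """Stub standing in for Gemini: answers yes only for image payloads."""
    return "yes" if payload["mime_type"].startswith("image/") else "no"


def demo() -> dict:
    """Run the pipeline on two tiny placeholder files."""
    with tempfile.TemporaryDirectory() as tmp:
        image = Path(tmp) / "photo.png"
        audio = Path(tmp) / "clip.wav"
        image.write_bytes(b"not a real image")
        audio.write_bytes(b"not real audio")
        workers = {
            "image_agent": make_worker("image", fake_model),
            "audio_agent": make_worker("audio", fake_model),
        }
        files = {"image_agent": str(image), "audio_agent": str(audio)}
        return orchestrate("Is there a dog?", files, workers)
```

Calling `demo()` runs both stub workers and returns their findings along with the combined verdict. In a real build, the stub would be replaced by a model call and the synthesis step would itself be handled by the Orchestrator Agent's LLM rather than a fixed rule.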
If you want to create multi-agent systems with ADK, here is a video production agent built by a Googler that generates video commercials from user prompts, using Veo for video content generation, Lyria for composing music, and Google Text-to-Speech for narration. This example shows that many ingredients can be combined to meet your agentic goals, in this case an AI agent acting as a production studio. If you want to try ADK, here is an ADK Quickstart to help you kick things off.
Third decision: Where to deploy the agents? 
If you are building a simple app that needs to go live quickly, Cloud Run is an easy way to deploy it. You can follow the same instructions you would use to deploy any serverless web app on Cloud Run. Watch this video of building AI agents on Cloud Run. However, if you want an enterprise-grade managed runtime with quality evaluation, context management, and monitoring, Agent Engine is the way to go. Here is a quick start for Agent Engine. Agent Engine is a fully managed runtime that you can integrate with many of the previously mentioned frameworks, such as ADK, LangGraph, and CrewAI (see the image below, from the official Google Cloud Docs).
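As an aside on the Cloud Run option, here is a minimal sketch of a container entry point that uses only the Python standard library. Cloud Run passes the port to listen on through the `PORT` environment variable. Everything else here is an assumption for illustration, including the `run_agent` placeholder and the JSON request shape.

```python
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


def run_agent(question: str) -> dict:
    """Hypothetical placeholder for invoking the orchestrator agent."""
    return {"question": question, "answer": "not implemented"}


def handle_body(raw: bytes) -> tuple:
    """Parse a JSON request body and return (status_code, response_bytes)."""
    try:
        question = json.loads(raw)["question"]
    except (ValueError, KeyError, TypeError):
        error = {"error": "expected JSON with a 'question' field"}
        return 400, json.dumps(error).encode()
    return 200, json.dumps(run_agent(question)).encode()


class AgentHandler(BaseHTTPRequestHandler):
    """Accept POST requests whose JSON body carries the user question."""

    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        status, body = handle_body(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def main() -> None:
    """Serve on the port Cloud Run provides, defaulting to 8080 locally."""
    port = int(os.environ.get("PORT", "8080"))
    HTTPServer(("0.0.0.0", port), AgentHandler).serve_forever()
```

Calling `main()` from the container's start command starts the server. Keeping the parsing logic in `handle_body` means it can be tested without opening a socket.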

Get started
Building intelligent agents with generative AI, especially those capable of multimodal understanding, is akin to solving a complex puzzle. Many developers are finding that a prototypical agentic build involves a LangChain agent with Gemini Flash as the LLM. This post explored how to combine the power of Gemini models with open-source frameworks like LangChain and LangGraph. To get started right away, use this ADK Quickstart or visit our Agent Development GitHub.

AI Summary and Description: Yes

Summary: The text discusses the creation of multimodal AI agents using the Gemini models and open-source frameworks LangChain and LangGraph. It highlights practical applications in areas such as content moderation, object identification, and multimedia search and retrieval, while also addressing key decisions for enterprises when building AI agents.

Detailed Description:
The text provides valuable insights into the development of AI agents that can perform tasks such as object detection using generative AI technologies. The integration of frameworks like LangChain and LangGraph with Gemini models emphasizes the importance of multimodal capabilities for real-world applications. Here are the major points covered:

– **Multimodal Agents:** The focus is on building agents capable of understanding and processing multiple forms of data (e.g., images, audio, video) simultaneously, which is essential for applications like:
  – **Content Moderation:** Managing advertising policies, movie ratings, and brand infringement.
  – **Object Identification:** Verifying the existence of objects depicted in various data formats.
  – **Multimedia Search and Retrieval:** Locating files based on specific object content.

– **Decisions for Building Agents:**
  – **No-code/Low-code vs. Custom Agents:** Enterprises must decide whether to use simple low-code options for basic applications, like chatbots, or opt for custom-built agents requiring more sophisticated orchestration.
  – **Selecting the Agentic Framework:** Given the rapid development of frameworks, developers have options such as CrewAI, Autogen, LangGraph, and Google’s ADK, each offering different levels of abstraction and control.

– **Technical Implementation:**
  – **Code Deep Dive:** The text illustrates how to identify an object across different media types (image, audio, video) using separate agents working collaboratively toward a shared goal.
  – **Generative AI Workflow:** The workflow involves an orchestrator agent and various worker agents tasked with converting inputs into formats suitable for analysis, culminating in a collective conclusion.

– **Deployment Considerations:**
  – Recommendations are made regarding deployment options, such as leveraging Google Cloud’s Cloud Run for quick and simple applications versus utilizing the Agent Engine for enterprise-grade solutions that require enhanced management, monitoring, and evaluation capabilities.

Key Takeaways for Security and Compliance Professionals:
– The development of multimodal AI agents necessitates strategic planning around data security, given the diverse types of data being processed.
– The choice of agent frameworks and deployment strategies may have implications for data governance and compliance with regulations governing data use and privacy.
– As organizations integrate AI into workflows, they must remain vigilant about potential vulnerabilities introduced by using diverse data sources and technologies.

This exploration of AI agent development underscores the intersection of innovation in AI technology and the need for robust security and compliance frameworks in enterprise applications.