Cloud Blog: Introducing agent evaluation in Vertex AI Gen AI evaluation service

Source URL: https://cloud.google.com/blog/products/ai-machine-learning/introducing-agent-evaluation-in-vertex-ai-gen-ai-evaluation-service/
Source: Cloud Blog
Title: Introducing agent evaluation in Vertex AI Gen AI evaluation service

Feedly Summary: Comprehensive agent evaluation is essential for building the next generation of reliable AI. It’s not enough to simply check the outputs; we need to understand the “why” behind an agent’s actions – its reasoning, decision-making process, and the path it takes to reach a solution.
That’s why today, we’re thrilled to announce that agent evaluation in Vertex AI Gen AI evaluation service is now in public preview. This new feature empowers developers to rigorously assess and understand their AI agents. It includes a powerful set of evaluation metrics specifically designed for agents built with different frameworks, and provides native agent inference capabilities to streamline the evaluation process.
In this post, we’ll explore how evaluation metrics work and share an example of how you can apply this to your agents.


Evaluate agents using Vertex AI Gen AI evaluation service
Our evaluation metrics can be grouped in two categories: final response and trajectory evaluation. 
Final response asks a simple question: does your agent achieve its goals? You can define custom final response criteria to measure success according to your specific needs. For example, you can assess whether a retail chatbot provides accurate product information or if a research agent summarizes findings effectively, using appropriate tone and style.
To look below the surface, we offer trajectory evaluation to analyze the agent’s decision-making process. Trajectory evaluation is crucial for understanding your agent’s reasoning, identifying potential errors or inefficiencies, and ultimately improving performance. We offer six trajectory evaluation metrics, illustrated with a short sketch after the list:
1. Exact match: Requires the AI agent to produce a sequence of actions (a "trajectory") that perfectly mirrors the ideal solution. 
2. In-order match: The agent’s trajectory needs to include all the necessary actions in the correct order, but it might also include extra, unnecessary steps. Imagine following a recipe correctly but adding a few extra spices along the way.
3. Any-order match: Even more flexible, this metric only cares that the agent’s trajectory includes all the necessary actions, regardless of their order. It’s like reaching your destination, regardless of the route you take.
4. Precision: This metric focuses on the accuracy of the agent’s actions. It calculates the proportion of actions in the predicted trajectory that are also present in the reference trajectory. A high precision means the agent is making mostly relevant actions.
5. Recall: This metric measures the agent’s ability to capture all the essential actions. It calculates the proportion of actions in the reference trajectory that are also present in the predicted trajectory. A high recall means the agent is unlikely to miss crucial steps.
6. Single-tool use: This metric checks for the presence of a specific action within the agent’s trajectory. It’s useful for assessing whether an agent has learned to utilize a particular tool or capability.
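
To make the differences between these metrics concrete, here is a toy, framework-agnostic sketch of how precision, recall, and in-order match could be scored for a single predicted trajectory against a reference trajectory. The action names are hypothetical and this is not the service's implementation, just an illustration of the definitions above.

```python
# Toy illustration of trajectory metrics (hypothetical action names;
# not the service's implementation).
reference = ["get_product_details", "get_product_price", "send_reply"]
predicted = ["get_product_details", "check_inventory", "get_product_price", "send_reply"]

# Precision: share of predicted actions that also appear in the reference.
precision = sum(a in reference for a in predicted) / len(predicted)   # 3/4 = 0.75

# Recall: share of reference actions that also appear in the prediction.
recall = sum(a in predicted for a in reference) / len(reference)      # 3/3 = 1.0

# In-order match: every reference action appears in the prediction, in order,
# with extra steps allowed (consuming an iterator enforces the ordering).
it = iter(predicted)
in_order_match = all(action in it for action in reference)            # True

# Exact match would require predicted == reference, which is False here.
print(precision, recall, in_order_match)
```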
Compatibility meets flexibility 
Vertex AI Gen AI evaluation service supports a variety of agent architectures. 
With today’s launch, you can evaluate agents built with Reasoning Engine (LangChain on Vertex AI), the managed runtime for your agentic applications on Vertex AI. We also support agents built with open-source frameworks, including LangChain, LangGraph, and CrewAI – and we are planning to support upcoming Google Cloud services for building agents.
For maximum flexibility, you can evaluate agents using a custom function that processes prompts and returns responses. To make your evaluation experience easier, we offer native agent inference and automatically log all results in Vertex AI experiments. 
Agent evaluation in action
Let’s say you have the following LangGraph customer support agent, and you aim to assess both the responses it generates and the sequence of actions (or "trajectory") it undertakes to produce those responses.
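
The original post's agent definition isn't reproduced here, so below is a minimal sketch of what such a LangGraph customer support agent could look like, assuming LangGraph's prebuilt ReAct-style agent with two hypothetical product tools. The model name, tool names, and product data are illustrative assumptions, not the exact agent from the post.

```python
# A minimal sketch of a LangGraph customer support agent (illustrative only:
# model name, tools, and data are assumptions, not the exact agent from the post).
from langchain_core.tools import tool
from langchain_google_vertexai import ChatVertexAI
from langgraph.prebuilt import create_react_agent


@tool
def get_product_details(product_name: str) -> str:
    """Returns basic details about a product."""
    details = {
        "smartphone": "A cutting-edge smartphone with an advanced camera.",
        "coffee": "A rich, aromatic blend of ethically sourced beans.",
    }
    return details.get(product_name, "Product details not found.")


@tool
def get_product_price(product_name: str) -> str:
    """Returns the price of a product."""
    prices = {"smartphone": "$500", "coffee": "$10"}
    return prices.get(product_name, "Price not found.")


# Bind a Vertex AI model to the tools with LangGraph's prebuilt ReAct agent.
model = ChatVertexAI(model_name="gemini-1.5-pro")
agent = create_react_agent(model, tools=[get_product_details, get_product_price])

# Example invocation: the agent decides which tool(s) to call, then answers.
result = agent.invoke({"messages": [("user", "How much does the smartphone cost?")]})
```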

To assess an agent using Vertex AI Gen AI evaluation service, you start by preparing an evaluation dataset. This dataset should ideally contain the following elements:

User prompt: This represents the input that the user provides to the agent.

Reference trajectory: This is the expected sequence of actions that the agent should take to provide the correct response.

Generated trajectory: This is the actual sequence of actions that the agent took to generate a response to the user prompt.

Response: This is the generated response, given the agent’s sequence of actions.

A sample evaluation dataset is shown below.
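
Since the sample table itself isn't reproduced here, the sketch below shows one plausible shape for such a dataset as a pandas DataFrame. The column names and trajectory format are assumptions for illustration; check the Vertex AI documentation for the exact schema your metrics expect.

```python
# A hedged sketch of an evaluation dataset (column names and trajectory
# structure are assumptions; consult the documentation for the exact schema).
import pandas as pd

byod_eval_sample_dataset = pd.DataFrame({
    "prompt": ["How much does the smartphone cost?"],
    "reference_trajectory": [[
        {"tool_name": "get_product_price", "tool_input": {"product_name": "smartphone"}},
    ]],
    "predicted_trajectory": [[
        {"tool_name": "get_product_price", "tool_input": {"product_name": "smartphone"}},
    ]],
    "response": ["The smartphone costs $500."],
})
```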

After you gather your evaluation dataset, define the metrics that you want to use to evaluate the agent. For a complete list of metrics and their interpretations, refer to Evaluate Gen AI agents. Some metrics you can define are listed here:

```python
response_tool_metrics = [
    "trajectory_exact_match",
    "trajectory_in_order_match",
    "safety",
    response_follows_trajectory_metric,
]
```

Notice that the response_follows_trajectory_metric is a custom metric that you can define to evaluate your agent. 
Standard text generation metrics, such as coherence, may not be sufficient when evaluating AI agents that interact with environments, as these metrics primarily focus on text structure. Agent responses should be assessed based on their effectiveness within the environment. Vertex AI Gen AI evaluation service allows you to define custom metrics, like response_follows_trajectory_metric, that assess whether the agent’s response logically follows from its tool choices. For more information on these metrics, please refer to the official notebook.
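
One way such a custom metric could be defined is as a model-based pointwise metric, where an autorater judges the response against the predicted trajectory. The sketch below assumes the PointwiseMetric and PointwiseMetricPromptTemplate helpers from the evaluation SDK; the criteria wording and rubric are illustrative, so verify the exact API against the SDK reference and the official notebook.

```python
# A hedged sketch of a custom, model-based metric (criteria wording and rubric
# are illustrative; verify the exact API in the SDK reference and notebook).
from vertexai.preview.evaluation import PointwiseMetric, PointwiseMetricPromptTemplate

response_follows_trajectory_prompt_template = PointwiseMetricPromptTemplate(
    criteria={
        "follows_trajectory": (
            "Evaluate whether the agent's response logically follows from the "
            "sequence of tool calls it made (the predicted trajectory)."
        ),
    },
    rating_rubric={
        "1": "The response follows logically from the trajectory.",
        "0": "The response does not follow logically from the trajectory.",
    },
    input_variables=["prompt", "predicted_trajectory"],
)

response_follows_trajectory_metric = PointwiseMetric(
    metric="response_follows_trajectory",
    metric_prompt_template=response_follows_trajectory_prompt_template,
)
```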
With your evaluation dataset and metrics defined, you can now run your first agent evaluation job on Vertex AI. Please see the code sample below.

```python
# Import libraries
import vertexai
from vertexai.preview.evaluation import EvalTask

# Initiate Vertex AI session
vertexai.init(
    project="my-project-id",
    location="my-location",
    experiment="evaluate-langgraph-agent",
)

# Define an EvalTask
response_eval_tool_task = EvalTask(
    dataset=byod_eval_sample_dataset,
    metrics=response_tool_metrics,
)

# Run evaluation
response_eval_tool_result = response_eval_tool_task.evaluate(
    experiment_run_name="response-over-tools",
)
```

To run the evaluation, instantiate an `EvalTask` using the predefined dataset and metrics, then run an evaluation job using the `evaluate` method. Vertex AI Gen AI evaluation service tracks the resulting evaluation as an experiment run within Vertex AI Experiments, the managed experiment tracking service on Vertex AI. The evaluation results can be viewed both within the notebook and in the Vertex AI Experiments UI. If you’re using Colab Enterprise, you can also view the results in the Experiment side panel as shown below.

Vertex AI Gen AI evaluation service offers summary and metrics tables, providing detailed insights into agent performance. This includes individual user input, trajectory results, and aggregate results for all user input and trajectory pairs across all requested metrics.
Access to these granular evaluation results enables you to create meaningful visualizations of agent performance, including bar and radar charts like the one below:
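
For example, assuming the returned evaluation result exposes aggregate summary metrics and a per-row metrics table (as the SDK's EvalResult object does), a minimal sketch of pulling the results and charting one metric might look like the following; the attribute and column names should be checked against your actual output.

```python
# A minimal sketch of inspecting results (attribute and column names are
# assumptions based on the SDK's EvalResult; verify against your output).
import matplotlib.pyplot as plt

print(response_eval_tool_result.summary_metrics)       # aggregate scores per metric
metrics_df = response_eval_tool_result.metrics_table   # one row per evaluated prompt

# Bar chart of a single trajectory metric across evaluation rows.
metric_col = "trajectory_in_order_match/score"  # hypothetical column name
if metric_col in metrics_df.columns:
    metrics_df[metric_col].plot(kind="bar", title="In-order match per example")
    plt.show()
```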

Get started today
Explore the Vertex AI Gen AI evaluation service in public preview and unlock the full potential of your agentic applications.
Documentation

Evaluate gen AI agents 

Notebooks

Evaluating a LangGraph agent

Evaluating a CrewAI agent

Evaluating LangChain agent on Vertex AI Reasoning Engine

Evaluating LangGraph agent on Vertex AI Reasoning Engine 

Evaluating CrewAI agent on Vertex AI Reasoning Engine

AI Summary and Description: Yes

Summary: The text introduces the Vertex AI Gen AI evaluation service, which is in public preview, designed to aid developers in assessing AI agents more comprehensively than just output evaluation. It highlights the importance of understanding an agent’s decision-making process and offers a range of evaluation metrics to enhance the reliability and effectiveness of AI applications.

Detailed Description:
The introduction of the Vertex AI Gen AI evaluation service represents a significant advancement in AI evaluation frameworks, focusing on comprehensive metrics that not only assess output but also delve into the reasoning and decision-making processes behind AI agents. Here are the key highlights:

* **Enhanced Evaluation Metrics**:
– The evaluation service introduces two primary categories of metrics: **final response evaluation** and **trajectory evaluation**.
– Final response metrics determine if an agent achieves its goals based on customizable success criteria.
– Example: Assessing if a chatbot provides accurate product details.
– Trajectory evaluation examines the sequence of actions an agent takes, which is vital for identifying errors and improving performance. Six distinct trajectory evaluation metrics are detailed:
1. **Exact Match**: Requires actions to exactly mirror the ideal solution.
2. **In-Order Match**: Actions must follow the correct sequence but may include unnecessary steps.
3. **Any-Order Match**: Only requires the presence of all necessary actions, irrespective of their order.
4. **Precision**: Measures the relevance of actions taken as compared to the reference.
5. **Recall**: Assesses the agent’s ability to perform all essential actions.
6. **Single-tool Use**: Checks if a specific action is present in the agent’s trajectory.

* **Compatibility and Flexibility**:
– The service supports various agent architectures including Reasoning Engine and open-source frameworks like LangChain and CrewAI, allowing a broad range of agents to be evaluated.

* **Custom Evaluation Framework**:
– Developers can define their own metrics that go beyond standard text generation measures, tailoring evaluations to specific use cases and environments.
– The platform provides built-in functionalities for logging and analyzing results, streamlining the process of evaluating and improving AI agents.

* **Practical Implementation**:
– The text includes guidance on preparing an evaluation dataset, defining metrics, and an example evaluation code snippet. This practical approach enables developers to effectively use the evaluation service.

* **Visual Representation of Results**:
– The service facilitates summary and metrics tables that present detailed insights into agent performance. Visualization tools such as bar and radar charts can help in presenting the evaluation results clearly.

In summary, the Vertex AI Gen AI evaluation service is positioned as a vital tool for developers aiming to ensure the reliability, transparency, and effectiveness of their AI agents, ultimately enhancing the field of AI deployment in various applications.