Source URL: https://cloud.google.com/blog/products/ai-machine-learning/evaluate-ai-models-with-vertex-ai–llm-comparator/
Source: Cloud Blog
Title: Evaluate gen AI models with Vertex AI evaluation service and LLM comparator
Feedly Summary: It’s a persistent question: How do you know which generative AI model is the best choice for your needs? It all comes down to smart evaluation.
In this post, we’ll share how to perform pairwise model evaluations – a way of comparing two models directly against each other – using Vertex AI evaluation service and LLM Comparator. We’ll introduce each tool’s most useful features, explain how they help you evaluate LLM performance, and show how you can use them to create a robust evaluation framework.
Pairwise model evaluation to assess performance
Pairwise model evaluation means comparing two models directly against each other to assess their relative performance on a specific task. There are three main benefits to pairwise model evaluation for LLMs:
Make informed decisions: The increasing number and variety of LLMs mean you need to carefully evaluate and choose the best model for your specific task. Considering the strengths and weaknesses of each option is table stakes.
Define “better” quantitatively: Content generated by AI models, such as natural-language text or images, is usually unstructured, lengthy, and difficult to evaluate automatically without human intervention. Pairwise evaluation, combined with human inspection, helps define which response to each prompt is “better” in a way that stays close to human judgment.
Keep an eye on quality: LLMs should be continuously retrained and tuned with new data so that each version improves on its predecessors and keeps pace with the latest models.
The proposed evaluation process for LLMs.
Vertex AI evaluation service
The Gen AI evaluation service in Vertex AI lets you evaluate any generative model or application and benchmark the evaluation results against your own judgment, using your own evaluation criteria. It helps with:
Model selection among different models for specific use cases
Model configuration optimization with different model parameters
Prompt engineering for the preferred behavior and responses
Fine-tuning LLMs for improved accuracy, fairness, and safety
Optimizing RAG architectures
Migration between different versions of a model
Managing translation quality between different languages
Evaluating agents
Evaluating images and videos
It also supports model-based metrics for both pointwise and pairwise evaluations and computation-based metrics with ground-truth datasets of input and output pairs.
How to use Vertex AI evaluation service
The Vertex AI evaluation service can help you rigorously assess your generative AI models. You can define custom metrics, leveraging pre-built templates or your own expertise, to precisely measure performance against your specific goals. For standard NLP tasks, the service provides computation-based metrics like F1 scores for classification, BLEU for translation, and ROUGE-L for summarization.
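To make this concrete, here is a minimal sketch of a computation-based evaluation run, assuming the `vertexai.evaluation` Python SDK and its bring-your-own-response mode; the project ID, experiment name, and dataset are placeholders, and metric names should be checked against the current Gen AI evaluation service documentation.

```python
# Minimal sketch: computation-based metrics against a ground-truth dataset.
# Assumes the `vertexai.evaluation` SDK; project, experiment, and data are placeholders.
import pandas as pd
import vertexai
from vertexai.evaluation import EvalTask

vertexai.init(project="your-project-id", location="us-central1")

# Each row pairs a pre-generated model response with a reference answer.
eval_dataset = pd.DataFrame(
    {
        "response": ["The cat sat on the mat.", "Paris is the capital of France."],
        "reference": ["A cat sat on the mat.", "The capital of France is Paris."],
    }
)

# Computation-based metrics are referenced by name and scored against `reference`.
eval_task = EvalTask(
    dataset=eval_dataset,
    metrics=["exact_match", "bleu", "rouge_l_sum"],
    experiment="computation-metrics-demo",  # hypothetical experiment name
)

result = eval_task.evaluate()
print(result.summary_metrics)       # aggregate scores
print(result.metrics_table.head())  # per-row scores
```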
For direct model comparison, pairwise evaluations allow you to quantify which model performs better, as shown in the sketch below. Metrics like candidate_model_win_rate and baseline_model_win_rate are automatically calculated, and judge models provide explanations for their scoring decisions, offering valuable insights. You can also perform pairwise comparisons using computation-based metrics to compare against ground-truth data.
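The following sketch shows how such a pairwise comparison might be set up with the SDK’s `PairwiseMetric`, using a judge model and a pre-built metric prompt template; the model names, template name, and result column names are assumptions to verify against the documentation.

```python
# Sketch: pairwise comparison of a candidate model against a baseline.
# Model names, template name, and result keys are assumptions to verify.
import pandas as pd
import vertexai
from vertexai.evaluation import EvalTask, MetricPromptTemplateExamples, PairwiseMetric
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-project-id", location="us-central1")

prompts = pd.DataFrame({"prompt": ["Summarize the benefits of pairwise evaluation."]})

baseline_model = GenerativeModel("gemini-1.5-pro")     # placeholder baseline
candidate_model = GenerativeModel("gemini-1.5-flash")  # placeholder candidate

# A judge model compares the candidate's response with the baseline's response
# for each prompt, following a pre-built pairwise prompt template.
pairwise_quality = PairwiseMetric(
    metric="pairwise_text_quality",
    metric_prompt_template=MetricPromptTemplateExamples.get_prompt_template(
        "pairwise_text_quality"
    ),
    baseline_model=baseline_model,
)

result = EvalTask(
    dataset=prompts,
    metrics=[pairwise_quality],
    experiment="pairwise-demo",
).evaluate(model=candidate_model)

# Win rates plus per-example explanations from the judge model.
print(result.summary_metrics["pairwise_text_quality/candidate_model_win_rate"])
print(result.metrics_table[["prompt", "pairwise_text_quality/explanation"]])
```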
Beyond pre-built metrics, you have the flexibility to define your own, either through mathematical formulas or through prompts that guide “judge models” to score responses against the context of your user-defined criteria. Embedding-based metrics are also available for evaluating semantic similarity.
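As an illustration of a prompt-based custom metric, the sketch below defines a hypothetical `custom_groundedness` metric with `PointwiseMetric`; the metric name, criteria, rating scale, and dataset are all illustrative assumptions rather than a prescribed recipe.

```python
# Sketch: a user-defined, model-based metric scored by a judge model.
# The metric name, criteria, and rating scale are illustrative only.
import pandas as pd
from vertexai.evaluation import EvalTask, PointwiseMetric

groundedness_dataset = pd.DataFrame(
    {
        "prompt": ["Context: The Eiffel Tower is in Paris. Question: Where is it?"],
        "response": ["The Eiffel Tower is located in Paris, France."],
    }
)

custom_groundedness = PointwiseMetric(
    metric="custom_groundedness",  # hypothetical metric name
    metric_prompt_template=(
        "You are a strict evaluator. Rate the RESPONSE from 1 to 5 on whether "
        "every claim it makes is supported by the PROMPT. Return the rating "
        "and a brief explanation.\n\nPROMPT: {prompt}\nRESPONSE: {response}"
    ),
)

custom_eval_task = EvalTask(dataset=groundedness_dataset, metrics=[custom_groundedness])
custom_result = custom_eval_task.evaluate()
```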
Vertex AI Experiments and Metadata seamlessly integrate with the evaluation service, automatically organizing and tracking your datasets, results, and models. You can easily initiate evaluation jobs using the REST API or Python SDK and export results to Cloud Storage for further analysis and visualization.
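The snippet below sketches how an evaluation run might be tracked under a named experiment and its per-example results exported for later analysis, reusing the `eval_dataset` from the earlier sketch. It assumes the `gcsfs` package is installed so pandas can write directly to Cloud Storage; the bucket path, run name, and `experiment_run_name` usage are assumptions to check against the SDK reference.

```python
# Sketch: experiment tracking plus exporting per-example results.
# Assumes `gcsfs` is installed so pandas can write to gs:// paths directly.
tracked_task = EvalTask(
    dataset=eval_dataset,
    metrics=["rouge_l_sum"],
    experiment="summarization-eval",  # logged to Vertex AI Experiments
)
run_result = tracked_task.evaluate(experiment_run_name="run-001")  # placeholder run name

# Persist per-example scores for downstream analysis or visualization.
run_result.metrics_table.to_json(
    "gs://your-bucket/eval/run-001.json", orient="records", lines=True
)
```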
In essence, the Vertex AI evaluation service provides a comprehensive framework for:
Quantifying model performance: Using both standard and custom metrics.
Comparing models directly: Through pairwise evaluations and judge model insights.
Customizing evaluations: To meet your specific needs.
Streamlining your workflow: With integrated tracking and easy API access.
It also provides guidance and templates to help you define your own metrics, whether you start from those templates or build metrics from scratch based on your own experience with prompt engineering and generative AI.
LLM Comparator: An open-source tool for human-in-the-loop LLM evaluation
LLM Comparator is an evaluation tool developed by PAIR (People + AI Research) at Google, and is an active research project.
LLM Comparator’s interface is highly intuitive for side-by-side comparisons of different model outputs, making it an excellent tool to augment automated LLM evaluation with human-in-the-loop processes. The tool provides useful features to help you evaluate the responses from two LLMs side-by-side using a range of informative metrics, such as the win rates of Model A or B, grouped by prompt category. It is also simple to extend the tool with user-defined metrics, via a feature called Custom Functions.
The dashboards and visualizations of LLM Comparator, by PAIR at Google.
You can see the comparative performance of Model A and Model B across various metrics and prompt categories through the ‘Score Distribution’ and ‘Metrics by Prompt Category’ visualizations. In addition, the ‘Rationale Summary’ visualization provides insights into why one model outperforms another by visually summarizing the key rationales influencing the evaluation results.
The “Rationale Summary” panel visually explains why one model’s responses are determined to be better.
LLM Comparator is available as a Python package on PyPI and can be installed in a local environment. Pairwise evaluation results from the Vertex AI evaluation service can also be loaded into LLM Comparator using the provided libraries. To learn more about how to transform the automated evaluation results into JSON files, refer to the JSON data format and schema for LLM Comparator.
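As a rough illustration, the sketch below reshapes the pairwise metrics table from the earlier Vertex AI sketch into a JSON file for LLM Comparator. The field names follow the project’s documented schema as best we understand it, and the score convention and Vertex AI column names are assumptions; verify both against the LLM Comparator JSON schema documentation before relying on this.

```python
# Rough sketch: reshape Vertex AI pairwise results into LLM Comparator's JSON format.
# Field names, column names, and the score convention are assumptions to verify
# against the LLM Comparator schema documentation.
import json

examples = []
for _, row in result.metrics_table.iterrows():  # `result` from the pairwise sketch above
    choice = row["pairwise_text_quality/pairwise_choice"]  # assumed column name
    examples.append(
        {
            "input_text": row["prompt"],
            "tags": [],
            "output_text_a": row["baseline_model_response"],  # assumed column name
            "output_text_b": row["response"],                  # candidate response
            # Assumed convention: positive scores favor model A, negative favor model B.
            "score": 1.0 if choice == "BASELINE" else (-1.0 if choice == "CANDIDATE" else 0.0),
            "custom_fields": {},
        }
    )

comparator_data = {
    "metadata": {"source_path": "vertex-ai-pairwise-eval"},
    "models": [{"name": "baseline-model"}, {"name": "candidate-model"}],
    "examples": examples,
}

with open("pairwise_eval_for_llm_comparator.json", "w") as f:
    json.dump(comparator_data, f, indent=2)
```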
With features such as the Rationale Cluster visualization and Custom Functions, LLM Comparator can serve as an invaluable tool in the final stages of LLM evaluation where human-in-the-loop processes are needed to ensure overall quality.
Feedback from the field: How LLM Comparator adds value to Vertex AI evaluation service
By augmenting human evaluators with ready-to-use visualizations and automatically calculated performance metrics, LLM Comparator spares ML engineers much of the chore of building their own visualization and quality-monitoring tools. Thanks to LLM Comparator’s JSON data format and schema, the Vertex AI evaluation service and LLM Comparator can be integrated without significant development work.
We’ve heard from our teams that the most useful feature of LLM Comparator is the “Rationale Summary” visualization. It can be thought of as a kind of explainable AI (XAI) tool that shows why the judge model considers one of the two models better. The “Rationale Summary” visualization can also be used to understand how one language model behaves differently from the other, which is often important support for inferring why a model is more appropriate for specific tasks.
A limitation of LLM Comparator is that it supports only pairwise model evaluation, not simultaneous evaluation of multiple models. However, it already has the basic components for comparative LLM evaluation, and extending it to evaluate several models at once should not be a major technical hurdle; this could be an excellent way to contribute to the LLM Comparator project.
Conclusion
In this article, we discussed how to organize an LLM evaluation process with Vertex AI and LLM Comparator, an open-source LLM evaluation tool from PAIR. By combining the Vertex AI evaluation service and LLM Comparator, we’ve presented a semi-automated approach to systematically evaluate and compare the performance of diverse LLMs on Google Cloud. Get started with the Vertex AI evaluation service today.
We thank Rajesh Thallam, Skander Hannachi, and the Applied AI Engineering team for help with this blog post and guidance on overall best practices. We also thank Anant Nawalgaria for help with this blog post and technical guidance.
AI Summary and Description: Yes
Summary: The text outlines the importance of evaluating generative AI models, specifically through pairwise model evaluation methods using the Vertex AI evaluation service and LLM Comparator. It highlights how these tools allow for informed decision-making and improve model performance, making them invaluable for AI and security professionals seeking to optimize their generative AI capabilities.
Detailed Description: The provided text focuses on the evaluation of generative AI models through pairwise evaluations, which are crucial for choosing the best model for specific tasks. Here’s a detailed breakdown of the major points:
– **Pairwise Model Evaluation**:
– This method allows for direct comparison between two models, providing a clearer understanding of their relative performance on specific tasks.
– Benefits include:
– **Informed Decision-Making**: As the variety of LLMs (Large Language Models) increases, evaluating their strengths and weaknesses becomes essential for selecting the optimal model.
– **Quantitative Definition of “Better”**: The evaluation relies on human judgment to assess generated content that is often unstructured and lengthy.
– **Ongoing Improvement**: Continuous retraining and tuning of LLMs are essential to enhance performance and maintain competitive relevance.
– **Vertex AI Evaluation Service**:
– A comprehensive tool for evaluating generative models, allowing users to define custom evaluation metrics, optimize model configurations, and fine-tune models for accuracy, fairness, and safety.
– Key features include:
– Model selection and configuration optimization.
– Support for evaluating different content types (texts, images, translations).
– Capability to define both standard and custom metrics for performance evaluation.
– **LLM Comparator**:
– An open-source tool developed to facilitate human-in-the-loop evaluations, allowing for side-by-side comparisons of different model outputs.
– Its key functionalities include:
– Visualization tools to enhance understanding of model performance.
– Custom functions that let users add personalized evaluation metrics.
– Integration capabilities with Vertex AI for seamless evaluation processes.
– **Integration and Collaboration**:
– The LLM Comparator adds value to the Vertex AI evaluation service by providing visualizations that streamline quality assessment practices for ML engineers.
– The “Rationale Summary” visualization is highlighted as an explainable AI tool that aids understanding of why one model performs better than another.
– **Limitations and Future Development**:
– The LLM Comparator is currently limited to pairwise evaluations, but there is potential for it to evolve to support simultaneous evaluations of multiple models.
In summary, the integration of Vertex AI evaluation services and LLM Comparator forms a robust framework for evaluating and optimizing generative AI models, significantly benefiting professionals in AI, security, and compliance fields. By employing these tools, organizations can enhance their operational efficiency through better model utilization and performance tracking.