Cloud Blog: Evaluate your gen media models with multimodal evaluation on Vertex AI

Source URL: https://cloud.google.com/blog/products/ai-machine-learning/evaluate-your-gen-media-models-on-vertex-ai/
Source: Cloud Blog
Title: Evaluate your gen media models with multimodal evaluation on Vertex AI

Feedly Summary: The world of generative AI is moving fast, with models like Lyria, Imagen, and Veo now capable of producing stunningly realistic and imaginative images and videos from simple text prompts. However, evaluating these models is still a steep challenge. Traditional human evaluation, while the gold standard, can be slow and costly, hindering rapid development cycles.
To address this, we’re thrilled to introduce Gecko, now available through Google Cloud’s Vertex AI Evaluation Service. Gecko is a rubric-based and interpretable autorater for evaluating generative AI models that empowers developers with a more nuanced, customizable, and transparent way to assess the performance of image and video generation models.
The challenge of evaluating generative models with auto-raters
Creating useful, performant auto-raters grows more challenging as the quality of generation dramatically improves. While specialised models can be efficient, they lack the interpretability developers need to understand model behavior and pinpoint areas for improvement. For instance, when evaluating how accurately a generated image depicts a prompt, a single score doesn't reveal why a model succeeded or failed.
Introducing Gecko: Interpretable, customizable, and performant evaluation
Gecko offers a fine-grained, interpretable, and customizable auto-rater. A Google DeepMind research paper shows that such an auto-rater can reliably evaluate image and video generation across a range of skills, reducing the dependency on costly human judgment. Notably, beyond its interpretability, Gecko exhibits strong performance and has already been instrumental in benchmarking the progress of leading models like Imagen.
Gecko makes evaluation interpretable with its clear, step-by-step, rubric-based approach. Let's walk through an example and use Gecko to evaluate generated media of a steaming cup of coffee and a croissant on a table.

Figure 1: Prompt and image pair we will use as our running example

Step 1: Semantic prompt decomposition.
Gecko leverages a Gemini model to first break down the input text prompt into key semantic elements that need to be verified in the generated media. This includes identifying entities, their attributes, and the relationships between them.
For the running example, the prompt is broken down into keywords: Steaming, cup of coffee, croissant, table.
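To make this step concrete, here is a minimal sketch of what a decomposition call can look like with the Vertex AI SDK. The prompt template, model choice, and parsing below are illustrative assumptions, not Gecko's actual internals, which the service manages for you.

```python
from vertexai.generative_models import GenerativeModel

# Hypothetical decomposition prompt; Gecko's real template is managed by the service.
DECOMPOSE_PROMPT = """Break this image-generation prompt into its key semantic elements
(entities, attributes, relationships), one per line:

Prompt: {prompt}"""

model = GenerativeModel("gemini-1.5-pro")
response = model.generate_content(
    DECOMPOSE_PROMPT.format(prompt="steaming cup of coffee and a croissant on a table")
)

# Expected shape of the result: keywords like "steaming", "cup of coffee", "croissant", "table".
keywords = [line.strip() for line in response.text.splitlines() if line.strip()]
```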
Step 2: Question generation.
Based on the decomposed prompt, the Gemini model then generates a series of question-answer pairs. These questions are specifically designed to probe the generated image or video for the presence and accuracy of the identified elements and relationships. Optionally, Gemini can provide justifications for why a particular answer is correct, further enhancing transparency.
Let's take a look at the running example and generate question-answer pairs for each keyword. For the keyword Steaming, the question-answer pair is 'Is the cup of coffee steaming? ["yes", "no"]' with the ground-truth answer 'yes'.

Figure 2: Visualisation of the outputs from the semantic prompt decomposition and question-answer generation steps.
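
To make the structure of these question-answer records concrete, here is what the parsed output for the running example might look like. The field names below are assumptions for illustration; in the Vertex AI workflow shown later, a parsing function such as parse_json_to_qa_records produces the actual records.

```python
# Illustrative question-answer records for the running example.
# Field names are assumed for clarity; they are not the service's exact schema.
qa_records = [
    {
        "keyword": "Steaming",
        "question": "Is the cup of coffee steaming?",
        "choices": ["yes", "no"],
        "answer": "yes",
        "justification": "The prompt describes the coffee as steaming.",
    },
    {
        "keyword": "croissant",
        "question": "Is there a croissant?",
        "choices": ["yes", "no"],
        "answer": "yes",
        "justification": "The prompt mentions a croissant on the table.",
    },
]
```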

Step 3: Scoring.
Finally, the Gemini model scores the generated media against each question-answer pair. These individual scores are then aggregated to produce a final evaluation score.
For the running example, all questions were found to be correct, giving a perfect final score.

Figure 3: Visualisation of the outputs from the scoring step, giving scores for each question which are aggregated to give a final overall score.
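
As a rough sketch of how per-question verdicts roll up into a final score, consider the aggregation below. It uses a simple mean, which is an assumption for illustration, not necessarily the service's exact aggregation.

```python
# Hypothetical per-question verdicts from the rubric validator for the running example.
rubric_results = [
    {"question": "Is the cup of coffee steaming?", "verdict": "yes", "expected": "yes"},
    {"question": "Is there a cup of coffee?", "verdict": "yes", "expected": "yes"},
    {"question": "Is there a croissant?", "verdict": "yes", "expected": "yes"},
    {"question": "Are the items on a table?", "verdict": "yes", "expected": "yes"},
]

# Each question scores 1.0 when the autorater's verdict matches the ground truth, else 0.0;
# averaging gives the final score (1.0 here, matching the "perfect" result above).
per_question = [1.0 if r["verdict"] == r["expected"] else 0.0 for r in rubric_results]
final_score = sum(per_question) / len(per_question)
print(final_score)  # 1.0
```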

Evaluate with Gecko on Vertex AI
Gecko is now available via the Gen AI Evaluation Service in Vertex AI, empowering you to evaluate image or video generative models. Here’s how you can get started with Gecko evaluation for images and videos on Vertex AI:
First, you’ll need to set up configurations for both rubric generation and rubric validation.

```python
# Imports from the Vertex AI Gen AI Evaluation SDK (preview); the exact import
# path may vary with your SDK version.
from vertexai.preview.evaluation import EvalTask
from vertexai.preview.evaluation.metrics import (
    CustomOutputConfig,
    PointwiseMetric,
    RubricBasedMetric,
    RubricGenerationConfig,
)

# Rubric Generation: how Gemini decomposes prompts into question-answer rubrics.
rubric_generation_config = RubricGenerationConfig(
    prompt_template=RUBRIC_GENERATION_PROMPT,
    parsing_fn=parse_json_to_qa_records,
)

# Rubric Validation: a pointwise metric that checks generated media against each rubric.
pointwise_metric = PointwiseMetric(
    metric="gecko_metric",
    metric_prompt_template=RUBRIC_VALIDATOR_PROMPT,
    custom_output_config=CustomOutputConfig(
        return_raw_output=True,
        parsing_fn=parse_rubric_results,
    ),
)

# Rubric Metric: ties generation and validation together.
rubric_based_gecko = RubricBasedMetric(
    generation_config=rubric_generation_config,
    critique_metric=pointwise_metric,
)
```

Next, prepare your dataset for evaluation. This involves creating a Pandas DataFrame with columns for your prompts and the corresponding generated images or videos.

```python
import pandas as pd

prompts = [
    "steaming cup of coffee and a croissant on a table",
    "steaming cup of coffee and toast in a cafe",
    # … more prompts
]
images = [
    '{"contents": [{"parts": [{"file_data": {"mime_type": "image/png", "file_uri": "gs://cloud-samples-data/generative-ai/evaluation/images/coffee.png"}}]}]}',
    '{"contents": [{"parts": [{"file_data": {"mime_type": "image/png", "file_uri": "gs://cloud-samples-data/generative-ai/evaluation/images/coffee.png"}}]}]}',
    # … more image URIs
]
eval_dataset = pd.DataFrame(
    {
        "prompt": prompts,
        "image": images,  # or "video": videos for video evaluation
    }
)
```
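
For video evaluation the pattern is the same, with a "video" column in place of "image". A minimal sketch, where the bucket path and MIME type are placeholders:

```python
videos = [
    '{"contents": [{"parts": [{"file_data": {"mime_type": "video/mp4", "file_uri": "gs://your-bucket/videos/coffee.mp4"}}]}]}',
    # … more video URIs
]
eval_dataset = pd.DataFrame({"prompt": prompts[:len(videos)], "video": videos})
```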

Now, you can generate the rubrics based on your prompts using the configured rubric_based_gecko metric.

```python
# Decompose each prompt into question-answer rubrics with Gemini.
dataset_with_rubrics = rubric_based_gecko.generate_rubrics(eval_dataset)
```

Finally, run the evaluation using the generated rubrics and your dataset. The evaluate method of EvalTask will use the rubric validator to score the generated content.

```python
eval_task = EvalTask(
    dataset=dataset_with_rubrics,
    metrics=[rubric_based_gecko],
)
eval_result = eval_task.evaluate(response_column_name="image")  # or "video"
```

After the evaluation runs, you can compute and analyze the final scores to understand how well your generated content aligns with the detailed criteria derived from your prompts.
```python
import numpy as np

# compute_scores is a helper from the accompanying evaluation Colabs; it aggregates
# per-question rubric results into a final score per row.
dataset_with_final_scores = compute_scores(eval_result.metrics_table)
np.mean(dataset_with_final_scores["final_score"])
```

The Vertex AI Gen AI evaluation service offers summary and metrics tables, providing detailed insights into evaluation performance. Beyond that, for Gecko you will also see the category or concept each question probes, along with the score the generated image or video achieved for that category. For example, "is the cat grey?" is a question that falls under the question category "color".
Access to these granular evaluation results enables you to create meaningful visualizations of model performance across the various criteria, including bar and radar charts like the one below:

Figure 4: Visualisation of the aggregate performance of the generated media for various categories/criteria
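
A minimal sketch of such a chart, assuming you have flattened the metrics table into question-level rows; the column names here are assumptions, so inspect eval_result.metrics_table for the real ones:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical question-level results; in practice, derive these from eval_result.metrics_table.
question_scores = pd.DataFrame(
    {
        "category": ["entity", "entity", "action", "relation", "color"],
        "score": [1.0, 1.0, 1.0, 0.0, 1.0],
    }
)

# Mean score per question category, rendered as a bar chart.
per_category = question_scores.groupby("category")["score"].mean()
per_category.plot(kind="bar", ylim=(0, 1), ylabel="mean score",
                  title="Gecko scores by question category")
plt.tight_layout()
plt.show()
```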

With Gecko on Vertex AI, you gain access to a robust framework for assessing a model's capabilities in finer detail. You can refer to the text-to-image and text-to-video evaluation Colabs to get first-hand experience today.

AI Summary and Description: Yes

Summary: The text discusses the introduction of Gecko, an advanced evaluation tool for generative AI models, specifically within Google Cloud’s Vertex AI. Gecko provides a customizable and interpretable way to assess image and video generation capabilities, addressing traditional evaluation challenges by enhancing transparency and reducing reliance on human evaluators.

Detailed Description:

– **Overview of Generative AI Models**: The text highlights the rapid advancements in generative AI, specifically mentioning models like Lyria, Imagen, and Veo, which can produce high-quality images and videos from textual prompts. This innovation presents a challenge in the evaluation process.

– **Challenges in Evaluation**:
– Traditional human evaluations, although high-quality, are slow and costly.
– The need for efficient and interpretable auto-raters emerges as the quality of generated outputs improves. Simple scores cannot reveal the reasons behind a model’s performance.

– **Introduction of Gecko**:
– Gecko is presented as a sophisticated auto-rater designed to assess generative AI models with a focus on interpretability and performance.
– It uses a rubric-based approach to evaluation, which is customizable and designed to generate detailed insights.

– **Evaluation Process with Gecko**:
– The evaluation consists of several steps:
– **Semantic Prompt Decomposition**: Splitting the input prompt into essential elements (e.g., identifying objects and their attributes).
– **Question Generation**: Creating probing questions regarding the presence and accuracy of the identified elements.
– **Scoring**: The model scores the generated media based on the question-answer pairs generated in the previous step and aggregates these to provide a final evaluation score.

– **Technical Implementation**:
– The text includes code snippets demonstrating how to set up rubric configurations, prepare datasets for evaluation, and generate rubrics using Gecko.
– It discusses the automation of generating question-answer pairs using a Gemini model, which ensures that evaluation is thorough and precise.
– Finally, it outlines how to analyze the results, providing insights into model performance across various criteria.

– **Use Cases**: Gecko can be particularly beneficial for developers and researchers working with generative AI as it provides a framework to improve model evaluation efficiently without extensive human input.

– **Practical Implications**: The availability of such a tool within Google Cloud’s Vertex AI allows practitioners in AI and machine learning to leverage it for enhanced model assessment, ultimately leading to more efficient workflows and better-quality outputs.

In conclusion, Gecko is a significant advancement for AI workflows, particularly in evaluating generative models, which poses practical implications for professionals in AI development, evaluation, and deployment.