Cloud Blog: Introducing AI.GENERATE_TABLE: creating structured data from gen AI models in BigQuery

Source URL: https://cloud.google.com/blog/products/data-analytics/convert-ai-generated-unstructured-data-to-a-bigquery-table/
Source: Cloud Blog
Title: Introducing AI.GENERATE_TABLE: creating structured data from gen AI models in BigQuery

Feedly Summary: The explosion of digital content from social media, smartphones, and other sources has created a massive amount of unstructured data like images, videos, and documents. To help you analyze this data, BigQuery is connected with Vertex AI, Google Cloud’s powerful AI platform, so you can use advanced AI models, like Gemini 2.5 Pro/Flash, to understand the meaning hidden within your unstructured data.
Google’s advanced AI models can analyze a wide range of data formats, from text and images to audio and video. They can extract key information like names, dates, and keywords, transforming raw data into structured insights that integrate with your existing tools. Plus, with new techniques like constrained decoding, these models can even generate structured data in JSON format, helping to ensure compatibility with your workflows.
To further streamline this process, we recently added a new BigQuery feature called AI.GENERATE_TABLE(), which builds upon the capabilities of ML.GENERATE_TEXT(). This function allows you to automatically convert the insights from your unstructured data into a structured table within BigQuery, based on the provided prompt and table schema. This streamlined process allows you to easily analyze the extracted information using your existing data analysis tools.
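For context, a minimal ML.GENERATE_TEXT() call looks like the sketch below (the model name and prompt are placeholders, not from this post). It returns the model's reply as a single text column that you would still need to parse yourself, whereas AI.GENERATE_TABLE() maps the response directly onto typed output columns defined by a schema.

-- Minimal sketch for comparison only; the model name and prompt are placeholders.
-- ML.GENERATE_TEXT returns the generated text as one column (flattened here to
-- ml_generate_text_llm_result), leaving any structuring or parsing to you.
SELECT
  ml_generate_text_llm_result
FROM
  ML.GENERATE_TEXT(
    MODEL bqml_tutorial.gemini_model,
    (SELECT "Describe the city shown in this photo as JSON." AS prompt),
    STRUCT(1024 AS max_output_tokens, TRUE AS flatten_json_output))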


Extracting structured data from images
Let’s dive deeper into how this new feature works with an example that uses three images. First, you have a picture of the Seattle skyline featuring the iconic Space Needle. Next, you have a city view of New York City. Finally, you have an image of cookies and flowers, which is unrelated to cityscapes.

To use these images with BigQuery’s generative AI functions, you first need to make them accessible to BigQuery. You can do this by creating an external object table named "image_dataset" that connects to the Google Cloud Storage bucket where the images are stored.

CREATE OR REPLACE EXTERNAL TABLE
  bqml_tutorial.image_dataset
WITH CONNECTION DEFAULT
OPTIONS(object_metadata="DIRECTORY",
  uris=["gs://bqml-tutorial-bucket/images/*"])

Now that you’ve prepared your image data, let’s connect to the powerful Gemini 2.5 Flash model. You do this by creating a "remote model" within BigQuery, which acts as a bridge to this advanced AI.

CREATE OR REPLACE MODEL
  bqml_tutorial.gemini25flash001
REMOTE WITH CONNECTION DEFAULT
OPTIONS (endpoint = "gemini-2.5-flash-001")

Now, let’s use the AI.GENERATE_TABLE() function to analyze the images. You’ll need to provide the function with two things: the remote model you created (connected to Gemini 2.5 Flash) and the table containing your images.
You’ll ask the model to "Recognize the city from the picture and output its name, belonging state, brief history, and tourist attractions. Please output nothing if the image is not a city." To ensure the results are organized and easy to use, you’ll specify a structured output format with the following fields:

city_name (string)

state (string)

brief_history (string)

attractions (array of strings)

This format, known as a schema, ensures the output is consistent and compatible with other BigQuery tools. You’ll notice that the syntax for defining this schema is the same as the CREATE TABLE command in BigQuery.
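As an illustration of that point, the same four fields written as a regular CREATE TABLE statement would look like the sketch below; the table name is hypothetical and shown only to compare the column-definition syntax with the output_schema string in the query that follows.

-- Hypothetical table, for comparison only: the column definitions use the same
-- syntax as the output_schema string passed to AI.GENERATE_TABLE below.
CREATE TABLE bqml_tutorial.city_info (
  city_name STRING,
  state STRING,
  brief_history STRING,
  attractions ARRAY<STRING>
)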

SELECT
  city_name,
  state,
  brief_history,
  attractions,
  uri
FROM
  AI.GENERATE_TABLE( MODEL bqml_tutorial.gemini25flash001,
    (
    SELECT
      ("Recognize the city from the picture and output its name, belonging state, brief history, and tourist attractions. Please output nothing if the image is not a city.", ref) AS prompt,
      uri
    FROM
      bqml_tutorial.image_dataset),
    STRUCT( "city_name STRING, state STRING, brief_history STRING, attractions ARRAY<STRING>" AS output_schema,
      8192 AS max_output_tokens))

When you run the AI.GENERATE_TABLE() function, it produces a table with five columns. Four of these columns match the schema you defined (city_name, state, brief_history, and attractions), while the fifth column contains the image URI from the input table.
As you can see, the model successfully identified the cities in the first two images, providing their names and the states in which they are found. It even generated a brief history and a list of attractions for each city based on its internal knowledge. This demonstrates the power of large language models to extract information and insights directly from images.
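Because attractions is a typed ARRAY<STRING> column, the output works with ordinary BigQuery operations. As a minimal sketch, assuming you first saved the results of the query above into a hypothetical table named bqml_tutorial.city_info, you could flatten the array for downstream analysis:

-- Sketch: flatten the attractions array so each attraction gets its own row.
-- bqml_tutorial.city_info is a hypothetical table holding the results above.
SELECT
  city_name,
  attraction
FROM
  bqml_tutorial.city_info,
  UNNEST(attractions) AS attraction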

Extracting structured data from medical transcriptions
Now let’s look at another example, where you can use AI.GENERATE_TABLE to extract information from unstructured data stored in a BigQuery managed table. We are going to use the Kaggle Medical Transcriptions dataset, which contains sample medical transcriptions from various specialties.
These transcriptions are long and verbose, containing all kinds of information, e.g. a patient’s age, weight, blood pressure, and conditions. Processing them manually and organizing the results is challenging and time-consuming, but we can let the LLM and AI.GENERATE_TABLE do that work for us.
Suppose you need the following information:

age (int64)

blood_pressure (struct<high int64, low int64>)

weight (float64)

conditions (array of strings)

diagnosis (array of strings)

medications (array of strings)

We can come up with this SQL query:

SELECT
  age,
  blood_pressure,
  weight,
  conditions,
  diagnosis,
  medications,
  prompt
FROM
  AI.GENERATE_TABLE(MODEL bqml_tutorial.gemini25flash001,
    (
    SELECT
      input_text AS prompt
    FROM
      bqml_tutorial.kaggle_medical_transcriptions
    LIMIT
      3),
    STRUCT(
      "age INT64, blood_pressure STRUCT<high INT64, low INT64>, weight FLOAT64, conditions ARRAY<STRING>, diagnosis ARRAY<STRING>, medications ARRAY<STRING>" AS output_schema,
      1024 AS max_output_tokens))

You can see that the model successfully extracted the information from the medical transcriptions, and AI.GENERATE_TABLE organized the results according to the specified schema.
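Because the output columns are strongly typed, you can persist and query them like any other BigQuery table. Here is a minimal sketch, assuming a hypothetical destination table name: it wraps the query above in CREATE TABLE AS SELECT and then filters on the extracted fields.

-- Sketch: persist the extracted fields (hypothetical destination table name).
CREATE OR REPLACE TABLE bqml_tutorial.structured_transcriptions AS
SELECT
  age,
  blood_pressure,
  weight,
  conditions,
  diagnosis,
  medications,
  prompt
FROM
  AI.GENERATE_TABLE(MODEL bqml_tutorial.gemini25flash001,
    (
    SELECT
      input_text AS prompt
    FROM
      bqml_tutorial.kaggle_medical_transcriptions
    LIMIT
      3),
    STRUCT(
      "age INT64, blood_pressure STRUCT<high INT64, low INT64>, weight FLOAT64, conditions ARRAY<STRING>, diagnosis ARRAY<STRING>, medications ARRAY<STRING>" AS output_schema,
      1024 AS max_output_tokens));

-- Example downstream query on the typed columns: records whose extracted
-- conditions mention hypertension.
SELECT
  age,
  blood_pressure.high AS systolic,
  blood_pressure.low AS diastolic
FROM
  bqml_tutorial.structured_transcriptions
WHERE
  EXISTS (SELECT 1 FROM UNNEST(conditions) AS c WHERE LOWER(c) LIKE '%hypertension%')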

The AI.GENERATE_TABLE() function can help you transform your data and create a BigQuery table for easy analysis and integration with your existing workflows. To learn more about the full syntax, refer to the documentation. Have feedback on these new features or have additional feature requests? Let us know at bqml-feedback@google.com.

AI Summary and Description: Yes

**Summary:**
The text outlines the capabilities of Google’s BigQuery integrated with Vertex AI for processing and analyzing unstructured data, particularly through its AI.GENERATE_TABLE() function. This feature enables users to extract structured insights from various data forms, including images and medical transcriptions, utilizing advanced AI models like Gemini 2.5.

**Detailed Description:**
The document presents Google’s latest advancements in using AI to handle unstructured datasets by integrating BigQuery with Vertex AI. This effort is particularly significant as it enhances the ability to parse and gain insights from a plethora of data types, which is crucial for professionals dealing with large volumes of unstructured data across AI, cloud, and infrastructure sectors.

– **Key Features:**
– **AI Models and Data Formats**: Google’s AI models can analyze multiple data formats (text, images, audio, video) to extract key structured information (e.g., names, dates, keywords).
– **AI.GENERATE_TABLE() Function**: This new feature allows for automatic transformation of unstructured data insights into structured tables in BigQuery, simplifying analysis.
– **Image Analysis Example**: The procedure of utilizing images with generative AI functions involves creating a connection to Google Cloud Storage and integrating with the Gemini 2.5 Flash model to extract city names and related information.
– **Medical Transcription Example**: The use of AI.GENERATE_TABLE() in medical contexts highlights its ability to process complex transcription data to output structured details such as age, blood pressure, and conditions.

– **Implications for Security and Compliance**:
– As organizations increasingly rely on AI to analyze sensitive data (medical records), it is vital to consider data privacy and security implications.
– Compliance with relevant regulations must be ensured when handling personal data, which may necessitate implementing robust security measures and governance protocols.

The introduction of these features is critical for businesses that need efficient data processing capabilities, especially in the context of AI’s growing role in analytics. This advancement not only improves data handling but also raises considerations for compliance and privacy, as handling sensitive information requires careful alignment with regulatory frameworks.