Source URL: https://cloud.google.com/blog/products/data-analytics/new-objectref-data-type-brings-unstructured-data-into-bigquery/
Source: Cloud Blog
Title: Introducing BigQuery ObjectRef: Supercharge your multimodal data and AI processing
Feedly Summary: Traditional data warehouses simply can’t keep up with today’s analytics workloads, because most data generated today is unstructured and multimodal (documents, audio files, images, and videos). Given the complexity of cleaning and transforming unstructured data, organizations have historically had to maintain siloed data pipelines for unstructured and structured data, and for analytics and AI/ML use cases. Between these fragmented data platforms, data access restrictions, slow consumption, and outdated information, enterprises struggle to unlock the full potential of their data. The same issues hinder AI initiatives.
Today we’re introducing a new data type, ObjectRef, now in preview in BigQuery, that represents a reference to any object in Cloud Storage with a URI and additional metadata. ObjectRef complements Object Tables, read-only tables over unstructured data objects in Cloud Storage, to integrate unstructured data like images and audio into existing BigQuery tables. The ObjectRef data type removes fragmentation in data processing and access control, providing a unified, multimodal, and governed way to process all modalities of data. You can process unstructured data with large language models (LLMs), ML models, and open-source Python libraries using the same SQL or Python scripts that process tabular data. You can also store structured and unstructured data in the same row throughout different data engineering stages (extract, load, transform a.k.a. ELT), and govern it using a similar access control model.
For example, to answer the question “of the customers who complained about performance issues during interactions last month, show me the top 10 by revenue,” you would need to perform natural language processing (NLP) on audio calls, emails, and online chat transcripts to normalize the data, identify whether each interaction discussed “performance issues,” and detect whether the customer complained. For each of these steps, you would need to decide how to build a pipeline over data in Cloud Storage, run AI/ML models on the data, and host those models (e.g., on Compute Engine, Google Kubernetes Engine, or Vertex AI). The normalized and extracted data would then need to be saved in structured format (e.g., in a BigQuery table) and joined with each customer’s revenue data.
With the launch of ObjectRef, you can now answer this question with a simple SQL query. Suppose you’ve combined call center audio files and agent chat text into one BigQuery table, customer_interactions, with columns (1) audio_ref of type ObjectRef and (2) chat of type STRING. Filtering for customers who complained about performance issues is then as easy as adding one more condition to the WHERE clause.
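The post’s query itself isn’t reproduced in this extract, but here is a minimal sketch of what it could look like. The customer_revenue table, interaction_date column, and connection name are illustrative assumptions, and AI.GENERATE_BOOL (one of the scalar functions introduced below) is used as one plausible way to express the complaint filter:

-- Hypothetical sketch: top 10 complaining customers by revenue.
-- Only customer_interactions, audio_ref, and chat come from the text
-- above; customer_revenue, interaction_date, and the connection name
-- are assumptions.
SELECT i.customer_id, r.revenue
FROM customer_interactions AS i
JOIN customer_revenue AS r USING (customer_id)
WHERE i.interaction_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH)
  AND AI.GENERATE_BOOL(
        prompt => ("Did this customer complain about performance issues?",
                   OBJ.GET_ACCESS_URL(i.audio_ref, "r"), i.chat),
        connection_id => "analysis.US.gemini-connection",
        endpoint => "gemini-2.0-flash").result
ORDER BY r.revenue DESC
LIMIT 10;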
BigQuery with ObjectRef unlocks unique platform capabilities across data and AI:
Multimodality: Natively handle structured (tabular) data, unstructured data, and a combination of the two, in a single table via ObjectRef. Now, you can build multimodal ELT data pipelines to process both structured and unstructured data.
Full SQL and Python support: Use your favorite language without worrying about interoperability. If it works in SQL, it works in Python (via BigQuery DataFrames), and vice versa. Object transformations, saving transformed objects back to Cloud Storage, and any other aggregations or filtering, can all be done in one SQL or Python script.
Gen-AI-ready, serverless, and auto-scaled data processing: Spend more time building your data pipelines, not managing infrastructure. Process unstructured data with LLMs, or use serverless Python UDFs with your favorite open-source library. Create embeddings, generate summaries using a prompt, use a BigQuery table as an input to Vertex AI jobs, and much more.
Unified governance and access control: Use familiar BigQuery governance features such as fine-grained access control, data masking, and connection-delegated access on unstructured data. There is no need to manage siloed governance models for structured versus unstructured data.
ObjectRef in action
Let’s take a closer look at how to use the ObjectRef data type.
What is an ObjectRef?
First, it helps to understand ObjectRef under the hood. Simply put, ObjectRef is a STRUCT containing object storage and access control metadata. With this launch, when you create an Object Table, it is populated with a new ObjectRef column named ref.
struct {
  uri string,
  authorizer string,
  version string,
  details json {
    gcs_metadata json
  }
}
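For a concrete feel, you can also construct an ObjectRef by hand and inspect its fields. In this sketch, the Cloud Storage URI and the authorizer (a BigQuery connection) are placeholders:

-- Hypothetical: build an ObjectRef manually with OBJ.MAKE_REF and read
-- its fields. The URI and connection name are placeholders.
SELECT ref.uri, ref.authorizer
FROM (
  SELECT OBJ.MAKE_REF('gs://my-bucket/call-123.mp3',
                      'my-project.us.my-connection') AS ref
);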
Create a BigQuery table with ObjectRefs
Imagine a call center that stores structured information in a standard BigQuery table ingestion.sessions, and call audio in a Cloud Storage bucket, with a BigQuery Object Table ingestion.audios created on that bucket. While this example is based on audio, ObjectRefs can also represent images, documents, and videos.
With ObjectRef, you can join these two tables on the sessions.recording_id and audios.uri columns to create a single BigQuery table. The new table contains an audio column of type ObjectRef, populated from the ref column of the ingestion.audios table.
CREATE OR REPLACE TABLE analysis.sessions
AS
SELECT sessions.session_id, sessions.date, sessions.customer_id, object_table.ref AS audio
FROM ingestion.sessions INNER JOIN ingestion.audios object_table
ON object_table.uri = sessions.recording_id;
Capturing the object version allows BigQuery zero-copy snapshots and clones of analysis.sessions to be reproducible and consistent across structured and unstructured data. This allows reproducibility in downstream applications such as ML training and LLM fine-tuning.
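For instance, snapshotting the combined table uses standard BigQuery snapshot syntax (a minimal sketch; the snapshot name is a placeholder):

-- Snapshot the combined table. Because each ObjectRef captures the
-- object version, the snapshot remains consistent even if the
-- underlying Cloud Storage objects are later overwritten.
CREATE SNAPSHOT TABLE analysis.sessions_snapshot
CLONE analysis.sessions;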
Being a STRUCT, ObjectRef also supports nesting in ARRAY. The main audio file represented by Audio can be chunked (for example, into segments per agent ID), and the resulting objects represented in a new column Chunked of type ARRAY<ObjectRef>. This preserves the order of chunks, and stores them alongside the main audio file in the same row. This data transformation lets you report the number of agent handoffs per call and further analyze each call segment separately.
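As an illustration, once the Chunked column exists, reporting handoffs is a plain array operation (a sketch; ARRAY_LENGTH is standard BigQuery SQL and the column names follow the example above):

-- Hypothetical: number of audio segments (and hence agent handoffs)
-- per call, using the chunked ARRAY<ObjectRef> column described above.
SELECT session_id, ARRAY_LENGTH(chunked) AS num_segments
FROM analysis.sessions
WHERE chunked IS NOT NULL;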
Process using serverless Python
With Python UDF integration, you can bring your favorite open-source Python library to BigQuery as a user-defined function (UDF). Easily derive structured or unstructured data from a source ObjectRef and store the results in the same row.
The new function OBJ.GET_ACCESS_URL(ref ObjectRef, mode STRING) -> ObjectRefRuntime enables delegated access to the object in Cloud Storage. ObjectRefRuntime provides signed URLs to read and write data, letting you manage governance and access control entirely in BigQuery and removing the need to manage separate Cloud Storage access controls.
Serverless Python use case 1: Multimodal data to structured data
For example, imagine you want to get the duration of every audio file in the analysis.sessions table. Assume that a Python UDF analysis.GET_DURATION(object_ref_runtime_json STRING) -> INT has already been registered in BigQuery. GET_DURATION uses signed URLs from ObjectRefRuntime to read the Cloud Storage bytes.
-- Object is passed to Python UDF using read-only signed URLs
SELECT analysis.GET_DURATION(TO_JSON_STRING(OBJ.GET_ACCESS_URL(audio, "R"))) AS duration
FROM analysis.sessions
WHERE audio IS NOT NULL
import bigframes.pandas as bpd

df = bpd.read_gbq("analysis.sessions")
func = bpd.read_gbq_function("analysis.get_duration")
# Object is passed to Python UDF using read-only signed URLs
df["duration"] = df["audio"].blob.get_runtime_json_str(mode="R").apply(func).cache()  # cache to execute
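The post doesn’t show GET_DURATION’s implementation. Below is a minimal sketch of what such a UDF body could look like; the ObjectRefRuntime JSON field names (access_urls, read_url) are assumptions about its payload, and mutagen is just one example open-source library:

import io
import json
import urllib.request

import mutagen  # example open-source audio library

def get_duration(object_ref_runtime_json: str) -> int:
    # ObjectRefRuntime carries signed URLs; the field names used here
    # ("access_urls", "read_url") are assumptions about its JSON layout.
    runtime = json.loads(object_ref_runtime_json)
    signed_url = runtime["access_urls"]["read_url"]
    # Read the object bytes via the read-only signed URL.
    with urllib.request.urlopen(signed_url) as resp:
        audio = mutagen.File(io.BytesIO(resp.read()))
    return int(audio.info.length)  # duration in seconds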
Serverless Python use case 2: Multimodal data to processed multimodal data
As another example, here’s how to remove noise from every audio file in the analysis.sessions table, assuming that a Python UDF analysis.DENOISE(src_object_ref_runtime_json STRING, dst_object_ref_runtime_json STRING) -> object_ref_runtime_json STRING has already been registered in BigQuery. This function reads the source audio, writes the noise-removed audio to Cloud Storage, and returns a reference to the new audio file.
ObjectRefRuntime provides signed URLs for reading and writing object bytes.
SELECT analysis.DENOISE(
  -- Source is accessed using read-only signed URL
  TO_JSON_STRING(OBJ.GET_ACCESS_URL(audio, "R")),
  -- Destination is written using read-write signed URL with prefix "denoised-"
  TO_JSON_STRING(OBJ.GET_ACCESS_URL(
    OBJ.MAKE_REF(
      CONCAT("denoised-", audio.uri), audio.authorizer),
    "RW")))
FROM analysis.sessions
WHERE audio IS NOT NULL
import bigframes.pandas as bpd

df = bpd.read_gbq("analysis.sessions")

df["denoised"] = ("denoised-" + df["audio"].blob.uri()).str.to_blob()
func_df = df[["audio", "denoised"]]

func = bpd.read_gbq_function("analysis.denoise")
# Source is accessed using read-only signed URL
func_df["audio"] = func_df["audio"].blob.get_runtime_json_str("R")
# Destination is written using read-write signed URL with prefix "denoised-"
func_df["denoised"] = func_df["denoised"].blob.get_runtime_json_str("RW")
func_df.apply(func, axis=1).cache()  # cache to execute
Process using Gemini and BigQuery ML
All BigQuery ML generative AI functions, such as AI.GENERATE, ML.GENERATE_TEXT, and ML.GENERATE_EMBEDDING, now support ObjectRefs as first-class citizens. This enables a number of use cases.
BQML use case 1: Multimodal inference using Gemini
You can now pass multiple ObjectRefs in the same Gemini prompt for inference.
Here, you can use Gemini to evaluate noise-removal quality by comparing the original audio file with the noise-removed audio file. This script assumes the noise-removed audio file’s ObjectRef is already stored in the Denoised column.
SELECT AI.GENERATE(
  prompt => ("Compare original audio file to audio file with noise removed, and output quality of noise removal as either good or bad. Original audio is", OBJ.GET_ACCESS_URL(audio, "r"), "and noise removed audio is", OBJ.GET_ACCESS_URL(denoised, "r")),
  -- BQ connection with permission to call Gemini
  connection_id => "analysis.US.gemini-connection",
  endpoint => "gemini-2.0-flash"
).result
FROM analysis.sessions WHERE audio IS NOT NULL AND denoised IS NOT NULL;
import bigframes.pandas as bpd
from bigframes.ml import llm

gemini = llm.GeminiTextGenerator(model_name="gemini-2.0-flash", connection_name="analysis.US.gemini-connection")
df = bpd.read_gbq("analysis.sessions")
result = gemini.predict(df, prompt=["Compare original audio file to audio file with noise removed, and output quality of noise removal as either good or bad. Original audio is", df["audio"], "and denoised audio is", df["denoised"]])
result[["ml_generate_text_llm_result"]]
As another example, here’s how to transcribe the Audio file using Gemini.
SELECT AI.GENERATE(
  prompt => ("Transcribe this audio file", OBJ.GET_ACCESS_URL(audio, "r")),
  -- BQ connection with permission to call Gemini
  connection_id => "analysis.US.gemini-connection",
  endpoint => "gemini-2.0-flash").result AS transcript
FROM analysis.sessions
WHERE audio IS NOT NULL
import bigframes.pandas as bpd
from bigframes.ml import llm

gemini = llm.GeminiTextGenerator(model_name="gemini-2.0-flash", connection_name="analysis.US.gemini-connection")
df = bpd.read_gbq("analysis.sessions")
result = gemini.predict(df, prompt=["Transcribe this audio file", df["audio"]])
result[["ml_generate_text_llm_result"]]
With BQML + Gemini, you can also generate structured or semi-structured results from multimodal inference. For example, you can perform speaker diarization on the Audio file using Gemini to identify the operator vs. the customer.
SELECT AI.GENERATE(
  prompt => ("Generate audio diarization for this interview. Use JSON format for the output, with the following keys: speaker, transcription. If you can classify the speaker as customer vs operator, please do. If not, use speaker A, speaker B, etc.", OBJ.GET_ACCESS_URL(audio, "r")),
  -- BQ connection with permission to call Gemini
  connection_id => "analysis.US.gemini-connection",
  endpoint => "gemini-2.0-flash").result AS diarized_json
FROM analysis.sessions
WHERE audio IS NOT NULL;
import bigframes.pandas as bpd
from bigframes.ml import llm

gemini = llm.GeminiTextGenerator(model_name="gemini-2.0-flash", connection_name="analysis.US.gemini-connection")
df = bpd.read_gbq("analysis.sessions")
result = gemini.predict(df, prompt=["Generate audio diarization for this interview. Use JSON format for the output, with the following keys: speaker, transcription. If you can classify the speaker as customer vs operator, please do. If not, use speaker A, speaker B, etc.", df["audio"]])
result[["ml_generate_text_llm_result"]]
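To make those semi-structured results queryable, the JSON can be flattened with standard BigQuery JSON functions. A sketch, assuming the diarization output was materialized to a hypothetical table analysis.diarized_sessions and that the model returned a clean JSON array of {speaker, transcription} objects:

-- Hypothetical follow-up: flatten the diarization JSON into rows.
-- analysis.diarized_sessions is an assumed table holding the
-- diarized_json strings produced by the query above.
SELECT
  session_id,
  JSON_VALUE(turn, '$.speaker') AS speaker,
  JSON_VALUE(turn, '$.transcription') AS transcription
FROM analysis.diarized_sessions,
UNNEST(JSON_EXTRACT_ARRAY(diarized_json)) AS turn;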
BQML use case 2: Multimodal embeddings using Gemini
With ML.GENERATE_EMBEDDING support, you can use ObjectRefs with text embedding and multimodal embedding models to create vector indexes and power RAG workflows that ground LLMs.
Assume we have an Object Table ingestion.images with the ref column containing image ObjectRefs.
CREATE OR REPLACE MODEL `ingestion.multimodal_embedding_model`
REMOTE WITH CONNECTION `ingestion.US.gemini-connection`
OPTIONS (ENDPOINT = 'multimodalembedding@001');

SELECT ref, ml_generate_embedding_result AS embedding
FROM ML.GENERATE_EMBEDDING(
  MODEL `ingestion.multimodal_embedding_model`,
  (
    SELECT OBJ.GET_ACCESS_URL(ref, 'r') AS content, ref
    FROM ingestion.images
  ),
  STRUCT (256 AS output_dimensionality)
);
import bigframes.pandas as bpd
from bigframes.ml import llm

embed_model = llm.MultimodalEmbeddingGenerator(model_name="multimodalembedding@001", connection_name="ingestion.US.gemini-connection")
df = bpd.read_gbq("ingestion.images")
embeddings = embed_model.predict(df["ref"])
embeddings
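From here, the embeddings can feed a vector index and similarity search. A sketch using BigQuery’s vector search syntax; the ingestion.image_embeddings table is an assumed materialization of the embedding query above, with columns (ref, embedding):

-- Hypothetical: index the stored embeddings, then retrieve the images
-- most similar to a text query.
CREATE VECTOR INDEX image_embedding_idx
ON ingestion.image_embeddings(embedding)
OPTIONS (index_type = 'IVF', distance_type = 'COSINE');

SELECT base.ref, distance
FROM VECTOR_SEARCH(
  TABLE ingestion.image_embeddings, 'embedding',
  (
    SELECT ml_generate_embedding_result AS embedding
    FROM ML.GENERATE_EMBEDDING(
      MODEL `ingestion.multimodal_embedding_model`,
      (SELECT 'photos of damaged packaging' AS content),
      STRUCT (256 AS output_dimensionality))
  ),
  top_k => 5);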
What’s new
To summarize, here’s a list of all the new capabilities for performing analytics on unstructured and/or multimodal data using BigQuery:
New types and functions for handling multimodal data (documents, audio files, images, and videos):
ObjectRef and ObjectRefRuntime types along with new functions: OBJ.MAKE_REF, OBJ.GET_ACCESS_URL and OBJ.FETCH_METADATA
Object Table enhancements:
Scalability: Object Tables now support consistent views of Cloud Storage buckets, scaling 5x from 65M to 300M+ objects per table, and ingesting up to 1M object changes per hour per table
Interop with ObjectRef: New ref column provides pre-constructed ObjectRefs directly from Object Tables
BQML Gen-AI multimodal capabilities:
Support for multimodal inference in the TVFs ML.GENERATE_TEXT and AI.GENERATE_TABLE, and in scalar functions such as AI.GENERATE and AI.GENERATE_BOOL, by encapsulating multiple objects in the same prompt for Gemini using ObjectRef. Objects can be sourced from different columns and from complex types such as arrays.
Support for embedding ObjectRefs via the ML.GENERATE_EMBEDDING function
BigQuery DataFrames multimodal dataframe support:
An extension of the pandas-like dataframe that includes unstructured data (powered by ObjectRef) as just another column
Wrangle, process, and filter mixed-modality data with the familiarity of dataframe operations
Special transformers for unstructured data, such as chunking, image processing, and transcription, made available through server-side processing functions and BQML
Python UDF support:
Leverage the rich Python library ecosystem for advanced unstructured data manipulation in a fully managed, serverless experience with BigQuery governance
Get started today
ObjectRef is now in preview. Follow these simple steps to get started:
Watch to learn more: see live demos from Cloud Next on unifying unstructured and structured data, and on generating text with LLMs and performing vector search.
Learn by doing: try out ObjectRefs with the multimodal data tutorial, available in both SQL and Python versions.
Build your use case: locate the Cloud Storage bucket containing the unstructured data you want to analyze, then create an Object Table or set up automatic Cloud Storage discovery to pull this data into BigQuery. The Object Table will contain a column of ObjectRefs, and you’re ready to start transforming the data.
AI Summary and Description: Yes
Summary: The introduction of the new ObjectRef data type in BigQuery enables organizations to effectively unify and govern structured and unstructured data, addressing the challenges of siloed data pipelines. This innovation facilitates the integration of multimodal data types into analytics workflows, significantly enhancing the potential for AI and machine learning applications.
Detailed Description:
The text discusses a significant development in data management and analytics, particularly within Google Cloud’s BigQuery platform. The introduction of ObjectRef addresses longstanding issues companies face when dealing with both structured and unstructured data, making it crucial for professionals in AI, cloud computing, and data analytics. The ObjectRef data type is critical for advancing the capabilities of data processing and governance, particularly in an era where enterprises increasingly rely on diverse data types.
Key Insights and Major Points:
– **Unified Data Processing**:
– ObjectRef allows for a new way to refer to any object in Cloud Storage with rich metadata, enabling seamless integration of unstructured data (images, audio, video) with structured data.
– **Extensive Compatibility**:
– Tools and scripts designed to process structured data can now handle unstructured data without significant modifications, supporting both SQL and Python.
– **Enhanced Analytics Capability**:
– Users can perform complex data analysis using natural language processing (NLP) techniques over audio calls and chat transcripts directly in BigQuery, greatly simplifying the data preparation process.
– **Multimodal Data Management**:
– Organizations can manage multimodal datasets (structured and unstructured combined) in a single environment, making it easier to build data pipelines and perform advanced analytics.
– **Governance and Access Control**:
– ObjectRef offers a unified approach to data governance, allowing the application of existing BigQuery access controls to unstructured data, removing the need for separate governance models.
– **Serverless Processing**:
– The architecture allows for serverless processing capabilities with auto-scaling features, which reduces operational overhead for organizations focusing on analytics rather than infrastructure management.
– **Application of Generative AI**:
– The integration of BigQuery with generative AI models (Gemini) supports sophisticated applications such as audio transcription, performance assessment, and more, further emphasizing the utility of ObjectRef in AI initiatives.
– **Practical Use Cases**:
– New use cases include multimodal inference, generation of multimodal embeddings, and support for processing complex structures such as JSON for audio diarization tasks.
– **Getting Started**:
– The launch of ObjectRef is in preview, and organizations are encouraged to explore its capabilities through live demos and tutorials available from Google Cloud.
Overall, the ObjectRef data type represents a transformative capability that aligns with modern data analytics needs and empowers organizations to maximize the value of their data assets effectively while ensuring robust governance and compliance.