Cloud Blog: Introducing BigQuery metastore, a unified metadata service with Apache Iceberg support

Source URL: https://cloud.google.com/blog/products/data-analytics/introducing-bigquery-metastore-fully-managed-metadata-service/
Source: Cloud Blog
Title: Introducing BigQuery metastore, a unified metadata service with Apache Iceberg support

Feedly Summary: Does your organization use multiple data processing engines like BigQuery, Apache Spark, Apache Flink and Apache Hive? Wouldn’t it be great if you could provide a single source of truth for all of your analytics workloads? Now you can, with the public preview of BigQuery metastore, a fully managed, unified metadata service that provides processing engine interoperability while enabling consistent data governance. 
BigQuery metastore is a highly scalable runtime metadata service that works with multiple engines, for example, BigQuery, Apache Spark, Apache Hive and Apache Flink, and supports the open Apache Iceberg table format. This allows analytics engines to query one copy of the data with a single schema, whether the data is stored in BigQuery storage tables, BigQuery tables for Apache Iceberg, or BigLake external tables. BigQuery metastore serves as a critical component for customers looking to migrate and modernize from legacy data lakes to a modern lakehouse architecture. Integrated deeply with BigQuery’s enterprise capabilities, this solution provides built-in security and governance for user interactions with data.
The challenges of metadata management
Traditionally, metastores and other metadata management systems are tightly coupled with data processing engines. If you are using multiple processing engines, that means maintaining multiple copies of the data and metadata persisted in different metastores. For example, when you create a table definition in Hive Metastore for querying from an open-source engine like Spark, you have to recreate the table definition to query the same data in BigQuery. You also have to build pipelines to keep table definitions synchronized across different metastores. This fragmentation can result in stale metadata, lack of visibility into data lineage, security and access challenges, and a subpar user experience.


A metastore for the lakehouse era
BigQuery metastore is designed for the lakehouse architecture, which combines the benefits of data lakes and data warehouses on a unified platform, without requiring you to manage both: any data, any user, any workload. It supports open data formats such as Apache Iceberg that are accessible by a variety of processing engines, including BigQuery, Spark, Flink and Hive. The unification of metadata across engines makes it easier to discover and use data, supporting self-service BI and ML tools to drive innovation, while maintaining data governance. 
Furthermore, BigQuery metastore is serverless with no setup or configuration required and automatically scales with your workloads. This no-ops environment reduces TCO and democratizes your data for data analysts, data engineers and data scientists.

Key benefits of BigQuery metastore include:

Cross-engine interoperability: BigQuery metastore provides a single shared metastore for the lakehouse architecture, with a unified view of all metadata for all data sources in the lakehouse, making it easy for your users to find and understand the data they need. This enables query processing and DML for data stored in open and proprietary formats across object stores, BigQuery storage, and across analytics runtimes.

Support for open formats and catalogs: BigQuery metastore provides support for BigQuery storage tables, BigQuery tables for Apache Iceberg and external tables. 

Built-in governance: BigQuery metastore is integrated with key governance capabilities provided in BigQuery, such as automated cataloging and universal search, business metadata, data profiling, data quality, fine-grained access controls, data masking, sharing, data lineage and audit logging. 

Fully managed at BigQuery scale: Being a serverless, fully managed service, BigQuery metastore is very easy to use and has integrations with key engines (BigQuery, Spark, Hive and Flink). The infrastructure foundation used for BigQuery metastore ensures that it scales to the growing query processing volume of your application and can handle traffic at BigQuery scale.

BigQuery metastore in action
Now, let’s take a look at how to use BigQuery metastore. The PySpark script below sets up a Spark environment to interact with a BigQuery storage table, a BigQuery table for Apache Iceberg, and a BigQuery external table. Detailed documentation is provided here.

```python
from pyspark.sql import SparkSession

# Create a Spark session configured to use BigQuery metastore as an Iceberg catalog
spark = SparkSession.builder \
    .appName("BigQuery Metastore Iceberg") \
    .config("spark.sql.catalog.CATALOG_NAME", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.CATALOG_NAME.catalog-impl", "org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog") \
    .config("spark.sql.catalog.CATALOG_NAME.gcp_project", "PROJECT_ID") \
    .config("spark.sql.catalog.CATALOG_NAME.gcp_location", "LOCATION") \
    .config("spark.sql.catalog.CATALOG_NAME.warehouse", "WAREHOUSE_DIRECTORY") \
    .getOrCreate()
spark.conf.set("viewsEnabled", "true")

# Use the catalog and dataset
spark.sql("USE `CATALOG_NAME`;")
spark.sql("USE NAMESPACE DATASET_NAME;")

# Configure a namespace for temporary results
spark.sql("CREATE NAMESPACE IF NOT EXISTS MATERIALIZATION_NAMESPACE")
spark.conf.set("materializationDataset", "MATERIALIZATION_NAMESPACE")

# List the tables in the dataset
df = spark.sql("SHOW TABLES;")
df.show()

# Query a BigQuery storage table
sql = """SELECT * FROM DATASET_NAME.TABLE_NAME"""
df = spark.read.format("bigquery").load(sql)
df.show()

# Query a BigQuery table for Apache Iceberg
sql = """SELECT * FROM DATASET_NAME.ICEBERG_TABLE_NAME"""
df = spark.read.format("bigquery").load(sql)
df.show()

# Query a BigQuery read-only Apache Iceberg external table
sql = """SELECT * FROM DATASET_NAME.READONLY_ICEBERG_TABLE_NAME"""
df = spark.read.format("bigquery").load(sql)
df.show()
```

To customize this script for your environment, replace the following placeholders:

PROJECT_ID: the ID of the Google Cloud project that contains your BigQuery resources

LOCATION: the location of your BigQuery resources

WAREHOUSE_DIRECTORY: the URI of the Cloud Storage folder that contains your data warehouse

CATALOG_NAME: the name of the catalog that you’re using

DATASET_NAME: the name of the BigQuery dataset (used as the namespace) that contains your tables

MATERIALIZATION_NAMESPACE: the namespace for storing temporary results

TABLE_NAME, ICEBERG_TABLE_NAME, READONLY_ICEBERG_TABLE_NAME: the names of the tables that you want to query
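If you configure several Spark jobs this way, it can help to generate the catalog settings programmatically. The sketch below is an illustrative helper, not part of any Google API: the function name and structure are my own, but the configuration keys mirror the PySpark example above.

```python
def bigquery_metastore_conf(catalog_name: str, project_id: str,
                            location: str, warehouse_dir: str) -> dict:
    """Assemble the Spark config entries for a BigQuery metastore
    Iceberg catalog. The keys mirror the PySpark script above; the
    helper itself is a hypothetical convenience, not a Google API."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl":
            "org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog",
        f"{prefix}.gcp_project": project_id,
        f"{prefix}.gcp_location": location,
        f"{prefix}.warehouse": warehouse_dir,
    }


# Example values are placeholders for your own project and bucket
conf = bigquery_metastore_conf("my_catalog", "my-project", "us-central1",
                               "gs://my-bucket/warehouse")
for key, value in conf.items():
    print(key, "=", value)
```

Each entry can then be applied when building the session, e.g. by chaining `SparkSession.builder.config(key, value)` over the dictionary items.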

Learn more
With BigQuery metastore, you now have a modern, serverless solution for your metadata management needs, enabling cross-engine interoperability with built-in governance. To try out BigQuery metastore today, see the documentation. If you would like to migrate from Dataproc Metastore to BigQuery metastore, see the documentation on migration tooling.

AI Summary and Description: Yes

**Summary:** The text details the introduction of BigQuery metastore, a fully managed metadata service designed for interoperability between various data processing engines such as BigQuery, Apache Spark, Apache Hive, and Apache Flink. It highlights the benefits of this service in unifying data governance and metadata management within a lakehouse architecture, making it a significant advancement for organizations handling large analytics workloads.

**Detailed Description:**
BigQuery metastore represents a significant advancement in metadata management for organizations utilizing multiple data processing engines. Here are the critical aspects discussed:

– **Unified Metadata Service**: BigQuery metastore serves as a single source of truth for analytics workloads across various engines, facilitating data governance and interoperability.

– **Scalable and Serverless**: It is a highly scalable service that requires no setup or configuration, automatically adjusting to workload demands. This feature helps reduce total cost of ownership (TCO) and allows data professionals to focus on analytics rather than infrastructure management.

– **Interoperability Across Engines**:
– Supports multiple engines such as BigQuery, Apache Spark, Apache Hive, and Apache Flink.
– Enables querying of data with a single schema across different storage types, including BigQuery tables, BigLake external tables, and Apache Iceberg formats.

– **Data Governance Features**:
– Integrated governance capabilities include automated cataloging, universal search, data profiling, and fine-grained access controls.
– Features like data lineage, audit logging, and data masking enhance security and compliance.

– **Migration to Lakehouse Architecture**: Designed for organizations looking to transition from legacy data lakes to a modern lakehouse framework, BigQuery metastore’s design simplifies the management of metadata across different data sources.

– **Key Benefits**:
– **Cross-engine interoperability**: Simplifies data discovery and enhances user experience.
– **Support for open formats**: Facilitates seamless access to data irrespective of its original source.
– **Comprehensive data governance**: Ensures security and compliance in data management processes.

By leveraging BigQuery metastore, organizations can streamline their data analytics processes while ensuring robust security and compliance mechanisms are in place. This service not only enhances operational efficiency but also democratizes data access for analytics teams. As such, it represents an essential tool for data-driven organizations aiming to innovate and derive insights from their data.

For further practical implementation, the text provides an example PySpark script that demonstrates how to interact with the BigQuery metastore, emphasizing its user-friendly approach for data engineers and analysts.