Source URL: https://cloud.google.com/blog/products/data-analytics/enhancing-biglake-for-iceberg-lakehouses/
Source: Cloud Blog
Title: BigLake evolved: Build open, high-performance, enterprise Iceberg-native lakehouses
Feedly Summary: Data management is changing. Enterprises need flexible, open, and interoperable architectures that allow multiple engines to operate on a single copy of data. Apache Iceberg has emerged as the leading open table format, but in real-world deployments, customers often face a dilemma: embrace the openness of Apache Iceberg but compromise on fully managed, enterprise-grade storage management, or choose managed storage but sacrifice the flexibility of open formats.
This week, we announced innovations in BigLake, a storage engine that provides the foundation for building open data lakehouses on Google Cloud. These innovations bring the best of Google's infrastructure to Apache Iceberg, eliminating the trade-off between open-format flexibility and high-performance, enterprise-grade managed storage. They include:
Open interoperability across analytical and transactional systems: Formerly known as BigQuery metastore, the fully managed, serverless, scalable BigLake metastore, now generally available (GA), simplifies runtime metadata management and works across BigQuery as well as other Iceberg-compatible engines. Powered by Google's planet-scale metadata management infrastructure, it removes the need to manage custom metastore deployments. We are also introducing support for the Iceberg REST Catalog API (Preview). The BigLake metastore provides the foundation for interoperability, allowing you to access all your Cloud Storage and BigQuery storage data across multiple runtimes, including BigQuery, AlloyDB (Preview), and open-source, Iceberg-compatible engines such as Spark and Flink.
New, high-performance Iceberg-native Cloud Storage: We are simplifying lakehouse management with automatic table maintenance (including compaction and garbage collection) and integration with Google Cloud Storage management tools, including auto-class tiering and encryption. Supercharge your lakehouse by combining open formats with BigQuery’s highly scalable, real-time metadata through the general availability (GA) of BigLake tables for Apache Iceberg in BigQuery, enabling high-throughput streaming, auto-reclustering, multi-table transactions (coming soon), and native integration with Vertex AI, so that you can harness the power of Google Cloud AI with your lakehouse.
AI-powered governance across Google Cloud: These BigLake updates are natively supported with Dataplex Universal Catalog, providing unified and fine-grained access controls across all supported engines and enabling end-to-end governance complete with comprehensive lineage, data quality, and discoverability capabilities.
With these changes, we’re evolving BigLake into a comprehensive storage engine designed to help you build open, high-performance, and enterprise-grade lakehouses on Google Cloud using Google Cloud services, open-source, and third-party Iceberg-compatible engines, eliminating trade-offs between open and managed solutions to accelerate your data and AI innovation.
“We wanted teams across the organization to access data in a consistent and secure way — no matter where it lived or what tools they were using. Google’s BigLake was a natural choice. It provides a unified layer to access data and fully managed experience with enterprise capabilities via BigQuery — whether it’s in open table formats like Apache Iceberg or traditional tables — all without the need to move or duplicate data. Metadata quality is essential as we continue to explore potential gen AI use cases. We are utilizing BigLake Metastore and Data Catalog to help maintain high quality metadata.” – Zenul Pomal, Executive Director, CME Group
Open and interoperable
The BigLake metastore is central to BigLake’s interoperability, providing two primary catalog interfaces to connect your data across Cloud Storage and BigQuery Storage:
The Iceberg REST Catalog (Preview) provides a standard REST interface for wider compatibility. This allows Spark users, for instance, to utilize the BigLake metastore as a serverless Iceberg catalog.
The Custom Iceberg Catalog (GA) enables Spark and other open-source engines to work with BigLake tables for Apache Iceberg and interoperate with BigQuery. Its implementation is directly integrated with public Iceberg libraries, removing the need for extra JAR files.
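To make the catalog interfaces concrete, here is a minimal PySpark sketch of configuring an Iceberg REST catalog (named `blms` here) against the BigLake metastore. This is an illustrative sketch, not the official setup: the Iceberg runtime version, endpoint URI, and warehouse bucket are placeholders, and authentication settings are omitted, so consult the BigLake metastore documentation for the exact connection properties.

```python
# Minimal sketch (assumptions noted in comments): configure Spark so the
# BigLake metastore acts as a serverless Iceberg REST catalog named "blms".
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("biglake-iceberg-rest-demo")
    # Iceberg Spark runtime on the classpath; the version here is illustrative.
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # A REST-backed Iceberg catalog. The endpoint URI and warehouse bucket are
    # placeholders -- check the BigLake metastore docs for the current REST
    # Catalog URI and the required authentication/project settings.
    .config("spark.sql.catalog.blms", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.blms.type", "rest")
    .config("spark.sql.catalog.blms.uri",
            "https://biglake.googleapis.com/iceberg/v1/restcatalog")
    .config("spark.sql.catalog.blms.warehouse", "gs://my_lake_bucket")
    .getOrCreate()
)

# Once configured, regular Spark SQL runs against the serverless catalog.
spark.sql("CREATE NAMESPACE IF NOT EXISTS blms.my_lake_ds")
spark.sql("SHOW TABLES IN blms.my_lake_ds").show()
```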
BigLake tables for Apache Iceberg created within BigQuery can be queried by open-source and third-party engines using native Apache Iceberg libraries. To enable this, BigLake automatically generates an Apache Iceberg V2 specification-compliant metadata snapshot. This snapshot is registered in the BigLake metastore, allowing open-source engines to query the data through the custom Iceberg catalog integration. Importantly, these metadata snapshots are kept current, refreshing automatically after any table modification (DML operations, data loads, streaming updates, or optimizations), helping to ensure that external engines work with the latest data.
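Building on the configuration sketch above, here is a brief, hedged example of an open-source engine reading a table that was created and modified in BigQuery. The `blms` catalog name is the placeholder introduced above, and `my_lake_ds.inventory_bq` borrows the names used in the BigQuery example later in this post.

```python
# Sketch: an OSS engine querying a BigLake table for Apache Iceberg in BigQuery
# through the catalog configured above (catalog name "blms" is a placeholder).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes the catalog config from the previous sketch

# Because BigQuery refreshes the Iceberg metadata snapshot after DML, loads,
# streaming writes, and optimizations, this read reflects the latest committed data.
spark.sql("SELECT item_id, qty FROM blms.my_lake_ds.inventory_bq").show()

# Standard Iceberg metadata tables should also be queryable against the snapshot,
# e.g. to inspect commit history (assumption based on Iceberg library behavior):
spark.sql(
    "SELECT snapshot_id, committed_at FROM blms.my_lake_ds.inventory_bq.snapshots"
).show()
```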
A key aspect of this enhanced interoperability is bridging analytical and transactional workloads. This is particularly powerful for AlloyDB users. Now, you can seamlessly consume your analytical BigLake tables for Apache Iceberg directly within AlloyDB (Preview). This enables PostgreSQL users to combine this rich analytical data with up-to-the-second transactional data from AlloyDB, powering AI-driven applications and real-time operational use cases by leveraging advanced AlloyDB features like semantic search, natural language interfaces, and an integrated AI query engine. This unified approach across BigQuery, AlloyDB, and open-source engines unlocks the platform value of your Iceberg data.
BigLake metastore: supported tables

| | BigLake tables for Apache Iceberg | BigLake tables for Apache Iceberg in BigQuery | BigQuery tables |
| --- | --- | --- | --- |
| Storage | Cloud Storage | Cloud Storage | BigQuery |
| Management | Google-managed | Google-managed | Google-managed |
| Read / Write capabilities (R/W) | OSS engines (R/W); BigQuery (R) | BigQuery (R/W); OSS engines (R/W) using BigQuery Storage API; OSS engines (R) using Iceberg libraries | BigQuery (R/W); OSS engines (R/W) using BigQuery Storage API |
| Use cases | Open lakehouse | Open lakehouse with enterprise-grade storage for advanced analytics, streaming and AI | Enterprise-grade storage for advanced analytics, streaming and AI |
New high-performance Iceberg-native storage
BigLake tables for Apache Iceberg deliver an Iceberg-native storage experience directly on Cloud Storage. Whether these tables are created using open-source engines like Spark or directly from BigQuery, they extend Cloud Storage management capabilities to your Iceberg data. This simplifies lakehouse management by enabling advanced Cloud Storage features such as auto-class tiering and Customer-Managed Encryption Keys (CMEK). To take full advantage of Cloud Storage management capabilities for your Iceberg data, refer to our best practices guide.
```sql
-- Use Spark to create a BigLake table for Apache Iceberg, registered in the BigLake metastore
CREATE TABLE orders_spark (id BIGINT, item STRING, amount DECIMAL(10,2))
USING iceberg
LOCATION 'gs://my_lake_bucket/orders_spark_data';

INSERT INTO orders_spark VALUES (1, 'Laptop', 1200.00);
```

```bash
# Optimize GCS storage costs for your Iceberg data (CLI)
gsutil autoclass set on gs://my_lake_bucket
```
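The paragraph above also mentions CMEK. As a small, hedged sketch that is not part of the original post, here is how a default Cloud KMS key could be attached to the lake bucket with the google-cloud-storage Python client; the project, key ring, and key names are placeholders.

```python
# Illustrative sketch: set a default customer-managed encryption key (CMEK)
# on the lake bucket so new objects written under gs://my_lake_bucket use it.
from google.cloud import storage

KMS_KEY = (  # placeholder Cloud KMS resource name
    "projects/my-project/locations/us/keyRings/lake-ring/cryptoKeys/lake-key"
)

client = storage.Client()
bucket = client.get_bucket("my_lake_bucket")
bucket.default_kms_key_name = KMS_KEY
bucket.patch()  # persist the bucket metadata change

print(f"Default CMEK for {bucket.name}: {bucket.default_kms_key_name}")
```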
Beyond the foundational Cloud Storage integration, you can leverage BigLake tables for Apache Iceberg in BigQuery. These tables, now generally available, combine open formats with BigQuery’s highly scalable, real-time metadata. This powerful combination unlocks a suite of advanced capabilities, including:
High-throughput streaming ingestion from various sources (like Spark, Flink, Dataflow, Pub/Sub, and Kafka) via BigQuery's Write API, scaling to tens of GiB/second with zero-latency reads (see the streaming sketch after the SQL example below)
Native integration with Vertex AI
Automated table management features like compaction and garbage collection
Performance optimizations such as auto-reclustering
Fine-grained DML and multi-table transactions (coming soon in preview).
This enterprise-ready, fully managed table experience, familiar to BigQuery users, maintains the openness and interoperability of Apache Iceberg to deliver the best of both worlds.
```sql
-- Create BigLake table for Apache Iceberg in BigQuery, stored on GCS
CREATE OR REPLACE TABLE my_lake_ds.inventory_bq (item_id STRING, qty INT64)
WITH CONNECTION `us.my_bl_connection`
OPTIONS (
  storage_uri = 'gs://my_lake_bucket/inventory_bq_data',
  table_format = 'ICEBERG',
  file_format = 'PARQUET'
);

INSERT INTO my_lake_ds.inventory_bq VALUES ('Laptop', 50);
UPDATE my_lake_ds.inventory_bq SET qty = 49 WHERE item_id = 'Laptop';

-- Perform multi-table transactions
BEGIN TRANSACTION;
  -- Example: Record a new order
  INSERT INTO my_lake_ds.orders_bq (id, item, amount) VALUES (2, 'Mouse', 25.00);
  -- Example: Update inventory for the ordered item
  UPDATE my_lake_ds.inventory_bq SET qty = qty - 1 WHERE item_id = 'Mouse';
COMMIT TRANSACTION;
```
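To make the streaming bullet above more concrete, here is a hedged Apache Beam sketch (runnable on Dataflow) that streams JSON events from Pub/Sub into the inventory_bq table created above through the BigQuery Storage Write API. This is an illustrative sketch rather than the original post's example: the project, subscription, and event shape are assumed placeholders.

```python
# Sketch: stream events into the BigLake table for Apache Iceberg in BigQuery
# via the BigQuery Storage Write API, using Apache Beam (e.g. on Dataflow).
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (
        p
        # Placeholder subscription; events look like {"item_id": "Mouse", "qty": 75}.
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/inventory-events")
        | "ParseJson" >> beam.Map(json.loads)
        | "WriteViaStorageWriteAPI" >> beam.io.WriteToBigQuery(
            table="my-project:my_lake_ds.inventory_bq",
            schema="item_id:STRING,qty:INT64",
            method=beam.io.WriteToBigQuery.Method.STORAGE_WRITE_API,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        )
    )
```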
AI-powered governance across Google Cloud
BigLake integrates natively with Dataplex Universal Catalog, helping to ensure that governance policies defined centrally in Dataplex are consistently enforced across multiple engines. This integration supports table-level access control for direct Cloud Storage access. Fine-grained access control is automatically available for queries within BigQuery; for open-source engines, it can be achieved using Storage API connectors.
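As a rough illustration of that connector path (not taken from the original post), the following PySpark sketch reads a table through the spark-bigquery connector, which goes through the BigQuery Storage Read API so that centrally defined fine-grained policies govern what Spark receives. The connector package version and project name are assumptions.

```python
# Sketch: an open-source engine reading through the BigQuery Storage Read API
# via the spark-bigquery connector, so fine-grained access policies defined
# centrally apply to the rows and columns returned to Spark.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("governed-read-demo")
    # Connector package/version is illustrative; use the one matching your
    # Spark version (or a Dataproc image that already bundles it).
    .config("spark.jars.packages",
            "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.36.1")
    .getOrCreate()
)

df = (
    spark.read.format("bigquery")
    .option("table", "my-project.my_lake_ds.inventory_bq")  # placeholder project
    .load()
)

# Only the columns and rows this principal is allowed to see are returned.
df.select("item_id", "qty").show()
```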
Beyond access management, BigLake’s Dataplex integration significantly enriches overall governance for BigQuery tables and BigLake tables for Apache Iceberg (created via the custom Iceberg catalog). Key capabilities include:
Comprehensive data understanding: Native support for search, discovery, profiling, data quality checks, and end-to-end data lineage within a multi-runtime architecture.
AI-powered exploration: Dataplex simplifies data exploration with AI-powered semantic search. Its knowledge graph also automatically suggests relevant questions using AI-generated insights for your BigQuery and Iceberg data, helping to jumpstart analysis.
Crucially, Dataplex’s end-to-end governance benefits apply to your Iceberg data seamlessly through BigLake’s native integration, without requiring separate registration or enablement steps.
What’s next
At Google Cloud Next '25, we demonstrated how fine-grained DML, multi-statement transactions, and change data capture support let you simplify your Apache Iceberg lakehouse for advanced data-processing use cases. These features will launch soon, and support for the remaining capabilities will continue to roll out over the coming months. In the meantime, explore BigLake capabilities and watch the latest demos on our webpage, or get started with BigLake tables for Apache Iceberg and the BigLake metastore using this guide.
AI Summary and Description: Yes
Summary: The text discusses the advancements in Google Cloud’s BigLake, which enhances data management by providing open, flexible architectures for Apache Iceberg. The innovations facilitate interoperability among different systems and improve enterprise-level storage capabilities without sacrificing flexibility. This development holds significant relevance for professionals focusing on cloud computing, data governance, and integration of AI solutions.
Detailed Description:
The passage outlines recent innovations in Google Cloud’s BigLake, an engine aimed at simplifying the management of open data lakehouses. Below are the major points highlighted in the text:
– **Introduction of BigLake**:
– BigLake serves as a bridge between different storage systems, enhancing the interoperability and management of data in cloud environments.
– It aims to combine the benefits of open formats like Apache Iceberg with managed storage capabilities.
– **Key Features of BigLake**:
– **Open Interoperability**:
– The fully managed BigLake Metastore allows for efficient metadata management across various data engines (e.g., BigQuery, Spark, Flink).
– Supports the Iceberg REST Catalog API for broad compatibility.
– **High-Performance Iceberg-Native Storage**:
– Integrates with Google Cloud Storage for improved table maintenance, including automatic compaction and encryption.
– Enables high-throughput streaming and complex transaction management, optimizing data handling and reducing latency.
– **AI-Powered Governance**:
– BigLake is equipped with capabilities for fine-grained access controls, ensuring data governance is maintained across all supported engines.
– It leverages Dataplex for comprehensive oversight, enabling users to manage data quality and lineage effectively.
– **Use Cases**:
– Offers solutions for both analytical and transactional workloads, providing a unified approach to data management.
– Supports real-time operational needs and AI-driven applications through seamless data access and processing capabilities.
– **Strategic Importance**:
– Addresses the need for enterprises to balance the flexibility of open data formats with the structure and performance of managed data solutions.
– Fosters an environment where advanced analytics and AI functionalities can flourish, empowering organizations to innovate swiftly with their data.
Overall, these advancements significantly bolster Google Cloud’s capabilities in data management, offering organizations a robust platform to operate their data lakehouses efficiently while ensuring security and compliance across various use cases.