Source URL: https://cloud.google.com/blog/products/data-analytics/dataplex-discovers-and-catalogs-cloud-storage-data/
Source: Cloud Blog
Title: Dataplex Automatic Discovery makes Cloud Storage data available for Analytics and governance
Feedly Summary: In today’s data- and AI-driven world, organizations are grappling with an ever-growing volume of structured and unstructured data. This growth makes it increasingly challenging to locate the right data at the right time, and a significant portion of enterprise data remains undiscovered or underutilized, which is often referred to as “dark data.” In fact, 66% of organizations report that at least half of their data falls into this category.
To address this challenge, today we’re announcing automatic discovery and cataloging of Google Cloud Storage data with Dataplex, part of BigQuery’s unified platform for intelligent data to AI governance. This powerful capability empowers organizations to:
Automatically discover valuable data assets residing within Cloud Storage, including structured and unstructured data such as documents, files, PDFs, images, and more.
Harvest and catalog metadata for your discovered assets, keeping schema definitions up to date as data evolves through built-in compatibility checks and partition detection.
Enable analytics for data science and AI use cases at scale with auto-created BigLake, external or object tables, eliminating the need for data duplication or manually creating table definitions.
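To make the idea of a schema compatibility check more concrete, here is a minimal Python sketch of one possible rule: a newly observed schema counts as compatible if it keeps every existing column with the same type and only adds columns. The additive-only rule, the function name `is_compatible`, and the dict-based schema representation are all assumptions made for illustration; the announcement does not describe how Dataplex implements its checks.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: column name -> type name.
Schema = Dict[str, str]


def is_compatible(old: Schema, new: Schema) -> Tuple[bool, List[str]]:
    """Return (compatible, problems) under an assumed additive-only rule."""
    problems: List[str] = []
    for column, old_type in old.items():
        if column not in new:
            problems.append(f"column removed: {column}")
        elif new[column] != old_type:
            problems.append(
                f"type changed for {column}: {old_type} -> {new[column]}"
            )
    # Columns that exist only in `new` are treated as safe additions.
    return (not problems, problems)


old_schema = {"order_id": "INT64", "amount": "FLOAT64"}
added_column = {"order_id": "INT64", "amount": "FLOAT64", "currency": "STRING"}
changed_type = {"order_id": "STRING", "amount": "FLOAT64"}

print(is_compatible(old_schema, added_column))
print(is_compatible(old_schema, changed_type))
```

Under this assumed rule, adding the `currency` column passes, while changing `order_id` from `INT64` to `STRING` is reported as a problem. A real check would likely also need to handle nested fields and type widening.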
How it works
The automatic discovery and cataloging process in Dataplex is designed to be integrated and efficient, and performs the following steps:
Discovery scan: You configure a discovery scan using the BigQuery Studio UI or the command line (gcloud). The scan covers a Cloud Storage bucket containing up to millions of files, identifying and classifying data assets.
Metadata extraction: Relevant metadata, including schema definitions and partition information, is extracted from the discovered assets.
Creation of dataset and tables in BigQuery: A new dataset with numerous BigLake, external or object tables (for unstructured data) is automatically created in BigQuery with accurate, up-to-date table definitions. For scheduled scans, these tables are updated as the data in the Cloud Storage bucket evolves.
Analytics and AI preparation: The published dataset and tables are available for analysis, processing, data science, and AI use cases in BigQuery, as well as open-source engines like Spark, Hive, and Pig.
Catalog integration: All BigLake tables are integrated into the Dataplex catalog, making them easily searchable and accessible.
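The partition-detection part of the steps above can be illustrated with a short Python sketch. It assumes a Hive-style layout (`<table>/<key>=<value>/.../<file>`), groups object paths by table prefix, and collects the partition keys seen under each prefix. The layout convention, the helper names `split_partitions` and `group_by_table`, and the sample paths are assumptions for illustration; this is not Dataplex's actual discovery logic, which the announcement does not detail.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def split_partitions(object_path: str) -> Tuple[str, Dict[str, str]]:
    """Split an object path into (table prefix, partition key/value pairs).

    Assumes a Hive-style layout: <table>/<key>=<value>/.../<file>.
    """
    *directories, _filename = object_path.split("/")
    prefix_parts: List[str] = []
    partitions: Dict[str, str] = {}
    for part in directories:
        key, sep, value = part.partition("=")
        if sep and key and value:
            partitions[key] = value
        elif not partitions:
            # Directories before the first key=value segment name the table.
            prefix_parts.append(part)
    return "/".join(prefix_parts), partitions


def group_by_table(object_paths: List[str]) -> Dict[str, List[str]]:
    """Map each table prefix to the sorted partition keys seen under it."""
    keys_by_table = defaultdict(set)
    for path in object_paths:
        prefix, partitions = split_partitions(path)
        keys_by_table[prefix].update(partitions)
    return {table: sorted(keys) for table, keys in keys_by_table.items()}


# Hypothetical object names inside a bucket.
paths = [
    "sales/year=2024/month=01/part-0.parquet",
    "sales/year=2024/month=02/part-0.parquet",
    "logs/app.json",
]
print(group_by_table(paths))
```

In this sketch, the two `sales` files collapse into one candidate table partitioned by `month` and `year`, while `logs` becomes an unpartitioned candidate. This is the kind of grouping that removes the need to write one table definition per folder by hand.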
Key benefits
Dataplex’s automatic discovery and cataloging feature offers a multitude of benefits for organizations:
Enhanced data visibility: Gain a clear understanding of your data and AI assets across Google Cloud, eliminating the guesswork and reducing the time spent searching for relevant information.
Reduced manual effort: Cut back on the toil and effort of creating table definitions manually by letting Dataplex scan the bucket and create numerous BigLake tables that correspond to your data in Cloud Storage.
Accelerated analytics and AI: Integrate the data that’s discovered into your analytics and AI workflows, unlocking valuable insights and driving informed decision-making.
Simplified data access: Provide authorized users with easy access to the data they need, while maintaining appropriate security and control measures.
For storage admins who are interested in Cloud Storage management and gaining insights into their entire storage estate, please refer to “Understand your Cloud Storage footprint with AI-powered queries and insights.”
Unlock your data’s potential
Automatic discovery and cataloging in Dataplex marks a significant step forward in helping organizations unlock the full potential of their data. By eliminating the challenges associated with dark data and providing a comprehensive, searchable catalog of your Cloud Storage assets, Dataplex empowers you to make data-driven decisions with confidence.
We encourage you to explore this powerful new feature and experience the benefits firsthand. To learn more and get started, please visit the Dataplex documentation or contact our team for assistance.
AI Summary and Description: Yes
Summary: The text discusses Google Cloud’s Dataplex feature that allows for the automatic discovery and cataloging of data within Google Cloud Storage. This innovation addresses the challenge of “dark data,” enhancing visibility and accessibility for organizations, and supporting AI-driven data analytics.
Detailed Description:
The provided content announces an automatic discovery and cataloging feature within Google Cloud’s Dataplex, part of BigQuery’s unified data platform. The capability helps organizations manage and use the growing volumes of structured and unstructured data they hold, especially data that remains undiscovered or underutilized, which the post calls “dark data.”
Key Points:
– **Challenge of Dark Data**:
  – A major concern for organizations is the prevalence of unutilized data, with 66% indicating that at least half of their data falls into this category.
  – Identifying and utilizing this data is critical for enhanced decision-making and analytics.
– **Automatic Discovery and Cataloging**:
  – **Discovery Scan**: Users can configure a discovery scan using BigQuery Studio or CLI, allowing for the identification of valuable data assets within Cloud Storage.
  – **Metadata Harvesting**: The feature extracts relevant metadata (like schema definitions and partition details) from the discovered data assets, ensuring that organizations have up-to-date information.
  – **BigQuery Integration**: Automatically creates datasets and tables in BigQuery, facilitating seamless analytics without data duplication or manual table definitions.
– **Analytics and AI Enhancements**:
  – The system enables large-scale analytics for data science and AI, with discoverable datasets becoming available for analysis across multiple engines (e.g., Spark, Hive).
– **User Accessibility and Security**:
  – Authorized users benefit from streamlined data access, while security measures are integrated to ensure control over data exposure and usability.
– **Key Benefits**:
  – **Enhanced Data Visibility**: Organizations achieve improved understanding and management of their data assets.
  – **Reduced Manual Effort**: The automation reduces the workload for data administrators, optimizing operational efficiency.
  – **Accelerated Analytics and AI**: The integration of data into analytics workflows is expedited, leading to quicker insights.
  – **Simplified Data Access**: Users can easily access the data they need, contributing to informed decision-making.
Overall, the introduction of Dataplex’s automatic discovery and cataloging feature represents a significant advancement for organizations in utilizing their cloud data effectively, addressing issues of data accessibility, governance, and analytics. This development is particularly relevant to professionals in cloud computing, data management, and AI who aim to harness data to drive strategic initiatives.