Source URL: https://cloud.google.com/blog/products/data-analytics/how-gemini-in-bigquery-helps-with-data-engineering-tasks/
Source: Cloud Blog
Title: How to use gen AI for better data schema handling, data quality, and data generation
Feedly Summary: In the realm of data engineering, generative AI models are quietly revolutionizing how we handle, process, and ultimately utilize data. For example, large language models (LLMs) can help with data schema handling, data quality, and even data generation.
Building on the recently released Gemini in BigQuery data preparation capabilities, this blog showcases areas where gen AI models are making a significant impact in data engineering: automating schema management, improving data quality, and generating synthetic and structured data from diverse sources. Practical examples and code snippets are included throughout.
1. Data schema handling: Integrating new datasets
Data movement and maintenance are ongoing challenges for every data engineering team. Whether it's moving data between systems with different schemas or integrating new datasets into existing data products, the process can be complex and error-prone. This is often exacerbated when dealing with legacy systems; in fact, 32% of organizations cite migrating data and applications as their biggest challenge, according to Flexera's 2024 State of the Cloud Report.
Gen AI models offer a powerful solution by helping automate schema mapping and transformation on an ongoing basis. Imagine migrating customer data from a legacy CRM system to a new platform, and combining it with additional external datasets in BigQuery. The schemas likely differ significantly, requiring intricate mapping of fields and data types. Gemini, our most capable AI model family to date, can analyze both schemas and generate the necessary transformation logic, significantly reducing manual effort and potential errors.
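As a simplified illustration of that idea (a sketch only, with a hypothetical project ID, model name, and schemas), you could prompt Gemini through the Vertex AI Python SDK to propose the transformation SQL and then review it before running it:

```python
# Sketch: ask Gemini to propose BigQuery transformation SQL between two schemas.
# The project ID, model name, and schemas below are illustrative.
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="my-project", location="us-central1")
model = GenerativeModel("gemini-1.5-pro")

source_schema = "cust_id INT64, cust_nm STRING, sgnup_dt STRING (MM/DD/YYYY), phone_no STRING"
target_schema = "customer_id INT64, full_name STRING, signup_date DATE, phone_e164 STRING"

prompt = f"""
You are a data engineer. Given the legacy CRM schema:
{source_schema}
and the target BigQuery schema:
{target_schema}
write a BigQuery SQL SELECT statement that maps each source column to its
target column, including any needed type or format conversions.
Return only the SQL.
"""

response = model.generate_content(prompt)
print(response.text)  # Review the proposed SQL before executing it.
```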
A common approach to data schema handling that we’ve seen from data engineering teams involves creating a lightweight application that receives messages from Pub/Sub, retrieves relevant dataset information from BigQuery and Cloud Storage, and uses the Vertex AI Gemini API to map source fields to target fields and assign a confidence score. Here is example code showing a FunctionDeclaration to perform the mapping-confidence task:
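A minimal sketch of such a FunctionDeclaration, built with the Vertex AI Python SDK, is shown below; the function name, parameter schema, field names, and model name are illustrative assumptions rather than the exact code from any production application:

```python
# Sketch: a function declaration Gemini can "call" to return structured
# field mappings with confidence scores (names and schema are illustrative).
import vertexai
from vertexai.generative_models import FunctionDeclaration, GenerativeModel, Tool

vertexai.init(project="my-project", location="us-central1")

map_fields = FunctionDeclaration(
    name="map_source_to_target_fields",
    description=(
        "Map each source field to the best-matching target field "
        "and assign a confidence score between 0 and 1."
    ),
    parameters={
        "type": "object",
        "properties": {
            "mappings": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "source_field": {"type": "string"},
                        "target_field": {"type": "string"},
                        "confidence": {"type": "number"},
                    },
                    "required": ["source_field", "target_field", "confidence"],
                },
            }
        },
        "required": ["mappings"],
    },
)

schema_mapping_tool = Tool(function_declarations=[map_fields])
model = GenerativeModel("gemini-1.5-pro", tools=[schema_mapping_tool])

response = model.generate_content(
    "Source fields: cust_id, cust_nm, sgnup_dt. "
    "Target fields: customer_id, full_name, signup_date. "
    "Call map_source_to_target_fields with your best mapping."
)

# If the model responds with a function call, its arguments hold the
# proposed mappings and confidence scores as structured data.
part = response.candidates[0].content.parts[0]
print(part.function_call.args)
```

Because Gemini returns the mapping as structured function-call arguments rather than free text, the application can parse the proposed mappings and confidence scores directly and route low-confidence fields to a human for review.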