Source URL: https://cloud.google.com/blog/products/data-analytics/bigquery-and-document-ai-layout-parser-for-document-preprocessing/
Source: Cloud Blog
Title: How to simplify building RAG pipelines in BigQuery with Document AI Layout Parser
Feedly Summary: Document preprocessing is a common hurdle when building retrieval-augmented generation (RAG) pipelines. It often requires Python skills and external libraries to parse documents like PDFs into manageable chunks that can be used to generate embeddings. In this blog, we look at new capabilities within BigQuery and Document AI that simplify this process, and take you through a step-by-step example.
Streamline document processing in BigQuery
BigQuery now offers the ability to preprocess documents for RAG pipelines and other document-centric applications through its close integration with Document AI. The ML.PROCESS_DOCUMENT function, now generally available, can access new processors, including Document AI’s Layout Parser, allowing you to parse and chunk PDF documents using SQL.
The general availability of ML.PROCESS_DOCUMENT brings developers several benefits:
Improved scalability: handle larger documents (up to 100 pages) and process them faster
Simplified syntax: a streamlined SQL syntax makes it easier to interact with Document AI processors and integrate them into your RAG workflows
Document chunking: access to additional Document AI processor capabilities, like Layout Parser, to generate the document chunks that RAG pipelines require
In particular, document chunking is a critical — but challenging — component of building a RAG pipeline. Layout Parser in Document AI simplifies this process. We’ll explore how this works in BigQuery and demonstrate its effectiveness using a real-world scenario shortly.
Document preprocessing for RAG
Breaking down large documents into smaller, semantically related units improves the relevance of the retrieved information, leading to more accurate answers from a large language model (LLM).
Generating metadata alongside chunks, such as document source, chunk location, and structural information, can further enhance your RAG pipeline, making it possible to filter and refine your search results and to debug your code.
The following diagram provides a high-level overview of the preprocessing steps in a basic RAG pipeline:
Build a RAG pipeline in BigQuery
Comparing financial documents like earnings statements can be challenging due to their complex structure and mix of text, figures, and tables. Let’s demonstrate how to build a RAG pipeline in BigQuery using Document AI’s Layout Parser to analyze the Federal Reserve’s 2023 Survey of Consumer Finances (SCF) report. Feel free to follow along in the notebook here.
Dense financial documents like the Federal Reserve’s SCF report present significant challenges for traditional parsing techniques. This document spans nearly 60 pages and contains a mix of text, detailed tables, and embedded charts, making it difficult to reliably extract information. Document AI’s Layout Parser excels in these scenarios, effectively identifying and extracting key information from complex document layouts such as these.
Building a BigQuery RAG pipeline with Document AI’s Layout Parser consists of the following broad steps.
1. Create a Layout Parser processor
In Document AI, create a new processor with the type LAYOUT_PARSER_PROCESSOR. Then create a remote model in BigQuery that points to this processor, allowing BigQuery to access and process the documents.
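For reference, creating the remote model looks something like the following. This is a minimal sketch: the connection name, project, and processor ID are placeholders you would replace with your own.

```sql
-- Remote model pointing at a Document AI Layout Parser processor.
-- The connection `us.docai_connection` and the processor path are placeholders.
CREATE OR REPLACE MODEL `docai_demo.layout_parser`
REMOTE WITH CONNECTION `us.docai_connection`
OPTIONS (
  remote_service_type = 'CLOUD_AI_DOCUMENT_V1',
  document_processor = 'projects/YOUR_PROJECT/locations/us/processors/YOUR_PROCESSOR_ID'
);
```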
2. Call the processor to create chunks
To access PDFs in Google Cloud Storage, begin by creating an object table over the bucket that holds the PDFs. Then use the ML.PROCESS_DOCUMENT function to pass the objects through to Document AI and return the results in BigQuery. Document AI analyzes each document and chunks the PDF. The results are returned as JSON objects and can easily be parsed to extract metadata like source URI and page number.
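The object table can be created with a statement like the one below; this is a sketch, and the connection name and bucket URI are placeholders.

```sql
-- Object table exposing the PDFs in Cloud Storage to BigQuery.
-- The connection and the gs:// URI are placeholders.
CREATE OR REPLACE EXTERNAL TABLE docai_demo.demo
WITH CONNECTION `us.docai_connection`
OPTIONS (
  object_metadata = 'SIMPLE',
  uris = ['gs://your-bucket/reports/*.pdf']
);
```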
Sample image of a table element in the Survey of Consumer Finances report (p.12)
```sql
SELECT *
FROM ML.PROCESS_DOCUMENT(
  MODEL docai_demo.layout_parser,
  TABLE docai_demo.demo,
  PROCESS_OPTIONS => (
    JSON '{"layout_config": {"chunking_config": {"chunk_size": 300}}}'
  )
);
```
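To pull individual chunks and their metadata out of the returned JSON, you can unnest the chunks array. The sketch below assumes the ML.PROCESS_DOCUMENT output was saved to a table named docai_demo.chunked; the JSON field names follow the Layout Parser output schema.

```sql
-- Extract chunk text and page spans from the Layout Parser JSON.
-- docai_demo.chunked is an assumed table holding the ML.PROCESS_DOCUMENT output.
SELECT
  uri,
  JSON_EXTRACT_SCALAR(chunk, '$.chunkId') AS chunk_id,
  JSON_EXTRACT_SCALAR(chunk, '$.content') AS content,
  JSON_EXTRACT_SCALAR(chunk, '$.pageSpan.pageStart') AS page_start,
  JSON_EXTRACT_SCALAR(chunk, '$.pageSpan.pageEnd') AS page_end
FROM
  docai_demo.chunked,
  UNNEST(JSON_EXTRACT_ARRAY(ml_process_document_result.chunkedDocument.chunks, '$')) AS chunk;
```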
Sample document chunk after parsing the JSON from ML.PROCESS_DOCUMENT
3. Create vector embeddings for the chunks
To enable semantic search and retrieval, we’ll generate embeddings for each document chunk using the ML.GENERATE_EMBEDDING function and write them to a BigQuery table (a minimal sketch follows this list). This function takes two arguments:
A remote model, which calls a Vertex AI embedding endpoint
A column from a BigQuery table that contains data for embedding
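Putting the two together might look like the following sketch. The connection and endpoint names are assumptions, and docai_demo.chunked_text stands in for a table of parsed chunks with content and uri columns.

```sql
-- Remote model over a Vertex AI text embedding endpoint (names are assumptions).
CREATE OR REPLACE MODEL `docai_demo.embedding_model`
REMOTE WITH CONNECTION `us.docai_connection`
OPTIONS (endpoint = 'text-embedding-004');

-- Embed every chunk and persist the vectors.
-- docai_demo.chunked_text is an assumed table of parsed chunks.
CREATE OR REPLACE TABLE docai_demo.embeddings AS
SELECT *
FROM ML.GENERATE_EMBEDDING(
  MODEL `docai_demo.embedding_model`,
  (SELECT content, uri FROM docai_demo.chunked_text)
);
```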
Vector embeddings in BigQuery returned from ML.GENERATE_EMBEDDING
4. Create a vector index on the embeddings
To search the chunks efficiently by semantic similarity, we’ll create a vector index on the embeddings. Without a vector index, a search must compare the query embedding to every embedding in your dataset, which is computationally expensive and slow when dealing with a large number of chunks. Vector indexes use techniques like approximate nearest neighbor search to speed up this process.
```sql
CREATE VECTOR INDEX my_index
ON docai_demo.embeddings(ml_generate_embedding_result)
OPTIONS (
  index_type = 'TREE_AH',
  distance_type = 'EUCLIDEAN'
);
```
5. Retrieve relevant chunks and send to the LLM for answer generation
Now we can perform a vector search to find chunks that are semantically similar to our input query. In this case, we ask how typical family net worth changed over the three years the report covers.
```sql
SELECT
  ml_generate_text_llm_result AS generated,
  prompt
FROM
  ML.GENERATE_TEXT(
    MODEL `docai_demo.gemini_flash`,
    (
      SELECT
        CONCAT(
          'Did the typical family net worth change? How does this compare to the SCF survey a decade earlier? Be concise and use the following context:',
          STRING_AGG(FORMAT('context: %s and reference: %s', base.content, base.uri), ',\n')
        ) AS prompt
      FROM
        VECTOR_SEARCH(
          TABLE `docai_demo.embeddings`,
          'ml_generate_embedding_result',
          (
            SELECT
              ml_generate_embedding_result,
              content AS query
            FROM
              ML.GENERATE_EMBEDDING(
                MODEL `docai_demo.embedding_model`,
                (
                  SELECT 'Did the typical family net worth increase? How does this compare to the SCF survey a decade earlier?' AS content
                )
              )
          ),
          top_k => 10,
          OPTIONS => '{"fraction_lists_to_search": 0.01}'
        )
    ),
    STRUCT(512 AS max_output_tokens, TRUE AS flatten_json_output)
  );
```
The retrieved chunks are then sent through the ML.GENERATE_TEXT function, which calls the Gemini 1.5 Flash endpoint and generates a concise answer to our question.
And we get an answer: median family net worth increased 37% from 2019 to 2022, a significant increase compared with the period a decade earlier, which saw a 2% decrease. Notice that if you check the original document, this information is spread across text, tables, and footnotes, areas that have traditionally been tricky to parse and reason over together!
This example demonstrated a basic RAG flow, but real-world applications often require continuous updates. Imagine a scenario where new financial reports are added daily to a Cloud Storage bucket. To keep your RAG pipeline up-to-date, consider using BigQuery Workflows or Cloud Composer to incrementally process new documents and generate embeddings in BigQuery. Vector indexes are automatically refreshed when the underlying data changes, ensuring that you always query the most current information.
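As one illustration of an incremental step, a scheduled query could embed only the chunks that don’t yet have vectors. This is a sketch, and the table and column names carry over from the earlier assumed examples.

```sql
-- Hypothetical incremental refresh: embed only chunks whose URI is not yet
-- present in the embeddings table. Table and column names are assumptions.
INSERT INTO docai_demo.embeddings
SELECT *
FROM ML.GENERATE_EMBEDDING(
  MODEL `docai_demo.embedding_model`,
  (
    SELECT content, uri
    FROM docai_demo.chunked_text
    WHERE uri NOT IN (SELECT uri FROM docai_demo.embeddings)
  )
);
```

A scheduled query, a BigQuery workflow, or a Cloud Composer DAG could run a statement like this whenever new PDFs land in the bucket.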
Get started with document parsing in BigQuery
The integration of Document AI’s Layout Parser with BigQuery makes it easier for developers to build powerful RAG pipelines. By leveraging ML.PROCESS_DOCUMENT and other BigQuery machine learning functions, you can streamline document preprocessing, generate embeddings, and perform semantic search — all within BigQuery using SQL.
Ready to dive deeper? Explore the resources and documentation linked below:
Learn more about the BigQuery ML.PROCESS_DOCUMENT function
Process documents with Layout Parser in Document AI
Access and run the complete notebook for the sample scenario above
If you have feedback or questions, let us know at bqml-feedback@google.com.
AI Summary and Description:
Summary: The text discusses advancements in BigQuery and Document AI that facilitate the document preprocessing necessary for building retrieval-augmented generation (RAG) pipelines. It highlights how the new ML.PROCESS_DOCUMENT function and Layout Parser processor improve document handling, making it more scalable and easier to create embeddings for semantic search.
Detailed Description:
The provided content elaborates on the integration of BigQuery with Document AI, which significantly enhances document preprocessing for retrieval-augmented generation (RAG) pipelines. This innovation is particularly relevant for professionals in AI and cloud computing, focusing on the following key points:
– **Document Preprocessing Challenges**: Traditional methods for processing complex documents (like PDFs) into manageable data chunks require programming knowledge and can be cumbersome.
– **New Capabilities**:
– **BigQuery Integration**: The ML.PROCESS_DOCUMENT function has been introduced to improve document parsing capabilities, simplifying workflows with SQL syntax.
– **Layout Parser**: A new processor available in Document AI that efficiently parses and chunks large documents, making them suitable for RAG applications.
– **Benefits of ML.PROCESS_DOCUMENT**:
– **Scalability**: Capable of handling larger documents (up to 100 pages) more rapidly.
– **Simplified Syntax**: SQL syntax allows for straightforward integration with document AI processors.
– **Enhanced Document Chunking**: Facilitates the creation of semantically relevant document chunks critical for RAG pipeline success.
– **Use Case**:
– The example within the text highlights how to build a RAG pipeline to analyze complex financial documents. This demonstrates the practical application of these features in real scenarios, such as parsing the Federal Reserve’s Survey of Consumer Finances report.
– **Steps for Building a RAG Pipeline**:
1. **Create a Layout Parser processor**: Set it up in Document AI and point a BigQuery remote model at it.
2. **Chunk creation**: Use ML.PROCESS_DOCUMENT to parse and chunk PDF documents.
3. **Generate Embeddings**: Create vector embeddings for each document chunk using the ML.GENERATE_EMBEDDING function.
4. **Create a Vector Index**: To efficiently perform semantic searches across document chunks.
5. **Retrieve Relevant Data**: Utilize vector search to find and generate answers from relevant document chunks.
– **Continuous Updates**: The text emphasizes the necessity for RAG pipelines to adapt continuously, proposing services like BigQuery Workflows or Cloud Composer for optimizing document processing.
Key Insights for Security and Compliance Professionals:
– Understanding the capabilities and enhancements in document processing can significantly improve data handling and compliance with data governance.
– The integration of sophisticated AI tools in cloud environments raises questions about data security, requiring professionals to ensure that documents—potentially containing sensitive information—are processed and stored in compliance with regulations.
– As companies rely more on AI for decision-making processes, professionals must ensure that semantic search integrity remains intact to avoid misinformation derived from improperly parsed data.
Overall, the advancements discussed in the text illustrate both a technological leap in document processing and the implications for secure and compliant handling of potentially sensitive data in today’s AI-driven ecosystems.