Source URL: https://cloud.google.com/blog/products/data-analytics/search-indexes-with-column-granularity-in-bigquery/
Source: Cloud Blog
Title: New column-granularity indexing in BigQuery offers a leap in query performance
Feedly Summary: BigQuery delivers optimized search/lookup query performance by efficiently pruning irrelevant files. However, in some cases, additional column information is required for search indexes to further optimize query performance. To help, we recently announced indexing with column granularity, which lets BigQuery pinpoint relevant data within columns, for faster search queries and lower costs.
BigQuery arranges table data into one or more physical files, each holding N rows. This data is stored in a columnar format, meaning each column has its own dedicated file block; you can learn more about this in the BigQuery Storage Internals blog. The default search index works at the file level: it maintains mappings from each data token to all the files that contain it, so at query time the index reduces the search space by scanning only the relevant files. This file-level indexing approach excels when search tokens are selective, appearing in only a few files. However, there are scenarios where search tokens are selective within specific columns but common across others; such tokens appear in most files, diminishing the effectiveness of file-level indexes.
For example, imagine a scenario where we have a collection of technical articles stored in a simplified table named TechArticles with two columns — Title and Content. And let’s assume that the data is distributed across four files, as shown below.
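For illustration, here is a minimal, hypothetical sketch of how such a table could be created and populated. The schema matches the example, but the sample article text is an assumption (not from the source), chosen only so that the token distribution matches the description that follows; in practice, BigQuery decides how rows are laid out across physical files.

CREATE TABLE myDataset.TechArticles (
  Title STRING,
  Content STRING
);

-- Hypothetical rows: only the first article carries "google", "cloud", and
-- "logging" together in its Title, while every article mentions them in Content.
INSERT INTO myDataset.TechArticles (Title, Content) VALUES
  ('Getting started with Google Cloud Logging',
   'Route, store, and query logs with Google Cloud Logging.'),
  ('Cloud cost optimization tips',
   'Reduce Google Cloud Logging spend with exclusion filters.'),
  ('Cloud networking basics',
   'Troubleshoot connectivity using Google Cloud Logging and VPC flow logs.'),
  ('Cloud storage classes explained',
   'Audit object access through Google Cloud Logging sinks.');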
Our goal is to search for articles specifically related to Google Cloud Logging. Note that:
The tokens "google", "cloud", and "logging" appear in every file.
All three tokens appear together in the "Title" column only in the first file.
Therefore, the combination of the three tokens is common overall, but highly selective in the "Title" column.
Now, let’s say, we create a search index on both columns of the table with the following DDL statement:
CREATE SEARCH INDEX myIndex ON myDataset.TechArticles(Title, Content);
The search index stores the mapping of data tokens to the data files containing the tokens, without any column information; the index looks like the following (showing the three tokens of interest: "google", "cloud", and "logging"):
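Roughly speaking, since each of these tokens appears in every file, the file-level index maps each one to all four files:

"google" -> file1, file2, file3, file4
"cloud" -> file1, file2, file3, file4
"logging" -> file1, file2, file3, file4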
With the usual query SELECT * FROM TechArticles WHERE SEARCH(Title, "Google Cloud Logging"), using the index without column information, BigQuery ends up scanning all four files, adding unnecessary processing and latency to your query.
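As an aside, one way to check whether a search index was actually used for a given query is to inspect the job's search statistics. A minimal sketch, assuming the search_statistics field exposed in the INFORMATION_SCHEMA.JOBS views (field names and availability may vary; the job ID below is a placeholder):

SELECT
  job_id,
  search_statistics.index_usage_mode,      -- e.g., FULLY_USED, PARTIALLY_USED, UNUSED
  search_statistics.index_unused_reasons
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_id = 'my_job_id';                -- hypothetical job ID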
Indexing with column granularity
Indexing with column granularity, a new public preview feature in BigQuery, addresses this challenge by adding column information in the indexes. This lets BigQuery leverage the indexes to pinpoint relevant data within columns, even when the search tokens are prevalent across the table’s files.
Let’s go back to the above example. Now we can create the index with COLUMN granularity as follows:
CREATE SEARCH INDEX myIndex ON myDataset.TechArticles(Title, Content)
OPTIONS (default_index_column_granularity = 'COLUMN');
The index now stores the column information associated with each data token, alongside the file mapping.
Using the same query SELECT * FROM TechArticles WHERE SEARCH(Title, "Google Cloud Logging") as above but using the index with column information, BigQuery now only needs to scan file1 since the index lookup is the intersection of the following:
Files where Token='google' AND Column='Title' (file1)
Files where Token='cloud' AND Column='Title' (file1, file2, file3, and file4)
Files where Token='logging' AND Column='Title' (file1)
Performance improvement benchmark results
We benchmarked query performance on a 1TB table containing Google Cloud Logging data of an internal Google test project with the following query:
SELECT COUNT(*)
FROM `dataset.log_1T`
WHERE SEARCH((logName, trace, labels, metadata), 'appengine');
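For context, an index that supports this query with column granularity would presumably be created along the following lines; this is a sketch, and the index name and exact options are assumptions rather than details taken from the benchmark:

CREATE SEARCH INDEX log_index
ON `dataset.log_1T`(logName, trace, labels, metadata)
OPTIONS (default_index_column_granularity = 'COLUMN');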
In this benchmark query, the token 'appengine' appears infrequently in the columns used for query filtering, but is more common in other columns. The default file-level search index already pruned a large portion of the search space, roughly halving execution time while reducing processed bytes and slot usage. With column-granularity indexing, the improvements are even more significant.
In short, column-granularity indexing in BigQuery offers the following benefits:
Enhanced query performance: By precisely identifying relevant data within columns, column-granularity indexing significantly accelerates query execution, especially for queries with selective search tokens within specific columns.
Improved cost efficiency: Index pruning results in reduced bytes processed and/or slot time, translating to improved cost efficiency.
This is particularly valuable in scenarios where search tokens are selective within specific columns but common across others, or where queries frequently filter or aggregate data based on specific columns.
Best practices and getting started
Indexing with column granularity represents a significant advancement in BigQuery’s indexing capabilities, letting you achieve greater query performance and cost efficiency.
For best results, consider the following best practices:
Identify high-impact columns: Analyze your query patterns to identify columns that are frequently used in filters or aggregations and would benefit from column-granularity indexing.
Monitor performance: Continuously monitor query performance and adjust your indexing strategy as needed.
Consider indexing and storage costs: While column-granularity indexing can optimize query performance, be mindful of potential increases in indexing and storage costs.
To get started, simply enable indexing with column granularity. For more information, refer to the CREATE SEARCH INDEX DDL documentation.
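If a table already has a file-level search index, one way to adopt column granularity is to drop and recreate the index with the new option. A minimal sketch using the earlier TechArticles example:

DROP SEARCH INDEX myIndex ON myDataset.TechArticles;
CREATE SEARCH INDEX myIndex ON myDataset.TechArticles(Title, Content)
OPTIONS (default_index_column_granularity = 'COLUMN');

Because the index is rebuilt asynchronously, you can watch its progress in the dataset's INFORMATION_SCHEMA.SEARCH_INDEXES view (assuming the index_status and coverage_percentage columns):

SELECT index_name, index_status, coverage_percentage
FROM myDataset.INFORMATION_SCHEMA.SEARCH_INDEXES
WHERE table_name = 'TechArticles';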
AI Summary and Description: Yes
Summary: The text discusses BigQuery’s new feature of indexing with column granularity, which enhances query performance by optimizing the search process in data analysis. This innovation is especially relevant for professionals in cloud computing, data analytics, and security, as it addresses efficiency and cost-effectiveness in data processing.
Detailed Description: The content addresses the enhancements introduced in BigQuery regarding indexing with column granularity. This feature directly impacts how search queries are executed, particularly in cloud environments, making it an essential topic for professionals focused on cloud computing and data security. Here are the key points:
– **Current Indexing Limitations**: BigQuery traditionally uses file-level indexing, which maintains mappings for data tokens to files, but runs into inefficiencies when common tokens exist in a large subset of files.
– **Introduction of Column Granularity**:
– The new column-granularity indexing allows BigQuery to refine its search capabilities further by incorporating column-specific information into the indexes.
– This means that during queries, relevant files can be pinpointed, dramatically reducing the data that needs to be scanned.
– **Benchmark Performance Improvements**:
– The performance metrics highlighted a significant decrease in execution time and resource usage when using column-granularity indexing, emphasizing its role in cost efficiency.
– **Benefits of Column-Granularity Indexing**:
– **Enhanced Query Performance**: This indexing is ideal for queries with selective search tokens within individual columns, leading to faster execution.
– **Improved Cost Efficiency**: Reduced bytes processed during queries lead to lower operational costs, making this feature valuable for organizations handling large datasets.
– **Best Practices for Implementation**:
– Identify high-impact columns that frequently appear in search queries.
– Continuously monitor and adapt indexing strategies based on observed query performance.
– Be mindful of the potential costs related to both indexing and storage when adopting this feature.
– **Getting Started**: Users are encouraged to enable column-granularity indexing to take full advantage of the performance improvements and refer to the official documentation for detailed implementation instructions.
This feature is particularly significant for data security professionals and cloud computing experts, as it illustrates how advancements in indexing can lead to more efficient data management practices within cloud environments, enhancing both performance and cost-effectiveness when managing sensitive data.