Cloud Blog: Using BigQuery Omni to reduce log ingestion and analysis costs in a multi-cloud environment

Source URL: https://cloud.google.com/blog/products/data-analytics/bigquery-omni-to-reduce-the-cost-of-log-analytics/
Source: Cloud Blog
Title: Using BigQuery Omni to reduce log ingestion and analysis costs in a multi-cloud environment

Feedly Summary: In today’s data-centric businesses, it’s not uncommon for companies to operate hundreds of individual applications across a variety of platforms. These applications can produce a massive volume of logs, presenting a significant challenge for log analytics. Additionally,  the broad adoption of multi-cloud solutions complicates accuracy and retrieval, as the distributed nature of the logs can inhibit the ability to extract meaningful insights.
BigQuery Omni was designed to effectively solve this challenge, and help reduce the overall costs when compared to a traditional approach. This blog post will dive into the details. 
Log analysis involves various steps, namely:

Log data collection: Collects log data from the organization's infrastructure and/or applications. A common approach is to collect this data in JSONL file format and save it in an object storage service such as Google Cloud Storage. In a multi-cloud environment, moving raw log data between clouds can be cost-prohibitive.

Log data normalization: Different applications and infrastructure generate different JSONL files. Each file has its own set of fields linked to the application/infrastructure that created it. To facilitate data analysis, these different fields are unified into a common set, allowing data analysts to conduct analyses efficiently and comprehensively across the entire environment.

Indexing and storage: Normalized data should be stored efficiently to reduce storage and query costs and to increase query performance. A common approach is to store logs in a compressed columnar file format such as Parquet.

Querying and visualization: Allows organizations to execute analytics queries to identify anomalies, anti-patterns, or known threats present in the log data.

Data lifecycle: As log data ages, its utility decreases, while still incurring storage costs. To optimize expenses, it’s crucial to establish a data lifecycle process. A widely adopted strategy involves archiving logs after a month (querying log data older than a month is uncommon) and deleting them after a year. This approach effectively manages storage costs while ensuring that essential data remains accessible.

A common architecture
To implement log analysis in a multi-cloud environment, many organizations implement the following architecture:

This architecture has its pros and cons. 
On the plus side: 

Data lifecycle: It’s relatively easy to implement data lifecycle management by leveraging existing features from object storage solutions. For example, in Cloud Storage you can define the following data lifecycle policy: (a) delete any object older than a week — you can use it to delete your JSONL files available during the Collection step; (b) archive any object older than a month — you can use this policy for your Parquet files; and (c) delete any object older than a year — also for your Parquet files.

Low egress costs: By keeping the data local, you avoid sending high volumes of raw data between cloud providers.

On the con side:

Log data normalization: As you collect logs from different applications, you will need to code and maintain a separate Apache Spark workload for each one. In an age where (a) engineers are a scarce resource, and (b) microservices adoption is growing rapidly, it's a good idea to avoid this.

Querying: Spreading your data across different cloud providers drastically reduces your analysis and visualization capabilities.

Querying: Excluding archived files created earlier in the data lifecycle is non-trivial and prone to human error when relying on WHERE clauses to avoid partitions that contain archived files. One solution is to use Iceberg tables and manage the table manifest by adding and removing partitions as needed. However, manually manipulating the Iceberg table manifest is complicated, and using a third-party solution just increases costs.

An improved architecture
Based on these factors, an improved solution would be to use BigQuery Omni to handle all these problems as presented in the architecture below.

One of the core benefits of this approach is that it eliminates the separate Spark workloads, along with the software engineers needed to code and maintain them. Another benefit is that a single product (BigQuery) handles the entire process, apart from storage and visualization. You also gain benefits related to cost savings. We'll explain each of these points in detail below.
A simplified normalization process
BigQuery's ability to create an external table pointing to JSONL files and automatically determine their schema adds significant value. This feature is particularly useful when dealing with numerous log schema formats. For each application, a straightforward CREATE EXTERNAL TABLE statement can be defined to access its JSONL content. From there, you can schedule BigQuery to export the JSONL external table into compressed Parquet files partitioned by hour in Hive format. The query below is an example of an EXPORT DATA statement that can be scheduled to run every hour. The SELECT statement of this query captures only the log data ingested during the last hour and converts it into a Parquet file with normalized fields.

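Since the exact statement depends on each application's schema, the sketch below only illustrates the pattern. The connection, dataset, bucket, and field names (for example `my_omni_connection`, `logs.app_a_raw_jsonl`, and `event_time`) are hypothetical, and the Hive-style partition path is hard-coded for readability; a scheduled run would compute the previous hour's values.

```sql
-- Sketch only: connection, dataset, bucket, and field names are hypothetical.

-- 1) External table over one application's raw JSONL logs, defined in a
--    dataset located in the AWS region. Omitting the column list lets
--    BigQuery auto-detect the schema.
CREATE EXTERNAL TABLE IF NOT EXISTS `my_project.logs.app_a_raw_jsonl`
WITH CONNECTION `aws-us-east-1.my_omni_connection`
OPTIONS (
  format = 'NEWLINE_DELIMITED_JSON',
  uris = ['s3://my-raw-log-bucket/app-a/*.jsonl']
);

-- 2) Hourly export of the last hour of logs as compressed Parquet files,
--    written back to S3 under a Hive-style dt=/hr= prefix.
EXPORT DATA
WITH CONNECTION `aws-us-east-1.my_omni_connection`
OPTIONS (
  uri = 's3://my-normalized-log-bucket/app-a/dt=2024-01-01/hr=13/*.parquet',
  format = 'PARQUET',
  compression = 'SNAPPY'
) AS
SELECT
  CAST(ts AS TIMESTAMP) AS event_time,        -- map source-specific fields to
  severity              AS log_level,         -- the normalized, common schema
  service               AS source_application,
  message               AS log_message
FROM `my_project.logs.app_a_raw_jsonl`
WHERE CAST(ts AS TIMESTAMP) >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);
```

In practice, both the WHERE filter and the export URI would be driven by the scheduler's run time, so that each run writes exactly one hourly partition.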

A unified querying process across cloud providers
Having a single data warehouse platform that spans multiple cloud providers already brings benefits to the querying process, but BigQuery Omni can also execute cross-cloud joins, a game changer for log analytics. Before BigQuery Omni, combining log data from different cloud providers was a challenge: due to the volume of data, sending the raw data to a single primary cloud provider generates significant egress costs, while pre-processing and filtering it reduces your ability to perform analytics on it. With cross-cloud joins, you can run a single query across multiple clouds and analyze the results.
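As an illustration, a cross-cloud join reads like an ordinary join between datasets that happen to live in different clouds; the project, dataset, and column names below are hypothetical:

```sql
-- Hypothetical datasets: `aws_logs.normalized` is S3-backed in an AWS Omni
-- region, while `gcp_logs.normalized` lives in a regular BigQuery region.
-- Only the filtered join result moves between clouds, not the raw logs.
SELECT
  a.trace_id,
  a.event_time  AS aws_event_time,
  g.event_time  AS gcp_event_time,
  a.log_message AS aws_message,
  g.log_message AS gcp_message
FROM `my_project.aws_logs.normalized` AS a
JOIN `my_project.gcp_logs.normalized` AS g
  ON a.trace_id = g.trace_id
WHERE a.event_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND a.log_level = 'ERROR';
```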
A reduced total cost of ownership
The final, and probably most important, benefit of this architecture is that it helps reduce the total cost of ownership (TCO). This can be measured in three ways:

Reduced engineering resources: Removing Apache Spark from this process brings two benefits. First, there’s no need for a software engineer to work on and maintain Spark code. Second, the deployment process is faster and can be executed by the log analytics team using standard SQL queries. As a PaaS with a shared responsibility model, BigQuery and BigQuery Omni extend that model to data in AWS and Azure.

Reduced compute resources: Apache Spark may not always offer the most cost-effective environment. An Apache Spark solution comprises multiple layers: the virtual machine (VM), the Apache Spark platform, and the application itself. In contrast, BigQuery utilizes slots (virtual CPUs, not VMs) and an export query that is converted into C-compiled code during the export process can result in faster performance for this specific task when compared to Apache Spark. 

Reduced egress costs: BigQuery Omni allows you to process data in-situ and egress only results through cross-cloud joins, avoiding the need to move raw data between cloud providers to have a consolidated view of the data.

How should you use BigQuery in this environment?
BigQuery offers a choice of two compute pricing models for running queries:

On-demand pricing (per TiB) – With this pricing model, you are charged for the number of bytes processed by each query, and the first 1 TiB of query data processed per month is free.  As log analytics tasks consume a large volume of data, we do not recommend using this model.

Capacity pricing (per slot-hour) – With this pricing model, you are instead charged for compute capacity used to run queries, measured in slots (virtual CPUs) over time. This model takes advantage of BigQuery editions. You can use the BigQuery autoscaler or purchase slot commitments, which are dedicated capacity always available for your workloads, at a lower price than on-demand.

We executed an empirical test, allocating 100 slots (baseline 0, max slots 100) to a project focused on exporting JSONL log data into compressed Parquet format. With this setup, BigQuery was able to process 1PB of data per day without consuming all 100 slots.
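For reference, a capacity setup along those lines (a zero-slot baseline that autoscales to 100 slots) could be expressed with BigQuery's reservation DDL roughly as follows. The administration project, location, reservation, and assignee names are placeholders, and the available options depend on the edition you choose:

```sql
-- Sketch only: project, location, reservation, and assignee names are placeholders.
-- Reservation with a 0-slot baseline that can autoscale up to 100 slots.
CREATE RESERVATION `admin-project.region-us.log-analytics`
OPTIONS (
  edition = 'ENTERPRISE',
  slot_capacity = 0,
  autoscale_max_slots = 100
);

-- Route the log-export project's query jobs to that reservation.
CREATE ASSIGNMENT `admin-project.region-us.log-analytics.log-export-assignment`
OPTIONS (
  assignee = 'projects/log-export-project',
  job_type = 'QUERY'
);
```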
In this blog post, we presented an architecture aimed at reducing the TCO of log analytics workloads in a multi-cloud environment by replacing Apache Spark applications with SQL queries running on BigQuery Omni. This approach helps reduce engineering, compute, and egress costs, while minimizing overall DevOps complexity, which can bring value to your unique data environment.

AI Summary and Description: Yes

Summary: The text outlines the challenges of log analytics in multi-cloud environments and presents BigQuery Omni as an innovative solution. It highlights how this architecture improves log data management, reduces costs, and simplifies processes for organizations dealing with extensive logs from various applications across multiple cloud platforms.

Detailed Description: The provided content focuses on log analytics in multi-cloud environments, emphasizing the complexity organizations face due to the volume and diversity of log data. The discussion accentuates the efficiency and cost-effectiveness of BigQuery Omni as a central tool for addressing these challenges.

* **Key Points:**
  - **Log Data Management Challenges:**
    - Companies deal with large volumes of logs from numerous applications across different platforms.
    - Multi-cloud solutions complicate log accuracy and retrieval, impacting meaningful insights.
  - **BigQuery Omni Solution:**
    - Offers a cohesive approach to log analytics, providing cost reductions compared to traditional methods.
    - Implements efficient log data collection, normalization, and storage.

* **Log Analysis Steps:**
  - **Log Data Collection:** Utilizes JSONL format and moves log data to object storage (e.g., Google Cloud Storage).
  - **Log Data Normalization:** Standardizes diverse log schemas into a unified format for efficient analysis.
  - **Indexing and Storage:** Storing logs in a compressed format (e.g., Parquet) enhances performance and cost efficiency.
  - **Querying and Visualization:** Allows for analytics to detect anomalies and optimize security measures.
  - **Data Lifecycle Management:** Important for managing costs as log data ages and is archived or deleted based on utility.

* **Common Architecture vs. Improved Architecture:**
  - **Common Architecture:** Presents pros like easy data lifecycle management and low egress costs but includes cons such as complex normalization processes and querying issues across clouds.
  - **Improved Architecture with BigQuery Omni:**
    - Simplifies normalization processes and enhances querying capabilities with cross-cloud joins.
    - Reduces Total Cost of Ownership (TCO) through:
      - Lower engineering resource needs by minimizing Apache Spark dependencies.
      - Greater computational resource efficiency versus traditional solutions.
      - Decreased egress costs by processing data in its original location.

* **Pricing Models:** Discusses two pricing structures for BigQuery:
  - **On-demand pricing:** Charges based on bytes processed per query.
  - **Capacity pricing:** Charges based on slots allocated for running queries, promoting cost efficiency.

In conclusion, the proposed architecture not only enhances the efficiency of log analytics in multi-cloud settings but also significantly decreases complexity and operational costs for enterprises leveraging such an environment. The insights provided are critical for security and compliance professionals, as effectively managing log data is integral to maintaining robust security postures and compliance with regulations.