Cloud Blog: Spanner columnar engine: Powering next-generation analytics on operational data

Source URL: https://cloud.google.com/blog/products/databases/spanners-columnar-engine-unites-oltp-and-analytics/
Source: Cloud Blog
Title: Spanner columnar engine: Powering next-generation analytics on operational data

Feedly Summary: For years, organizations have struggled with the workload conflict between online transaction processing (OLTP) and analytical query processing. OLTP systems such as Spanner are optimized for high-volume, low-latency transactions, and use row-oriented storage that’s efficient for individual record access. Analytical workloads, conversely, require rapid aggregations and scans across large datasets. These tasks are traditionally handled by separate data warehouses that employ columnar storage and incoming data pipelines from transaction systems. Separating OLTP and analytical workflows requires periodic data transfers, which often leads to stale data, complex ETL pipelines, and operational overhead.
Today, we’re thrilled to announce Spanner columnar engine, which brings new analytical capabilities directly to Spanner databases. Just as AlloyDB’s columnar engine enhanced PostgreSQL analytics, Spanner’s new columnar engine lets you analyze vast amounts of operational data in real-time, all while maintaining Spanner’s global consistency, high availability, and strong transactional guarantees — and without impacting transactional workloads.
The power of Spanner columnar engine helps organizations, such as Verisoul.ai, eliminate the problem of data silos typically found when combining high volume transaction systems with fast analytics. “Detecting fraud in real time is only half the story—showing customers the ‘why’ helps them act faster and turn trust into measurable ROI,” says Raine Scott and Niel Ketkar, founders of Verisoul.ai, a machine-learning platform that stops fake users and fraud. “Spanner’s new columnar engine allows high-velocity transactional writes and rich analytics in one place, eliminating data copies and replication lag so customers get instant answers.” 
Columnar storage meets vectorized execution

Figure: Spanner columnar engine architecture

The heart of the Spanner columnar engine is its innovative architecture, which combines columnar storage with vectorized query execution.
Columnar Storage in Spanner: Hybrid Architecture
Unlike traditional row-oriented storage, where an entire row is stored contiguously, columnar storage stores data column by column. This offers several advantages for analytical workloads:

Reduced I/O: Analytical queries often access only a few columns at a time. With columnar storage, only the relevant columns need to be read from disk, significantly reducing I/O operations.

Improved compression: Data within a single column is typically of the same data type and often exhibits similar storage patterns, leading to much higher compression ratios. This means more data can fit in memory and fewer bytes need to be read.

Efficient scans: When scanning a column, consecutive values can be processed together, for more efficient data processing.

Spanner columnar engine integrates a columnar format alongside its existing row-oriented storage. This unified transactional and analytical processing design allows Spanner to maintain its OLTP performance while accelerating analytical queries up to 200X on your live operational data.
Vectorized execution: turbocharging your queries
To complement columnar storage, the columnar engine makes use of Spanner’s vectorized execution capabilities. While traditional query engines process data tuple-by-tuple (row by row), a vectorized engine processes data in batches (vectors) of rows. This approach dramatically improves CPU utilization, with:

Reduced function call overhead: Instead of calling a function for each individual row, vectorized engines call functions once for an entire batch, significantly reducing overhead.

Optimized memory access: Vectorized processing often results in more cache-friendly memory access patterns, further boosting performance.

The combination of columnar storage and vectorized execution means that analytical queries on Spanner can run orders of magnitude faster, allowing for real-time insights on your global-scale data.
Better with BigQuery: Accelerating federated queries
The Spanner columnar engine takes its integration with Google’s Data Cloud ecosystem a step further, specifically enhancing integrations between Spanner and BigQuery. For enterprises that leverage BigQuery for data warehousing and analytics, federating queries directly to Spanner has always been a valuable capability. Now, with the Spanner columnar engine, this integration becomes even more potent, by delivering faster insights on operational data.
Data Boost, Spanner’s fully managed, elastically scalable compute service for analytical workloads, is at the forefront of this acceleration. When BigQuery issues a federated query to Spanner, and that query can benefit from columnar scans and vectorized execution, Data Boost automatically leverages the Spanner columnar engine. This provides:

Faster analytical insights: Complex analytical queries initiated from BigQuery that target your Spanner data execute significantly faster, bringing near-real-time operational data into your broader analytical landscape.

Reduced impact on OLTP: Data Boost helps ensure that analytical workloads are offloaded from your primary Spanner compute resources, preventing impact on transactional operations.

Simplified data architecture: You can get the best of both worlds – Spanner’s transactional consistency and BigQuery’s analytical prowess – without the need for complex ETL pipelines to duplicate data.

This integration empowers data analysts and scientists to combine Spanner’s live operational data with other datasets in BigQuery for richer, more timely insights and decision-making.
Columnar engine in action: Accelerating your analytical queries
Let’s look at some sample queries that should see significant acceleration with the Spanner columnar engine. These types of queries, common in analytical and graph workloads, benefit from columnar scans and vectorized processing.
Scenario: Imagine a large e-commerce database; for demonstration purposes, we’ll use the same schema as the TPC-H benchmark.
Query 1: Revenue from discounted shipments in a given Year
SQL

code_block
= date "1994-01-01"\r\n AND l.l_shipdate < date_add(date "1994-01-01", INTERVAL 1 year)\r\n AND l.l_discount BETWEEN 0.08 – 0.01 AND 0.08 + 0.01\r\n AND l.l_quantity < 25;’), (‘language’, ‘lang-sql’), (‘caption’, <wagtail.rich_text.RichText object at 0x3e333c7ada60>)])]>

Acceleration: This query heavily benefits from scanning only the l_shipdate, l_extendedprice, l_discount, and l_quantity columns from the lineitem table. Vectorized execution rapidly applies the date, discount, and quantity filters to identify qualifying rows.
Query 2: Total quantity of non-discounted items
SQL

code_block
<ListValue: [StructValue([(‘code’, ‘@{scan_method=columnar}\r\nSELECT\r\n sum(l_quantity)\r\nFROM\r\n lineitem\r\nWHERE\r\n l_discount = 0;’), (‘language’, ‘lang-sql’), (‘caption’, <wagtail.rich_text.RichText object at 0x3e333c325250>)])]>

Acceleration: This query heavily benefits from scanning only the l_discount and l_quantity columns from the lineitem table. Vectorized execution rapidly applies the equality filter (l_discount = 0) to identify matching rows.
Query 3: Item count and discount range for specific tax brackets..
SQL

code_block
<ListValue: [StructValue([(‘code’, ‘@{scan_method=columnar}\r\nSELECT\r\n count(*),\r\n min(l_discount),\r\n max(l_discount)\r\nFROM\r\n lineitem\r\nWHERE\r\n l_tax IN (0.01, 0.02);’), (‘language’, ”), (‘caption’, <wagtail.rich_text.RichText object at 0x3e333c325340>)])]>

Acceleration: This query benefits heavily from scanning only the l_tax and l_discount columns from the lineitem table. Vectorized execution rapidly applies the IN filter on the l_tax column to identify all matching rows.
Query 4: Scan friend relationships to find the N most connected people in the graph using Spanner Graph
GQL

code_block
<ListValue: [StructValue([(‘code’, ‘@{scan_method=columnar}\r\nGRAPH social_graph\r\nMATCH (p:Person)-[k:Knows]->(:Person)\r\nRETURN COUNT(k) AS friend_count GROUP BY p ORDER BY friend_count DESC LIMIT 10;’), (‘language’, ”), (‘caption’, <wagtail.rich_text.RichText object at 0x3e334f6545b0>)])]>

Acceleration:This query benefits heavily from scanning subgraphs and by loading only relevant columns from the graph.
Query 5: Perform a K-nearest neighbor vector similarity search to retrieve the top 10 most semantically similar embeddings with perfect recall
GQL

code_block
<ListValue: [StructValue([(‘code’, ‘@{scan_method=columnar}\r\nSELECT e.Id as key, COSINE_DISTANCE(@vector_param, e.Embedding) as distance\r\nFROM Embeddings e\r\nORDER BY distance LIMIT 10;’), (‘language’, ”), (‘caption’, <wagtail.rich_text.RichText object at 0x3e334f654e20>)])]>

Acceleration:This query benefits heavily from scanning contiguously stored vector embeddings and by loading only relevant columns from the table.
Get started with Spanner columnar engine today! 
The Spanner columnar engine is designed for businesses looking to unlock faster, deeper, real-time insights from their operational data without compromising Spanner’s foundational strengths. We are incredibly excited about the possibilities this opens up for developers and data analysts alike. We invite you to be among the first to try the Spanner columnar engine. Request access to the Preview of Spanner columnar engine today by signing up at bit.ly/spannercolumnar. We look forward to seeing what you build!

AI Summary and Description: Yes

**Summary:** The text discusses the introduction of the Spanner columnar engine, which enhances Google Cloud’s Spanner database by allowing for real-time analytical capabilities alongside high-volume transaction processing. This innovation aims to address the challenges of data silos in organizations and improve analytical performance significantly.

**Detailed Description:**
The announcement of Spanner’s new columnar engine reveals significant advancements for businesses that rely on both online transaction processing (OLTP) and analytical processing. Here are the central points:

– **Seamless Transition from OLTP to Analytical Processing:**
– Traditionally, organizations separate OLTP systems from analytical systems, leading to data silos, complex ETL processes, and lagging data updates.
– The Spanner columnar engine integrates these systems, enabling real-time analytics without compromising transactional workloads.

– **Benefits of the Columnar Engine:**
– **Unified Architecture:** Houses both row-oriented and columnar storage, retaining OLTP capabilities while enabling faster analytical queries.
– **Real-time Insights:** Organizations can perform real-time queries on operational data, crucial for applications like fraud detection, as mentioned by Verisoul.ai.

– **Key Features of Columnar Storage:**
– **Reduced I/O:** Analytical queries only access specific columns, significantly decreasing input/output operations and improving query speed.
– **Improved Compression Ratios:** Higher data compression within columns enables more efficient memory usage.
– **Efficient Scans:** Allows quick access and processing of data, speeding up analytical workloads.

– **Vectorized Execution:**
– Integrates with the columnar engine to process data in larger batches, improving CPU utilization and reducing memory overhead, which leads to dramatic performance improvements.

– **Enhanced Integration with Google’s Data Cloud:**
– The Spanner columnar engine enhances the functionality of BigQuery, allowing for faster insights directly from operational data without the need for extensive data duplication or complex pipelines.

– **Real-world Application Examples:**
– Queries targeting specific criteria in large datasets can see up to 200X performance improvements in efficiency.
– Use cases demonstrate how analytical queries related to e-commerce and fraud detection benefit from the accelerated processing capabilities of the Spanner columnar engine.

In conclusion, the introduction of the Spanner columnar engine positions Google Cloud’s Spanner as a powerful tool for organizations seeking to streamline their data workflows. By supporting both transactional and analytical workloads efficiently, this development offers substantial time and resource savings for data-driven initiatives. Security and compliance professionals should take into account these advancements as they implement data strategies that leverage real-time insights while ensuring overarching data governance and compliance standards are met.