Source URL: https://cloud.google.com/blog/products/data-analytics/automate-sql-translation-databricks-to-bigquery-with-gemini/
Source: Cloud Blog
Title: Intelligent code conversion: Databricks Spark SQL to BigQuery SQL via Gemini
Feedly Summary: As data platforms evolve and businesses diversify their cloud ecosystems, the need to migrate SQL workloads between engines is becoming increasingly common. Recently, I had the opportunity to work on translating a set of Databricks SQL queries to BigQuery SQL — a task that is deceptively complex due to differences in syntax, functions, and execution behavior.
To streamline the process, we turned to Google Gemini, a powerful AI assistant, to help bridge the gap between the two SQL dialects. In this blog post, I’ll walk you through the process, challenges we faced, how Gemini helped, and key takeaways from the experience.
The translation tightrope: Why it’s tricky
To boost operational efficiency and cut costs, we migrated analytics workloads from Databricks SQL (on Delta Lake tables) to Google BigQuery. This required rewriting numerous queries, from simple aggregations to intricate CTEs and window functions.
Databricks, with its powerful Spark SQL capabilities, and BigQuery, a serverless and highly scalable data warehouse, are both titans in the data world. However, their SQL dialects, while sharing common ANSI SQL foundations, have distinct variations. Translating between the two manually was possible, but would have been time-consuming and error-prone. This is where Google Gemini played a crucial role.
Below are some of the data type mappings between Databricks and BigQuery:

| Category | Databricks SQL Data Type | BigQuery SQL Data Type | Description |
|---|---|---|---|
| Integer Types | TINYINT | INT64 | 8-bit integer |
| | SMALLINT | INT64 | 16-bit integer |
| | INT or INTEGER | INT64 | 32-bit integer |
| | BIGINT | INT64 | 64-bit integer |
| Floating-Point Types | FLOAT or REAL | FLOAT64 | 32-bit floating point, widened to BigQuery's 64-bit FLOAT64 |
| | DOUBLE | FLOAT64 | 64-bit floating point (equivalent to BigQuery's FLOAT64) |
| Decimal/Exact Types | DECIMAL or NUMERIC | NUMERIC or BIGNUMERIC | Fixed-point decimal with user-defined precision and scale. BigQuery offers the extended BIGNUMERIC for larger precision. |
| Boolean Types | BOOLEAN | BOOL | True or false |
| String Types | STRING or VARCHAR | STRING | Variable-length string |
| | CHAR | Not supported | Fixed-length strings are not directly supported in BigQuery; use STRING instead. |
| Date and Time Types | DATE | DATE | Calendar date (year, month, day) |
| | TIMESTAMP | TIMESTAMP | Timestamp with time zone information |
| | DATETIME | DATETIME | Timestamp without time zone |
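Mappings like these are mechanical enough to apply as a post-processing step. As a minimal sketch (not from the original post; the dictionary, the `map_types` helper, and the sample DDL are invented for illustration), a lookup table plus a word-boundary regex covers the cases above:

```python
import re

# Hypothetical lookup distilled from the mapping table above.
# Types that are identical in both dialects (STRING, DATE, TIMESTAMP, DATETIME) are omitted.
DATABRICKS_TO_BQ_TYPES = {
    "TINYINT": "INT64", "SMALLINT": "INT64", "INT": "INT64", "INTEGER": "INT64",
    "BIGINT": "INT64", "FLOAT": "FLOAT64", "REAL": "FLOAT64", "DOUBLE": "FLOAT64",
    "DECIMAL": "NUMERIC",   # use BIGNUMERIC instead when precision exceeds NUMERIC's range
    "BOOLEAN": "BOOL",
    "VARCHAR": "STRING",
    "CHAR": "STRING",       # no fixed-length string type in BigQuery
}

_TYPE_PATTERN = re.compile(r"\b(" + "|".join(DATABRICKS_TO_BQ_TYPES) + r")\b", re.IGNORECASE)

def map_types(ddl: str) -> str:
    """Swap Databricks column types for their BigQuery equivalents in a DDL string."""
    return _TYPE_PATTERN.sub(lambda m: DATABRICKS_TO_BQ_TYPES[m.group(1).upper()], ddl)

print(map_types("CREATE TABLE t (id BIGINT, price DECIMAL(10,2), active BOOLEAN)"))
# -> CREATE TABLE t (id INT64, price NUMERIC(10,2), active BOOL)
```

A regex pass like this handles DDL, but function and syntax differences (below) need more than token substitution, which is where Gemini comes in.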
Syntax differences between Databricks and BigQuery

FIRST_VALUE:

Databricks

```sql
first_value(expr [, ignoreNull]) [ FILTER ( WHERE cond ) ] OVER over_clause
```

BigQuery

```sql
FIRST_VALUE (value_expression [{RESPECT | IGNORE} NULLS])
OVER over_clause

over_clause:
  { named_window | ( [ window_specification ] ) }

window_specification:
  [ named_window ]
  [ PARTITION BY partition_expression [, ...] ]
  ORDER BY expression [ { ASC | DESC } ] [, ...]
  [ window_frame_clause ]
```
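Note how the null-handling option moves from a function argument in Databricks to an inline modifier in BigQuery. As a hypothetical example pair of the kind we fed to Gemini (the sales table and its columns are invented):

```python
# Hypothetical before/after pair illustrating the FIRST_VALUE dialect gap.
databricks_sql = """
SELECT first_value(price, true) OVER (PARTITION BY sku ORDER BY ts) AS first_price
FROM sales
"""

bigquery_sql = """
SELECT FIRST_VALUE(price IGNORE NULLS) OVER (PARTITION BY sku ORDER BY ts) AS first_price
FROM sales
"""
```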
In particular, working with H3 geospatial functions can often present unique translation hurdles. Our resources provide clear mappings, like these:
| Databricks Function | BigQuery Equivalent Function | Description |
|---|---|---|
| h3_boundaryasgeojson(h3CellIdExpr) | ST_ASGEOJSON(jslibs.h3.ST_H3_BOUNDARY(h3CellIdExpr)) | Returns the polygonal boundary of the input H3 cell in GeoJSON format. |
| h3_boundaryaswkb(h3CellIdExpr) | ST_ASBINARY(jslibs.h3.ST_H3_BOUNDARY(h3CellIdExpr)) | Returns the polygonal boundary of the input H3 cell in WKB format. |
| h3_boundaryaswkt(h3CellIdExpr) | ST_ASTEXT(jslibs.h3.ST_H3_BOUNDARY(h3CellIdExpr)) | Returns the polygonal boundary of the input H3 cell in WKT format. |
Providing precise details for complex functions like these is crucial. In fact, we’ve found that by detailing these H3 translations, even advanced AI models like Gemini can generate more accurate and reliable BigQuery SQL from your original Databricks Spark SQL, ensuring your geospatial analyses remain intact.
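For instance, instantiating the first mapping above on a hypothetical table (the cells table and cell_id column are invented, and the BigQuery side assumes the community jslibs.h3 UDFs from the mapping table are accessible in your project):

```python
# Hypothetical H3 translation pair based on the mapping table above.
databricks_sql = "SELECT h3_boundaryasgeojson(cell_id) FROM cells"

# Wraps the community jslibs.h3 UDF in BigQuery's native ST_ASGEOJSON.
bigquery_sql = "SELECT ST_ASGEOJSON(jslibs.h3.ST_H3_BOUNDARY(cell_id)) FROM cells"
```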
Architecture overview
Before diving into the translation logic, let me show you how the pieces fit together.
Pipeline components
- **Source SQL storage**: All original Databricks SQL files were stored in Google Cloud Storage.
- **Function mapping guide**: A curated guide that maps Databricks-specific SQL functions (e.g., First_value, UCase, etc.) to their BigQuery equivalents (FIRST_VALUE, UPPER, TIMESTAMP, etc.). This guide included examples and syntax rules, which were used as input context for Gemini.
- **Few-shot examples**: I selected a set of hand-translated queries to serve as high-quality in-prompt examples that improved Gemini's consistency.
- **Retrieval-Augmented Generation (RAG) layer**: Before querying Gemini, I leveraged the Vertex AI RAG Engine to retrieve the relevant function mappings and example translations, grounding Gemini and improving the accuracy of its output.
- **Gemini API integration**: The RAG-enriched prompt was sent to Gemini for translation, and the returned SQL was optionally post-processed to fix edge cases.
- **Validation layer**: Translated SQL queries were validated by executing them in BigQuery's dry run mode to detect syntax issues. A minimal end-to-end sketch of these components appears below.
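To make the flow concrete, here is a minimal sketch of the pipeline, not the exact implementation from this migration. It assumes a Vertex AI RAG corpus already exists; the project ID, region, corpus path, and model name are placeholders, error handling is omitted, and the preview RAG API surface may differ across SDK versions:

```python
from google.cloud import bigquery
import vertexai
from vertexai.generative_models import GenerativeModel
from vertexai.preview import rag

vertexai.init(project="my-project", location="us-central1")  # placeholder project/region

def translate_query(databricks_sql: str, corpus: str) -> str:
    """RAG-enriched Databricks SQL -> BigQuery SQL translation (illustrative)."""
    # 1. Retrieve relevant function mappings and few-shot pairs from the RAG corpus.
    retrieval = rag.retrieval_query(
        rag_resources=[rag.RagResource(rag_corpus=corpus)],  # placeholder corpus resource
        text=databricks_sql,
        similarity_top_k=5,
    )
    context = "\n".join(ctx.text for ctx in retrieval.contexts.contexts)

    # 2. Build the prompt: grounding context, then the query to translate.
    prompt = (
        "Translate the following Databricks Spark SQL to BigQuery SQL.\n"
        f"Use these function mappings and example translations:\n{context}\n\n"
        f"Databricks SQL:\n{databricks_sql}\n\nBigQuery SQL:"
    )

    # 3. Send the RAG-enriched prompt to Gemini.
    model = GenerativeModel("gemini-1.5-pro")  # placeholder model name
    return model.generate_content(prompt).text

def dry_run(sql: str) -> int:
    """Validate translated SQL with a BigQuery dry run; returns bytes that would be scanned."""
    client = bigquery.Client()
    job = client.query(
        sql,
        job_config=bigquery.QueryJobConfig(dry_run=True, use_query_cache=False),
    )
    return job.total_bytes_processed  # a syntax error raises google.api_core BadRequest
```

The dry run is the cheap safety net: it never scans data, but it surfaces syntax errors and reports how many bytes the translated query would process.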
Architecture diagram
Lessons learned
- **RAG + Gemini = smart SQL translation**: Grounding Gemini with real-world examples and mapping logic made it significantly more accurate.
- **A comprehensive function mapping guide is essential**: Invest time in building a robust function mapping resource.
- **Thorough validation is key**: Use BigQuery's dry run and INFORMATION_SCHEMA to ensure translated queries are safe and optimized; an example audit query follows.
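One way to act on that last point (a sketch; the region qualifier, lookback window, and row limit are placeholders) is to audit the cost profile of translated queries after they run:

```python
from google.cloud import bigquery

client = bigquery.Client()

# INFORMATION_SCHEMA job views are region-qualified; `region-us` is a placeholder.
audit_sql = """
SELECT query, total_bytes_processed, total_slot_ms
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
  AND job_type = 'QUERY'
ORDER BY total_bytes_processed DESC
LIMIT 20
"""

# Flag the most expensive translated queries for manual optimization.
for row in client.query(audit_sql).result():
    print(row.query[:80], row.total_bytes_processed, row.total_slot_ms)
```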
Ready to streamline your SQL migrations?
Stop wrestling with SQL syntax and start leveraging the power of your data, wherever it resides. With the Gemini model, we can streamline your Databricks Spark SQL to BigQuery SQL translation process, making it faster, more reliable, and far less painful.
Dive in and accelerate your journey to cross-platform data success.
AI Summary and Description: Yes
Summary: The text discusses the challenges and solutions involved in migrating SQL workloads between Databricks SQL and Google BigQuery using the Google Gemini AI assistant. It highlights key differences in SQL dialects and emphasizes the importance of function mapping and validation in the migration process.
Detailed Description:
The provided text focuses on the increasingly common practice of migrating SQL workloads across different cloud platforms, specifically from Databricks to Google BigQuery. It emphasizes the challenges posed by the distinct syntax and functions between the two SQL engines and explains how the Google Gemini AI assistant aids in this complex task.
Key Points:
– **Context of Migration**:
– Businesses are diversifying cloud ecosystems, increasing the necessity to migrate SQL workloads between engines.
– The case study involves translating Databricks SQL queries to BigQuery SQL, which involves engaging with diverse SQL dialects.
– **Role of Google Gemini**:
– Gemini assists in transforming SQL queries, which can be complex due to differences in syntax, functions, and execution behavior.
– The text suggests that manual translation is prone to error and inefficiencies, making AI assistance valuable.
– **Data Type Mappings**:
– Provides a detailed comparison of various data types between Databricks and BigQuery, including integers, floating points, decimals, booleans, strings, and date/time types.
– **Function Translation Example**:
– Highlights specific examples where Databricks functions have equivalent functions in BigQuery, emphasizing the need for accurate function translation for complex queries like those involving geospatial functions.
– **Architecture for SQL Translation**:
– The process involves several components:
– Source SQL storage in Google Cloud Storage.
– A curated mapping guide for translating SQL functions.
– Use of few-shot examples to guide the model.
– Integration with Google’s Vertex AI RAG (Retrieval-Augmented Generation) to enhance translation accuracy.
– A validation layer utilizing BigQuery’s dry run mode to check for syntax correctness.
– **Lessons Learned**:
– The combination of RAG and Gemini is particularly effective for SQL translation.
– Building a comprehensive function mapping guide is essential for successful migration.
– Ensuring thorough validation of translated queries is vital for maintaining robustness and performance.
The insights from this process are crucial for professionals in cloud computing and data management, especially those involved in database migrations, AI utilization in data manipulation, and ensuring data integrity during cross-platform transitions. The text serves as a resourceful guideline for mitigating migration challenges and optimizing workflows in cloud ecosystems.