Tomasz Tunguz: The SQL Gap

Source URL: https://www.tomtunguz.com/spider-2-benchmark-trends/
Source: Tomasz Tunguz
Title: The SQL Gap

Feedly Summary: GPT-5 achieves 94.6% accuracy on AIME 2025, suggesting near-human mathematical reasoning.
Yet ask it to query your database, and success rates plummet to the teens.
The Spider 2.0 benchmarks reveal a yawning gap in AI capabilities. Spider 2.0 is a comprehensive text-to-SQL benchmark that tests AI models’ ability to generate accurate SQL queries from natural language questions across real-world databases.
While large language models have conquered knowledge work in mathematics, coding, and reasoning, text-to-SQL remains stubbornly difficult.

The three Spider 2.0 benchmarks test real-world database querying across different environments. Spider 2.0-Snow uses Snowflake databases with 547 test examples, peaking at 59.05% accuracy.
Spider 2.0-Lite spans BigQuery, Snowflake, and SQLite with another 547 examples, reaching only 37.84%. Spider 2.0-DBT tests code generation against DuckDB with 68 examples, topping out at 39.71%.
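To make the task concrete, here is a minimal sketch of what a text-to-SQL test case involves: a natural-language question, a gold SQL query, and a check that a model's generated query produces the same result set when executed. This is an illustrative toy (the table, queries, and `execution_match` helper are hypothetical), not Spider 2.0's actual harness, which runs against full enterprise schemas:

```python
import sqlite3

# Toy database standing in for a real-world schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, region TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'EMEA', 120.0), (2, 'APAC', 80.0),
                              (3, 'EMEA', 200.0);
""")

question = "What is the total order amount per region?"
gold_sql = "SELECT region, SUM(amount) FROM orders GROUP BY region"
# A hypothetical model output: semantically equivalent, spelled differently.
predicted_sql = ("SELECT region, SUM(amount) AS total FROM orders "
                 "GROUP BY region ORDER BY region")

def execution_match(conn, gold, pred):
    """Execution accuracy: do both queries return the same rows
    (order-insensitive)? Invalid SQL counts as a miss."""
    try:
        gold_rows = sorted(conn.execute(gold).fetchall())
        pred_rows = sorted(conn.execute(pred).fetchall())
    except sqlite3.Error:
        return False
    return gold_rows == pred_rows

print(execution_match(conn, gold_sql, predicted_sql))  # True
```

Execution-based scoring like this is why partial credit is rare: a query that is almost right, but joins the wrong table or filters the wrong column, returns the wrong rows and scores zero.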
This performance gap isn’t for lack of trying. Since November 2024, 56 submissions from 12 model families have competed on these benchmarks.
Claude, OpenAI, DeepSeek, and others have all pushed their models against these tests. Progress has been steady, climbing from roughly 2% to about 60% over the past nine months.
The puzzle deepens when you consider SQL’s constraints. SQL’s vocabulary is tiny compared to English, with its 600,000 words, or to general-purpose programming languages, with their far broader syntaxes and libraries to learn. And there’s plenty of SQL in the wild to train on.
If anything, this should be easier than the open-ended reasoning tasks where models now excel.
Yet even perfect SQL generation wouldn’t solve the real business challenge. Every company defines “revenue” differently.
Marketing measures customer acquisition cost by campaign spend, sales calculates it using account executive costs, and finance includes fully-loaded employee expenses. These semantic differences create confusion that technical accuracy can’t resolve.
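The divergence is easy to reproduce: run three departments' definitions of customer acquisition cost over the same data and you get three different answers, all technically correct. A toy sketch with hypothetical tables and numbers:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE costs (team TEXT, category TEXT, amount REAL);
    INSERT INTO costs VALUES
        ('marketing', 'campaign_spend',    50000.0),
        ('sales',     'ae_compensation',   90000.0),
        ('finance',   'benefits_overhead', 40000.0);
    CREATE TABLE customers (id INTEGER);
    INSERT INTO customers VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10);
""")

new_customers = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]

# Marketing: campaign spend only.
marketing_cac = conn.execute(
    "SELECT SUM(amount) FROM costs WHERE category = 'campaign_spend'"
).fetchone()[0] / new_customers

# Sales: campaign spend plus account-executive compensation.
sales_cac = conn.execute(
    "SELECT SUM(amount) FROM costs "
    "WHERE category IN ('campaign_spend', 'ae_compensation')"
).fetchone()[0] / new_customers

# Finance: fully loaded -- every cost category.
finance_cac = conn.execute(
    "SELECT SUM(amount) FROM costs"
).fetchone()[0] / new_customers

print(marketing_cac, sales_cac, finance_cac)  # 5000.0 14000.0 18000.0
```

Each query is syntactically perfect SQL; none of them is "the" CAC. Deciding which one the question "what's our CAC?" actually means is the semantic layer no text-to-SQL model can infer from the schema alone.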
The Spider 2.0 results point to a fundamental truth about data work. Technical proficiency in SQL syntax is just the entry point.
The real challenge lies in business context: understanding what the data means, how different teams define metrics, and when edge cases matter. As I wrote about in Semantic Cultivators, the bridge between raw data and business meaning requires human judgment that current AI can’t replicate.

AI Summary and Description: Yes

Summary: The text discusses the capabilities and limitations of GPT-5 and other large language models in executing complex SQL queries, highlighting a significant performance gap in AI’s ability to translate natural language requests into SQL commands. This has implications for professionals in AI security and data governance as they seek to understand the technology’s current constraints and the necessity for human oversight in data interpretation.

Detailed Description:
The passage delves into the performance metrics of GPT-5 and its rivals in the context of the Spider 2.0 benchmarks, which assess how well AI models can generate SQL queries from natural language questions. Here are the key points presented:

– **GPT-5’s High Accuracy**: Achieves 94.6% accuracy in mathematical reasoning tasks, reflecting advanced AI capabilities in certain domains.

– **Poor Performance in SQL Generation**: The success rate for querying databases drops significantly when asked to convert natural language to SQL commands, with accuracy plummeting to the teens.

– **Spider 2.0 Benchmark Framework**:
  – Evaluates AI models’ text-to-SQL capabilities across various real-world databases.
  – The benchmarks include:
    – **Spider 2.0-Snow**: Tests against Snowflake databases, achieving a peak accuracy of 59.05%.
    – **Spider 2.0-Lite**: Spans a mix of databases (BigQuery, Snowflake, SQLite) with a maximum accuracy of 37.84%.
    – **Spider 2.0-DBT**: Tests code generation against DuckDB with a top accuracy of 39.71%.

– **Competing Models**: Since November 2024, model families from Anthropic (Claude), OpenAI, DeepSeek, and others have made 56 submissions, demonstrating incremental progress from approximately 2% to 60% over nine months.

– **The Puzzle of SQL’s Constraints**:
  – SQL’s vocabulary is limited compared to English, which should in principle make generation easier, not harder.
  – Despite the abundance of SQL data available for training, text-to-SQL remains stubbornly more difficult than the open-ended reasoning tasks where models now excel.

– **Business Context Over Technical Proficiency**:
  – Even accurate SQL generation cannot address the complexities of business metric definitions.
  – Different departments define key metrics (like revenue and customer acquisition cost) in varied ways, leading to potential misinterpretations and challenges in aligning technical accuracy with business requirements.

– **Human Judgment is Crucial**: There is an emphasis on the necessity for human oversight in interpreting raw data, as understanding business nuances and contextual differences cannot currently be replicated by AI.

In conclusion, while advancements in AI like GPT-5 showcase significant progress on computational tasks, the persistent difficulty of generating correct SQL highlights an essential gap, particularly regarding data interpretation and the understanding of business context. This insight is crucial for AI and data professionals in refining their approaches to leveraging AI technologies effectively while ensuring compliance and security in data operations.