Simon Willison’s Weblog: AI for data engineers with Simon Willison

Aug 11, 2025

—

Source URL: https://simonwillison.net/2025/Aug/11/ai-for-data-engineers/#atom-everything
Source: Simon Willison’s Weblog
Title: AI for data engineers with Simon Willison

Feedly Summary: AI for data engineers with Simon Willison
I recorded an episode last week with Claire Giordano for the Talking Postgres podcast. The topic was “AI for data engineers" but we ended up covering an enjoyable range of different topics.

How I got started programming with a Commodore 64 – the tape drive for which inspired the name Datasette
Selfish motivations for TILs (force me to write up my notes) and open source (help me never have to solve the same problem twice)
LLMs have been good at SQL for a couple of years now. Here’s how I used them for a complex PostgreSQL query that extracted alt text from my blog’s images using regular expressions
Structured data extraction as the most economically valuable application of LLMs for data work
2025 has been the year of tool calling a loop ("agentic" if you like)
Thoughts on running MCPs securely – read-only database access, think about sandboxes, use PostgreSQL permissions, watch out for the lethal trifecta
Jargon guide: Agents, MCP, RAG, Tokens
How to get started learning to prompt: play with the models and "bring AI to the table" even for tasks that you don’t think it can handle
"It’s always a good day if you see a pelican"

Tags: postgresql, ai, generative-ai, llms, podcast-appearances

AI Summary and Description: Yes

Summary: The text discusses various aspects of using AI in data engineering, focusing on its application with SQL, the use of large language models (LLMs), and secure practices in running machine learning processes. It highlights how LLMs can effectively assist in data extraction tasks, emphasizing both practical applications and security considerations in handling data.

Detailed Description: The content centers around the intersection of AI and data engineering, particularly the use of LLMs for querying and data extraction with PostgreSQL. Here are the major points elaborated upon:

– **Introduction to the Podcast Episode**: The speaker reflected on their experiences merging traditional data engineering with AI technologies, particularly in the context of a podcast episode titled “AI for data engineers.”

– **Programming Journey**: A brief history of the speaker’s introduction to programming, referencing a Commodore 64 and its impact on their development experience.

– **Use of LLMs for SQL Queries**:
– LLMs (Large Language Models) have been effective in generating and understanding SQL for tasks over the past couple of years.
– The speaker provided an example of using LLMs to extract alt text from images in their blog through a complex PostgreSQL query utilizing regular expressions.

– **Valuable Applications of LLMs**:
– **Structured Data Extraction**: The text identifies structured data extraction as a significant use case for LLMs within data engineering, highlighting its economic significance.

– **Secure Practices for Machine Learning**:
– The discussion touches on important security practices when running machine learning processes (MCPs) and databases:
– Emphasis on read-only database access to prevent unauthorized data changes.
– Utilization of sandboxes to ensure a secure testing environment.
– Attention to permissions within PostgreSQL to maintain security while allowing necessary access.

– **Jargon and Terminology**: A brief mention of jargon related to AI and data engineering, including terms like Agents, MCP (Machine Learning Controlled Processes), RAG (Retrieval-Augmented Generation), and Tokens.

– **Learning and Engagement**: Advice on learning to prompt LLMs, encouraging experimentation and interaction with AI tools even for less conventional tasks.

– **Closing Thoughts**: A light-hearted conclusion with a personal touch, indicating the speaker’s positive outlook on daily experiences.

Overall, the discussion merges technological insights with practical advice for data engineers looking to leverage AI securely and effectively, making it relevant for professionals in the fields of AI security, information security, and data management.

.NET 1 2 2025 4 5 a access Act age agent agentic agents AGI AI AI security AI technologies AI tool AI tools air Aker All alt and app appearances Application applications Argo art as at ated Augment augmented generation Box C calling centers CI co content Context control D data data engineering data engineers data extraction data management data work database databases dataset datasette day de development development experience drive e effective ELF end engagement Engineer engineering engineers environment event exp experience experimentation extraction for g Gen generation generative Go gs H handling high Highlight HR http HTTPS image in information information security insights inter interaction io Iron ite J jargon k l Labor language language model language models large large language model large language models Large Language Models (LLMs) learning learning process led lethal lethal trifecta Li llm llms lm logic loop low M mac machine Machine Learning making man management mcp mission Mode model models motivations my N no nomic notes o of on on learning only ons open OPM ory oS out Outlook over pelican per permissions play podcast point post Postgre Postgres PostgreSQL practical applications practices pre pro problem process processes professionals programming prompt ps Q queries R rag rate RCE re record red retrieval Retrieval-Augmented Generation Ro s sam sandbox sec secure secure practices security security considerations security practices self side Sig Sim source sql SSE STAR start structured structured data structured data extraction T Tags: tape Task tasks tech technological technologies ted terminology test Testing text the Thought to token tokens tool tool calling tools Tor TP trie trifecta UI under up US use uth utilization V val web Wi x yt z