Source URL: https://simonwillison.net/2025/Mar/8/cutting-edge-web-scraping/#atom-everything
Source: Simon Willison’s Weblog
Title: Cutting-edge web scraping techniques at NICAR
Feedly Summary:
Here’s the handout for a workshop I presented this morning at NICAR 2025 on web scraping, focusing on lesser-known tips and tricks that became possible only with recent developments in LLMs.
For workshops like this I like to work off an extremely detailed handout, so that people can move at their own pace or catch up later if they didn’t get everything done.
The workshop consisted of four parts:
– Building a Git scraper – an automated scraper in GitHub Actions that records changes to a resource over time
– Using in-browser JavaScript and then shot-scraper to extract useful information
– Using LLM with both OpenAI and Google Gemini to extract structured data from unstructured websites
– Video scraping using Google AI Studio
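The core of the Git scraping idea in the first part can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the workshop's own code: the function names `fetch` and `save_if_changed` are hypothetical, and the sketch only covers downloading a resource and rewriting a file when its content differs. The history itself would come from a scheduled job committing that file to Git after each run.

```python
import urllib.request
from pathlib import Path


def fetch(url: str) -> bytes:
    """Download the raw bytes of the resource being tracked."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def save_if_changed(path: Path, content: bytes) -> bool:
    """Write content to path only when it differs from the existing file.

    Returns True if the file was created or updated, False if unchanged.
    """
    if path.exists() and path.read_bytes() == content:
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(content)
    return True
```

A scheduled job could call `save_if_changed(Path("data/snapshot.json"), fetch(url))` with a URL of your choosing and run `git commit` only when the function returns `True`, so each commit records one observed change.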
I released several new tools in preparation for this workshop (I call this “NICAR Driven Development”):
– git-scraper-template template repository for quickly setting up new Git scrapers, which I wrote about here
– LLM schemas, finally adding structured schema support to my LLM tool
– shot-scraper har for archiving pages as HTTP Archive (HAR) files – though I cut this from the workshop for time
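To illustrate what schema-driven extraction looks like in general, here is a minimal sketch. It does not use the LLM tool's actual API: `complete` is a hypothetical callable standing in for whichever model client you use, and the `title` and `date` fields are made up for the example. The idea is to hand the model a JSON schema along with the unstructured text, then check that the JSON it returns has the required keys.

```python
import json
from typing import Callable

# JSON schema for the records we want back (field names are illustrative).
SCHEMA = {
    "type": "object",
    "properties": {
        "items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "date": {"type": "string"},
                },
                "required": ["title", "date"],
            },
        }
    },
    "required": ["items"],
}


def extract(text: str, complete: Callable[[str, dict], str]) -> list[dict]:
    """Ask a model for JSON matching SCHEMA and check the required keys.

    `complete` is a hypothetical function that sends the text and schema
    to a model and returns the model's raw JSON string.
    """
    data = json.loads(complete(text, SCHEMA))
    required = SCHEMA["properties"]["items"]["items"]["required"]
    for item in data["items"]:
        missing = [key for key in required if key not in item]
        if missing:
            raise ValueError(f"missing keys {missing} in {item!r}")
    return data["items"]
```

Keeping the model call behind a plain callable means the same checking code works whether the text goes to OpenAI, Gemini, or a stub used in tests.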
I also came up with a fun way to distribute API keys for workshop participants: I had Claude build me a web page where I can create an encrypted message with a passphrase, then share a URL to that page with users and give them the passphrase to unlock the encrypted message. You can try that at tools.simonwillison.net/encrypt – or use this link and enter the passphrase "demo":
Tags: shot-scraper, gemini, nicar, openai, git-scraping, ai, speaking, llms, scraping, generative-ai, claude-artifacts, ai-assisted-programming, claude
AI Summary and Description: Yes
Summary: The text describes a workshop on advanced web scraping techniques, with an emphasis on using Large Language Models (LLMs) for data extraction. It also lists the tools the author released in preparation for the workshop, which may interest professionals working in AI security and infrastructure.
Detailed Description: The text outlines a workshop presented at NICAR 2025, emphasizing the recent advancements in web scraping techniques enabled by LLMs. The workshop is segmented into distinct parts, each showcasing different scraping methodologies and tools that leverage modern AI technologies. Here are the key points:
– **Workshop Structure**: The workshop was divided into four main segments, highlighting different aspects of web scraping:
  – **Building a Git Scraper**: An automated scraper running in GitHub Actions that tracks and records changes to a resource over time.
  – **JavaScript and shot-scraper**: Using in-browser JavaScript along with the shot-scraper tool to gather useful data from websites.
  – **Leveraging LLM**: Using the author's LLM tool with OpenAI and Google Gemini models to extract structured data from unstructured websites.
  – **Video Scraping**: Using Google AI Studio to extract data from video content.
– **New Tools**: The presenter released several tools in preparation for the workshop, under the label “NICAR Driven Development”:
  – **git-scraper-template**: A template repository for quickly setting up new Git scrapers.
  – **LLM Schemas**: A new feature that adds structured schema support to the LLM tool.
  – **shot-scraper har**: A new command for archiving pages as HTTP Archive (HAR) files, though it was cut from the workshop for time.
– **Innovative API Key Distribution**: The presenter had Claude build a web page for creating a passphrase-encrypted message. Sharing the page URL and the passphrase with workshop participants let them unlock the message and retrieve their API keys.
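The source does not say how the page encrypts its message. One common pattern for passphrase-protected content is to derive a key from the passphrase with a deliberately slow key-derivation function, then use that key with an authenticated cipher. The sketch below covers only the key-derivation half, using Python's standard library. The function names, the 16-byte salt, and the iteration count are assumptions for illustration, not details of the actual page.

```python
import hashlib
import os


def new_salt() -> bytes:
    """Generate a random salt to store alongside the encrypted message."""
    return os.urandom(16)


def derive_key(passphrase: str, salt: bytes, iterations: int = 600_000) -> bytes:
    """Derive a 32-byte key from a passphrase with PBKDF2-HMAC-SHA256.

    The same passphrase and salt always yield the same key, so a recipient
    who knows the passphrase can recompute it from the stored salt.
    """
    return hashlib.pbkdf2_hmac(
        "sha256", passphrase.encode("utf-8"), salt, iterations, dklen=32
    )
```

The salt is not secret and can travel with the ciphertext. Only the passphrase needs to be shared out of band, which matches the workflow of handing participants a URL plus a passphrase.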
Overall, the content shows practical applications of LLMs in web scraping and touches on security through passphrase-encrypted message sharing, which may make it relevant to professionals in AI security, compliance, and infrastructure.