Simon Willison’s Weblog: How OpenElections Uses LLMs

Source URL: https://simonwillison.net/2025/Jun/19/how-openelections-uses-llms/#atom-everything
Source: Simon Willison’s Weblog
Title: How OpenElections Uses LLMs

Feedly Summary:
The OpenElections project collects detailed election data for the USA, all the way down to the precinct level. This is a surprisingly hard problem: while county and state-level results are widely available, precinct-level results are published in thousands of different ad-hoc ways and rarely aggregated once the election result has been announced.
A lot of those precinct results are published as image-filled PDFs.
Derek Willis has recently started leaning on Gemini to help parse those PDFs into CSV data:

For parsing image PDFs into CSV files, Google’s Gemini is my model of choice, for two main reasons. First, the results are usually very, very accurate (with a few caveats I’ll detail below), and second, Gemini’s large context window means it’s possible to work with PDF files that can be multiple MBs in size.

In this piece he shares the process and prompts for a real-world, expert-level data entry project assisted by Gemini.
This example from Limestone County, Texas is a great illustration of how tricky this problem can get. Getting traditional OCR software to correctly interpret multi-column layouts like this always requires some level of manual intervention:

Derek’s prompt against Gemini 2.5 Pro throws in an example, some special instructions and a note about the two column format:

Produce a CSV file from the attached PDF based on this example:
county,precinct,office,district,party,candidate,votes,absentee,early_voting,election_day
Limestone,Precinct 101,Registered Voters,,,,1858,,,
Limestone,Precinct 101,Ballots Cast,,,,1160,,,
Limestone,Precinct 101,President,,REP,Donald J. Trump,879,,,
Limestone,Precinct 101,President,,DEM,Kamala D. Harris,271,,,
Limestone,Precinct 101,President,,LIB,Chase Oliver,1,,,
Limestone,Precinct 101,President,,GRN,Jill Stein,4,,,
Limestone,Precinct 101,President,,,Write-ins,1,,,
Skip Write-ins with candidate names and rows with "Cast Votes", "Not Assigned", "Rejected write-in votes", "Unresolved write-in votes" or "Contest Totals". Do not extract any values that end in "%"
Use the following offices:
President/Vice President -> President
United States Senator -> U.S. Senate
US Representative -> U.S. House
State Senator -> State Senate
Quote all office and candidate values. The results are split into two columns on each page; parse the left column first and then the right column.

After a spot-check and a few manual tweaks, the result against a 42-page PDF was exactly what was needed.
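Part of a spot-check like this can be automated. Here is a minimal sketch, assuming the model's output is available as a string of CSV text. The function name `check_csv` and the choice to look for skipped labels in both the office and candidate columns are my own assumptions, not something from Derek's write-up. It tests the output against the rules stated in the prompt: the header from the example, the rows to skip, the office renames and the ban on values ending in "%".

```python
import csv
import io

# Header taken from the example in the prompt.
EXPECTED_HEADER = [
    "county", "precinct", "office", "district", "party",
    "candidate", "votes", "absentee", "early_voting", "election_day",
]

# Row labels the prompt asks the model to skip.
SKIPPED_LABELS = {
    "Cast Votes", "Not Assigned", "Rejected write-in votes",
    "Unresolved write-in votes", "Contest Totals",
}

# Source office names that the prompt says should have been renamed.
UNMAPPED_OFFICES = {
    "President/Vice President", "United States Senator",
    "US Representative", "State Senator",
}

COUNT_COLUMNS = ["votes", "absentee", "early_voting", "election_day"]


def check_csv(text):
    """Return a list of human-readable problems found in the model output."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != EXPECTED_HEADER:
        return [f"unexpected header: {reader.fieldnames}"]
    problems = []
    # Line numbers assume one record per line (no multi-line quoted fields).
    for line_no, row in enumerate(reader, start=2):
        # Assumption: a label that should be skipped could land in either column.
        if row["office"] in SKIPPED_LABELS or row["candidate"] in SKIPPED_LABELS:
            problems.append(f"line {line_no}: row should have been skipped")
        if row["office"] in UNMAPPED_OFFICES:
            problems.append(f"line {line_no}: office name not normalized")
        for col in COUNT_COLUMNS:
            value = row[col] or ""  # short rows yield None for missing cells
            # Catches percentages such as "75.78%" as well as other non-counts.
            if value and not value.isdigit():
                problems.append(
                    f"line {line_no}: {col}={value!r} is not a whole number"
                )
    return problems
```

An empty list means the output passed these mechanical checks. It says nothing about whether the numbers themselves were read correctly, so it complements a manual comparison against the PDF rather than replacing it.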
How about something harder? The results for Cameron County came as more than 600 pages and looked like this – note the hole-punch holes that obscure some of the text!

This file had to be split into chunks of 100 pages each, and the entire process still took a full hour of work – but the resulting table matched up with the official vote totals.
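The bookkeeping around a chunked run like this is simple to sketch. The following is an illustration only, not Derek's actual workflow: `page_chunks` and `candidate_totals` are hypothetical helper names, the 642-page count in the usage note is a made-up stand-in for "more than 600 pages", and the code assumes each chunk's output uses the CSV header from the Limestone example. It works out which page ranges to send and then sums the votes across the chunk outputs so they can be compared with the official totals.

```python
import csv
import io
from collections import Counter


def page_chunks(total_pages, chunk_size=100):
    """Return inclusive 1-based (first, last) page ranges covering the file."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size >= 1")
    return [
        (first, min(first + chunk_size - 1, total_pages))
        for first in range(1, total_pages + 1, chunk_size)
    ]


def candidate_totals(csv_texts):
    """Sum the votes column per (office, candidate) across chunk outputs."""
    totals = Counter()
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            votes = (row.get("votes") or "").strip()
            # Ignore blank or non-numeric cells rather than guessing a value.
            if votes.isdigit():
                totals[(row.get("office"), row.get("candidate"))] += int(votes)
    return totals
```

For a hypothetical 642-page file, `page_chunks(642)` yields seven ranges, the last one covering pages 601 to 642. The dictionary returned by `candidate_totals` can then be compared entry by entry against figures typed in from the county's official results.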
I love how realistic this example is. AI data entry like this isn’t a silver bullet – there’s still a bunch of work needed to verify the results and creative thinking needed to work through limitations – but it represents a very real improvement in how small teams can take on projects of this scale.

In the six weeks since we started working on Texas precinct results, we’ve been able to convert them for more than half of the state’s 254 counties, including many image PDFs like the ones on display here. That pace simply wouldn’t be possible with data entry or traditional OCR software.

Via Hacker News
Tags: data-journalism, derek-willis, ocr, ai, generative-ai, llms, gemini, vision-llms, structured-extraction

AI Summary and Description: Yes

**Summary:** The text describes how the OpenElections project uses large language models (LLMs), specifically Google’s Gemini, to convert precinct-level election results from image PDFs into structured CSV files. The results are usually very accurate and the approach speeds up data entry considerably, but the output still needs spot-checks and manual corrections.

**Detailed Description:**
The OpenElections project aims to collect and standardize detailed election results across the United States, focusing specifically on the precinct level, where data is often poorly formatted and difficult to aggregate. Derek Willis shares how he employs Google’s Gemini model to streamline the historically cumbersome task of converting image PDFs into usable CSV data. Here are the major points from the text:

– **Challenge of Election Data Processing:**
– Precinct-level results are published in thousands of different ad-hoc ways, many of them image-filled PDFs, and are rarely aggregated after the election result has been announced.
– Traditional Optical Character Recognition (OCR) tools struggle with complex layouts, especially in multi-column formats typically found in precinct election results.

– **Gemini’s Advantages:**
– Results are usually very accurate, and the large context window makes it possible to work with PDF files that are multiple MBs in size.
– Detailed prompts, including an example of the target CSV and a note about the page layout, guide the extraction.

– **Real-World Application:**
– Willis provides a practical example using data from Limestone County, demonstrating the structure of CSVs required and the intricacies involved in the prompts used.
– He presents detailed instructions regarding which data to extract and how to handle various edge cases, highlighting the importance of human oversight in the process.

– **Scalability and Efficiency:**
– In six weeks, the team converted precinct results for more than half of Texas’s 254 counties, a pace that would not have been possible with manual data entry or traditional OCR software.
– The example emphasizes that while AI greatly enhances efficiency, it is not a complete solution; manual adjustments and expert verification remain crucial for accuracy.

– **Conclusion:**
– The use of AI for data entry in complex scenarios like election results is a significant step forward. While it doesn’t eliminate all the labor involved, it offers a valuable tool for small teams, enabling them to tackle larger projects that would be unmanageable with conventional systems.

**Implications for Security and Compliance Professionals:**
– Understanding the nuances of AI implementations in data extraction tasks, especially those involving sensitive information like election data, is crucial.
– Data processing and storage need security measures that ensure compliance with privacy regulations.
– The discussion serves as a reminder of the ongoing need for human-in-the-loop frameworks to prevent errors and verify output in contexts where data integrity is critical.