Source URL: https://simonwillison.net/2025/Jun/24/anthropic-training/#atom-everything
Source: Simon Willison’s Weblog
Title: Anthropic wins a major fair use victory for AI — but it’s still in trouble for stealing books
Major US legal news for the AI industry today. Judge William Alsup released a “summary judgement” (a legal decision that results in some parts of a case skipping a trial) in a lawsuit between three authors and Anthropic concerning the use of their books in training data.
The judgement itself is a very readable 32-page PDF, and contains all sorts of interesting behind-the-scenes details about how Anthropic trained their models.
The facts of the complaint go back to the very beginning of the company. Anthropic was founded by a group of ex-OpenAI researchers in February 2021. According to the judgement:
So, in January or February 2021, another Anthropic cofounder, Ben Mann, downloaded Books3, an online library of 196,640 books that he knew had been assembled from unauthorized copies of copyrighted books — that is, pirated. Anthropic’s next pirated acquisitions involved downloading distributed, reshared copies of other pirate libraries. In June 2021, Mann downloaded in this way at least five million copies of books from Library Genesis, or LibGen, which he knew had been pirated. And, in July 2022, Anthropic likewise downloaded at least two million copies of books from the Pirate Library Mirror, or PiLiMi, which Anthropic knew had been pirated.
Books3 was also listed as part of the training data for Meta’s LLaMA!
Anthropic apparently used these sources of data to help build an internal "research library" of content that they then filtered and annotated and used in training runs.
Books turned out to be a very valuable component of the "data mix" to train strong models. By 2024 Anthropic had a new approach to collecting them: purchase and scan millions of print books!
To find a new way to get books, in February 2024, Anthropic hired the former head of partnerships for Google’s book-scanning project, Tom Turvey. He was tasked with obtaining "all the books in the world" while still avoiding as much "legal/practice/business slog" as possible (Opp. Exhs. 21, 27). […] Turvey and his team emailed major book distributors and retailers about bulk-purchasing their print copies for the AI firm’s "research library" (Opp. Exh. 22 at 145; Opp. Exh. 31 at -035589). Anthropic spent many millions of dollars to purchase millions of print books, often in used condition. Then, its service providers stripped the books from their bindings, cut their pages to size, and scanned the books into digital form — discarding the paper originals. Each print book resulted in a PDF copy containing images of the scanned pages with machine-readable text (including front and back cover scans for softcover books).
The summary judgement found that these scanned books did fall under fair use, since they were transformative versions of the works and were not shared outside of the company. The downloaded ebooks did not count as fair use, and it looks like those will be the subject of a forthcoming jury trial.
Here’s that section of the decision:
Before buying books for its central library, Anthropic downloaded over seven million pirated copies of books, paid nothing, and kept these pirated copies in its library even after deciding it would not use them to train its AI (at all or ever again). Authors argue Anthropic should have paid for these pirated library copies (e.g., Tr. 24–25, 65; Opp. 7, 12–13). This order agrees.
The most important aspect of this case is the question of whether training an LLM on unlicensed data counts as "fair use". The judge found that it did. The argument for why takes up several pages of the document, but this seems like a key point:
Everyone reads texts, too, then writes new texts. They may need to pay for getting their hands on a text in the first instance. But to make anyone pay specifically for the use of a book each time they read it, each time they recall it from memory, each time they later draw upon it when writing new things in new ways would be unthinkable. For centuries, we have read and re-read books. We have admired, memorized, and internalized their sweeping themes, their substantive points, and their stylistic solutions to recurring writing problems.
The judge who signed this summary judgement is an interesting character: William Haskell Alsup (yes, his middle name really is Haskell) presided over jury trials for Oracle America, Inc. v. Google, Inc. in 2012 and 2016, where he famously used his hobbyist BASIC programming experience to challenge claims made by lawyers in the case.
Tags: law, ai, generative-ai, llms, anthropic, training-data, ai-ethics
AI Summary and Description: Yes
**Summary:** The text discusses a significant legal ruling involving Anthropic, an AI company, regarding the use of copyrighted books in training data, particularly focusing on the court’s decision about fair use. This legal precedent highlights crucial implications for AI training practices and ownership of data, offering insights relevant to AI security and compliance professionals.
**Detailed Description:** The case encapsulates several vital elements surrounding the use of copyrighted material in AI training, specifically in relation to fair use doctrines. Here are the key points to consider:
– **Legal Context:** A judge ruled on Anthropic’s use of copyrighted materials, particularly focusing on the distinction between pirated ebooks and legally acquired print books for training AI models.
– **Anthropic’s Practices:**
– Anthropic initially downloaded over seven million pirated ebooks from various sources, including Books3 and Library Genesis, which served as the foundation for its training data.
– Despite its initial reliance on pirated content, Anthropic later shifted to purchasing print books to build a legitimate research library, employing a process where the books were scanned and converted into digital format.
– **Significance of Fair Use:**
– The court found that the scanned versions of the books created by Anthropic could be considered transformative and thus fair use, while the pirated ebooks did not qualify for this protection.
– This ruling emphasizes the legal grey areas around using unlicensed content for machine learning, raising questions about the nature of copyright in the context of AI.
– **Implications for AI Ethics and Security:**
– The ruling sets an important precedent for acceptable practices when sourcing AI training data, affecting the compliance and risk assessments that AI companies must navigate.
– The expectation that companies ensure the legality of the data they acquire underscores the necessity for robust data governance policies in AI development.
– **Wider Impact:**
– This decision could influence future court cases involving AI model training with copyrighted material, prompting organizations to rethink their data sourcing strategies.
– The ruling challenges established norms surrounding intellectual property and the evolving nature of content usage in AI, thereby shaping future ethical considerations in technology and AI development.
In summary, this ruling shapes the operational landscape for AI companies concerning data ethics and compliance. It also serves as a cautionary tale about using copyrighted materials in AI model training: professionals in the field must now pay closer attention to the legal implications of data acquisition.