Source URL: https://simonwillison.net/2025/Jun/24/anthropic-training/#atom-everything
Source: Simon Willison’s Weblog
Title: Anthropic wins a major fair use victory for AI — but it’s still in trouble for stealing books
Major US legal news for the AI industry today. Judge William Alsup released a “summary judgement” (a legal decision that results in some parts of a case skipping a trial) in a lawsuit between three authors and Anthropic concerning the use of their books in training data.
The judgement itself is a very readable 32-page PDF, and contains all sorts of interesting behind-the-scenes details about how Anthropic trained their models.
The facts of the complaint go back to the very beginning of the company. Anthropic was founded by a group of ex-OpenAI researchers in February 2021. According to the judgement:
So, in January or February 2021, another Anthropic cofounder, Ben Mann, downloaded Books3, an online library of 196,640 books that he knew had been assembled from unauthorized copies of copyrighted books — that is, pirated. Anthropic’s next pirated acquisitions involved downloading distributed, reshared copies of other pirate libraries. In June 2021, Mann downloaded in this way at least five million copies of books from Library Genesis, or LibGen, which he knew had been pirated. And, in July 2022, Anthropic likewise downloaded at least two million copies of books from the Pirate Library Mirror, or PiLiMi, which Anthropic knew had been pirated.
Books3 was also listed as part of the training data for Meta’s LLaMA!
Anthropic apparently used these sources of data to help build an internal "research library" of content that they then filtered and annotated and used in training runs.
Books turned out to be a very valuable component of the "data mix" to train strong models. By 2024 Anthropic had a new approach to collecting them: purchase and scan millions of print books!
To find a new way to get books, in February 2024, Anthropic hired the former head of partnerships for Google’s book-scanning project, Tom Turvey. He was tasked with obtaining "all the books in the world" while still avoiding as much "legal/practice/business slog" as possible (Opp. Exhs. 21, 27). […] Turvey and his team emailed major book distributors and retailers about bulk-purchasing their print copies for the AI firm’s "research library" (Opp. Exh. 22 at 145; Opp. Exh. 31 at -035589). Anthropic spent many millions of dollars to purchase millions of print books, often in used condition. Then, its service providers stripped the books from their bindings, cut their pages to size, and scanned the books into digital form — discarding the paper originals. Each print book resulted in a PDF copy containing images of the scanned pages with machine-readable text (including front and back cover scans for softcover books).
The summary judgement found that these scanned books did fall under fair use, since they were transformative versions of the works and were not shared outside of the company. The downloaded ebooks did not count as fair use, and it looks like those will be the subject of a forthcoming jury trial.
Here’s that section of the decision:
Before buying books for its central library, Anthropic downloaded over seven million pirated copies of books, paid nothing, and kept these pirated copies in its library even after deciding it would not use them to train its AI (at all or ever again). Authors argue Anthropic should have paid for these pirated library copies (e.g., Tr. 24–25, 65; Opp. 7, 12–13). This order agrees.
The most important aspect of this case is the question of whether training an LLM on unlicensed data counts as "fair use". The judge found that it did. The argument for why takes up several pages of the document, but this seems like a key point:
Everyone reads texts, too, then writes new texts. They may need to pay for getting their hands on a text in the first instance. But to make anyone pay specifically for the use of a book each time they read it, each time they recall it from memory, each time they later draw upon it when writing new things in new ways would be unthinkable. For centuries, we have read and re-read books. We have admired, memorized, and internalized their sweeping themes, their substantive points, and their stylistic solutions to recurring writing problems.
The judge who signed this summary judgement is an interesting character: William Haskell Alsup (yes, his middle name really is Haskell) presided over jury trials for Oracle America, Inc. v. Google, Inc. in 2012 and 2016, where he famously used his hobbyist BASIC programming experience to challenge claims made by lawyers in the case.
Tags: law, ai, generative-ai, llms, anthropic, training-data, ai-ethics
AI Summary and Description: Yes
**Summary:** The text discusses a significant legal ruling involving Anthropic, an AI company, regarding the use of copyrighted books in training data, particularly focusing on the court’s decision about fair use. This legal precedent highlights crucial implications for AI training practices and ownership of data, offering insights relevant to AI security and compliance professionals.
**Detailed Description:** The case encapsulates several vital elements surrounding the use of copyrighted material in AI training, specifically in relation to fair use doctrines. Here are the key points to consider:
– **Legal Context:** A judge ruled on Anthropic’s use of copyrighted materials, particularly focusing on the distinction between pirated ebooks and legally acquired print books for training AI models.
– **Anthropic’s Practices:**
– Anthropic initially downloaded over seven million pirated ebooks from various sources, including Books3 and Library Genesis, which served as the foundation for its training data.
– Despite its initial reliance on pirated content, Anthropic later shifted to purchasing print books to build a legitimate research library, employing a process where the books were scanned and converted into digital format.
– **Significance of Fair Use:**
– The court found that the scanned versions of the books created by Anthropic could be considered transformative and thus fair use, while the pirated ebooks did not qualify for this protection.
– This ruling emphasizes the legal grey areas around using unlicensed content for machine learning, raising questions about the nature of copyright in the context of AI.
– **Implications for AI Ethics and Security:**
– The ruling sets an important precedent for acceptable practices when sourcing AI training data, affecting the compliance and risk assessments that AI companies must navigate.
– The expectation that companies ensure the legality of the data they acquire underscores the necessity for robust data governance policies in AI development.
– **Wider Impact:**
– This decision could influence future court cases involving AI model training with copyrighted material, prompting organizations to rethink their data sourcing strategies.
– The ruling challenges established norms surrounding intellectual property and the evolving nature of content usage in AI, thereby shaping future ethical considerations in technology and AI development.
In summary, this ruling shapes the operational landscape for AI companies concerning data ethics and compliance. It also serves as a cautionary tale about using copyrighted materials in AI model training: professionals in the field must now pay closer attention to the legal implications of data acquisition.