Simon Willison’s Weblog: Releasing the largest multilingual open pretraining dataset

Source URL: https://simonwillison.net/2024/Nov/14/releasing-the-largest-multilingual-open-pretraining-dataset/#atom-everything
Source: Simon Willison’s Weblog
Title: Releasing the largest multilingual open pretraining dataset

Feedly Summary: Releasing the largest multilingual open pretraining dataset
Common Corpus is a new “open and permissibly licensed text dataset, comprising over 2 trillion tokens (2,003,039,184,047 tokens)” released by French AI Lab PleIAs.
This appears to be the largest available corpus of openly licensed training data:

926,541,096,243 tokens of public domain books, newspapers, and Wikisource content
387,965,738,992 tokens of government financial and legal documents
334,658,896,533 tokens of open source code from GitHub
221,798,136,564 tokens of academic content from open science repositories
132,075,315,715 tokens from Wikipedia, YouTube Commons, StackExchange and other permissively licensed web sources
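
As a quick sanity check, those per-category counts sum exactly to the headline figure. A short Python snippet (category labels abbreviated from the list above) confirms it:

```python
# Per-category token counts as listed above.
token_counts = {
    "public domain books, newspapers, Wikisource": 926_541_096_243,
    "government financial and legal documents": 387_965_738_992,
    "open source code from GitHub": 334_658_896_533,
    "academic open science repositories": 221_798_136_564,
    "Wikipedia, YouTube Commons, StackExchange, other web": 132_075_315_715,
}

total = sum(token_counts.values())
print(f"{total:,}")  # 2,003,039,184,047 -- the headline figure
```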

It’s majority English but has significant portions in French and German, and some representation for Latin, Dutch, Italian, Polish, Greek and Portuguese.
I can’t wait to try some LLMs trained exclusively on this data. Maybe we will finally get a GPT-4 class model that isn’t trained on unlicensed copyrighted data.
Via @dorialexander
Tags: ethics, generative-ai, training-data, ai, llms

AI Summary and Description: Yes

Summary: The release of the Common Corpus dataset by PleIAs represents a significant advancement in the availability of multilingual pretraining data for AI models. The dataset, which includes over 2 trillion tokens from a variety of sources, highlights a commitment to open licensing and ethical training data use, addressing concerns about using unlicensed copyrighted materials.

Detailed Description: The announcement of the Common Corpus dataset has major implications for the development of AI, particularly in the realm of large language models (LLMs). As organizations and developers strive for greater ethical compliance in AI training practices, this dataset provides an expansive resource while aiming to reduce reliance on unlicensed data.

Key points from the release include:

– **Size and Scope**: The Common Corpus offers more than 2 trillion tokens, making it the largest openly licensed text dataset available.

– **Diverse Sources**:
  – 926 billion tokens of public domain books, newspapers, and Wikisource content
  – 388 billion tokens of government financial and legal documents
  – 335 billion tokens of open source code from GitHub
  – 222 billion tokens of academic content from open science repositories
  – 132 billion tokens from permissively licensed web sources (Wikipedia, YouTube Commons, StackExchange)

– **Multilingual Focus**: While the majority of the dataset is in English, it also includes substantial amounts of text in French and German and, to a lesser extent, Latin, Dutch, Italian, Polish, Greek, and Portuguese.

– **Ethical Considerations**: The dataset is significant for AI ethics because it makes it possible to train models that do not rely on unlicensed copyrighted data, which could help advanced generative AI systems adhere more closely to ethical guidelines.

– **Potential for Future Models**: The dataset opens avenues for training large language models (LLMs) that could rival existing state-of-the-art models like GPT-4 while maintaining compliance with standards around data licensing.
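
For anyone who wants to experiment along these lines, a minimal sketch of streaming the corpus with the Hugging Face `datasets` library is shown below. The `PleIAs/common_corpus` repository identifier and the record fields are assumptions about how the release is published, not verified here:

```python
# Minimal sketch: stream Common Corpus rather than downloading it,
# since the full corpus is on the order of 2 trillion tokens.
# The dataset identifier below is an assumption about the release.
from datasets import load_dataset

ds = load_dataset("PleIAs/common_corpus", split="train", streaming=True)

# Inspect a few records to see which fields (text, license, language, ...)
# are actually present before building a training pipeline on top.
for i, record in enumerate(ds):
    print(record)
    if i >= 2:
        break
```

Streaming avoids materialising the full dataset locally while still allowing per-record inspection and filtering before committing to a training run.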

Overall, the release of the Common Corpus dataset is expected to enhance the landscape of AI training data, advance the capabilities of multilingual AI systems, and push ethical AI development forward in line with modern compliance and governance standards.