Source URL: https://simonwillison.net/2025/Jun/7/comma/#atom-everything
Source: Simon Willison’s Weblog
Title: Comma v0.1 1T and 2T – 7B LLMs trained on openly licensed text
Feedly Summary: It’s been a long time coming, but we finally have some promising LLMs to try out which are trained entirely on openly licensed text!
EleutherAI released the Pile four and a half years ago: "an 800GB dataset of diverse text for language modeling". It's been used as the basis for many LLMs since then, but much of the data in it came from Common Crawl – a crawl of the public web which mostly ignored the licenses of the data it was collecting.
The Common Pile v0.1 is EleutherAI's successor to the original Pile, built in collaboration with a large group of other organizations. They describe it as "a meticulously curated 8 TB corpus of openly licensed and public domain text for training large language models".
The dataset is exciting, but on top of that they’ve released two new LLMs that have been trained on it: Comma v0.1 1T and 2T, both with 7 billion parameters, the first trained on 1 trillion tokens and the second on 2 trillion tokens.
These are available on Hugging Face as common-pile/comma-v0.1-1t and common-pile/comma-v0.1-2t.
EleutherAI claim that these new models perform "comparably to leading models trained in the same regime on unlicensed data". I decided to try them out myself.
The models are currently only available as .safetensors files, which I’ve found difficult to run on macOS in the past. I decided to see if I could convert them to MLX format which I know how to run on my Mac.
MLX is still a very new format, but Claude 4 Sonnet has a training cutoff date of March 2025 so I crossed my fingers and hoped it would be able to help me out. It did exactly that! I ran the following command to convert the 2T model to run using MLX:
uv run --python 3.12 \
  --with mlx-lm \
  python -m mlx_lm convert \
  --hf-path common-pile/comma-v0.1-2t \
  --mlx-path ./comma-v0.1-2t-mlx
I uploaded the converted model to Hugging Face as simonw/comma-v0.1-2t-mlx.
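If you want to publish a converted copy of your own, one option (a minimal sketch, not necessarily the exact steps I used; the repo name below is a placeholder) is the huggingface_hub Python API:

# Hypothetical upload sketch using huggingface_hub (pip install huggingface-hub).
# Assumes you are already authenticated (e.g. via `huggingface-cli login`) and that
# ./comma-v0.1-2t-mlx contains the converted weights from the step above.
from huggingface_hub import HfApi

api = HfApi()
api.create_repo("your-username/comma-v0.1-2t-mlx", exist_ok=True)  # placeholder repo id
api.upload_folder(
    folder_path="./comma-v0.1-2t-mlx",
    repo_id="your-username/comma-v0.1-2t-mlx",
    repo_type="model",
)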
Now that it’s on the Hub here’s how to try it out (using uv run):
uv run --python 3.12 \
  --with mlx-lm \
  mlx_lm.generate \
  --model simonw/comma-v0.1-2t-mlx \
  --prompt 'Facts about pelicans:'
The first time you run this it will download 13GB of files to ~/.cache/huggingface/hub/models--simonw--comma-v0.1-2t-mlx.
Here’s what I got back:
1. They are the largest of the water birds. 2. They are found in all parts of the world. 3. They are very good swimmers. 4. They are very good divers. 5. They are very good flyers. 6. They are very good hunters. 7. They are very good eaters. 8. They are very good parents. 9. They are very good friends. 10.
The big limitation of this model right now is that it’s a raw base model – it hasn’t been instruction-tuned or set up for chat.
This means you have to prefix-prompt it, like in the GPT-3 days: give it the start of some text and it will attempt to complete it.
This makes it a lot harder to evaluate than the instruction-tuned models that I’ve become used to over the past few years!
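If you'd rather prefix-prompt it from Python than via the CLI, mlx-lm also exposes load and generate functions. Here's a rough sketch along those lines, assuming the converted model above (the prompt and token count are arbitrary):

# Rough sketch using the mlx-lm Python API (uv pip install mlx-lm).
from mlx_lm import load, generate

model, tokenizer = load("simonw/comma-v0.1-2t-mlx")
# Base model: there's no chat template, so the prompt is just text to be continued.
text = generate(model, tokenizer, prompt="Facts about pelicans:", max_tokens=100)
print(text)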
I’m hoping someone releases a chat-tuned version of this model soon. The challenge there will be keeping to the openly licensed training data, since most of the fine-tuning datasets out there for this are themselves derived from models that were trained on unlicensed data.
Tags: llms, ai-ethics, llm-release, generative-ai, training-data, ai, mlx
AI Summary and Description: Yes
Summary: The text introduces EleutherAI’s new openly licensed dataset, Common Pile v0.1, and two large language models (LLMs) trained with it, highlighting their competitive performance compared to models trained on unlicensed data. It also discusses the technical challenges of using these models, particularly regarding their current base nature before further instruction tuning.
Detailed Description:
– **Introduction of Common Pile v0.1**: EleutherAI’s effort to curate a dataset that avoids unlicensed text for training models.
– Successor to the original Pile, focusing on openly licensed and public domain texts.
– A significant collaborative effort bringing together various organizations to compile an 8 TB corpus.
– **Release of New LLMs**: The introduction of the Comma v0.1 1T and 2T models, each with 7 billion parameters.
– Trained on 1 trillion and 2 trillion tokens respectively.
– Available on Hugging Face, allowing community access to cutting-edge AI technology.
– **Performance Claims**: EleutherAI asserts that these models perform comparably to top-tier models trained on non-licensed data, opening discussions around the ethics of dataset curation and model performance.
– **Practical Use**: The text outlines user experiences with model conversion and interaction.
– Challenges faced in using .safetensors files on macOS and the successful conversion to MLX format.
– Step-by-step instructions are provided for running the models, emphasizing accessibility and community-driven AI experimentation.
– **Limitations**: An acknowledgment that the current models lack instruction tuning, which can complicate their evaluation and usability compared to more advanced models that facilitate user interaction.
– Highlights a need for a chat-tuned version of these models to enhance usability, underlining the ongoing challenges in the AI development landscape related to dataset licensing issues.
**Key Insights for Professionals**:
– The development reflects a significant shift towards ethical data practices in AI, marking a positive move for regulatory compliance.
– The text exemplifies the ongoing challenges in AI model training and deployment, particularly around licensing and usability.
– The technical details shared provide actionable insights for engineers and data scientists looking to integrate LLMs into applications or to further explore AI training methodologies.
– As licensing becomes increasingly crucial in AI and ML, these developments prompt a broader conversation on compliance, governance, and ethical AI development practices.