Simon Willison’s Weblog: New Pleias 1.0 LLMs trained exclusively on openly licensed data

Source URL: https://simonwillison.net/2024/Dec/5/pleias-llms/#atom-everything
Source: Simon Willison’s Weblog
Title: New Pleias 1.0 LLMs trained exclusively on openly licensed data

Feedly Summary: New Pleias 1.0 LLMs trained exclusively on openly licensed data
I wrote about the Common Corpus public domain dataset back in March. Now Pleias, the team behind Common Corpus, have released the first family of models that are:

[…] trained exclusively on open data, meaning data that are either non-copyrighted or are published under a permissible license.

There’s a lot to absorb here. The Pleias 1.0 family comes in three base model sizes: 350M, 1.2B and 3B. They’ve also released two models specialized for multi-lingual RAG: Pleias-Pico (350M) and Pleias-Nano (1.2B).
No GGUFs yet so I’ve not managed to run the models and try them out myself.
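In the absence of GGUF builds, one way to experiment would be to load the models directly with the Hugging Face transformers library. The sketch below assumes a hypothetical model ID of "PleIAs/Pleias-Pico"; the exact repository names should be checked against the PleIAs organization on Hugging Face.

```python
# Minimal sketch: trying a Pleias base model with transformers (no GGUF needed).
# NOTE: "PleIAs/Pleias-Pico" is an assumed model ID - verify the actual
# repository name on the PleIAs Hugging Face organization page.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PleIAs/Pleias-Pico"  # assumed ID for the 350M model

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "The Common Corpus dataset is"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```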
I’m looking forward to seeing benchmarks from other sources, but Pleias ran their own custom multilingual RAG benchmark which had their Pleias-nano-1.2B-RAG model come in between Llama-3.2-Instruct-3B and Llama-3.2-Instruct-8B.
The 350M and 3B models were trained on the French government’s Jean Zay supercomputer. Pleias are proud of their CO2 footprint for training the models – 0.5, 4 and 16 tCO2eq for the three models respectively, which they compare to Llama 3.2's reported figure of 133 tCO2eq.
Via @Dorialexander
Tags: open-source, generative-ai, training-data, ai, llms

AI Summary and Description: Yes

Summary: The text discusses the release of the Pleias 1.0 family of LLMs (Large Language Models) trained exclusively on openly licensed data. This development is significant as it not only emphasizes the importance of using non-copyrighted data in AI training but also highlights efforts to reduce the carbon footprint associated with ML model training.

Detailed Description:

The Pleias team has introduced the Pleias 1.0 models, specifically trained on openly licensed datasets. This initiative represents a shift towards using non-copyrighted or permissible license data in the development of AI models, which can have various implications for the fields of AI, information security, and compliance.

Key points include:

– **Openly Licensed Data**: Utilization of public domain datasets or data published under permissive licenses aims to ensure that AI models are built legally and ethically, reducing concerns about copyright infringement.

– **Model Family**: The Pleias 1.0 suite includes three base model sizes (350M, 1.2B, and 3B parameters) along with two specialized multilingual retrieval-augmented generation (RAG) models: Pleias-Pico (350M) and Pleias-Nano (1.2B). A sketch of the general RAG prompting pattern follows this list.

– **Benchmarking Performance**: Pleias conducted their own multilingual RAG benchmarks, showing that their Pleias-nano-1.2B-RAG model performs competitively compared to Llama models.

– **Environmental Considerations**: The models were trained on the Jean Zay supercomputer in France, and Pleias claims to maintain a significantly lower carbon footprint (measured in CO2 equivalent) for training these models, promoting responsible AI development.

– **Future Comparisons**: While the author hasn’t run the models personally, they express interest in seeing external benchmarks, reflecting a common practice within the AI community for evaluating model performance.
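
To make the "multilingual RAG" specialization concrete, the sketch below shows the generic RAG pattern of prepending retrieved passages to the user's question. This is an illustration of the general technique, not the specific prompt format the Pleias RAG models expect; their model cards should be consulted for that.

```python
# Generic retrieval-augmented generation (RAG) prompt assembly - a sketch of
# the pattern RAG-specialized models target, NOT the exact Pleias format.

def build_rag_prompt(question: str, retrieved_docs: list[str]) -> str:
    """Concatenate retrieved passages with the question so the model can
    answer grounded in the supplied sources."""
    sources = "\n\n".join(
        f"Source {i + 1}: {doc}" for i, doc in enumerate(retrieved_docs)
    )
    return f"{sources}\n\nQuestion: {question}\nAnswer:"

# Example usage with placeholder documents
docs = [
    "Common Corpus is a large openly licensed text dataset.",
    "Pleias 1.0 models were trained exclusively on openly licensed data.",
]
print(build_rag_prompt("What data were the Pleias models trained on?", docs))
```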

The Pleias 1.0 models are significant for professionals involved in AI development: they pivot towards ethical data use and sustainability, set a potential benchmark for future AI training methodologies, and support compliance with regulations surrounding data usage and environmental responsibility.