Tag: Common Crawl
-
Simon Willison’s Weblog: Comma v0.1 1T and 2T – 7B LLMs trained on openly licensed text
Source URL: https://simonwillison.net/2025/Jun/7/comma/#atom-everything
Source: Simon Willison’s Weblog
Title: Comma v0.1 1T and 2T – 7B LLMs trained on openly licensed text
Feedly Summary: It’s been a long time coming, but we finally have some promising LLMs to try out which are trained entirely on openly licensed text! EleutherAI released the Pile four and a half…
-
Hacker News: Simple Explanation of LLMs
Source URL: https://blog.oedemis.io/understanding-llms-a-simple-guide-to-large-language-models
Source: Hacker News
Title: Simple Explanation of LLMs
Feedly Summary: Comments
AI Summary and Description: Yes
**Summary:** The text provides a comprehensive overview of Large Language Models (LLMs), highlighting their rapid adoption in AI, the foundational concepts behind their architecture, such as attention mechanisms and tokenization, and their implications for various fields.…
-
Hacker News: Classifying All of the Pdfs on the Internet
Source URL: https://snats.xyz/pages/articles/classifying_a_bunch_of_pdfs.html
Source: Hacker News
Title: Classifying All of the Pdfs on the Internet
Feedly Summary: Comments
AI Summary and Description: Yes
**Summary:** The text discusses classifying a massive dataset of PDFs obtained from the Common Crawl, particularly focusing on a customized approach utilizing large language models (LLMs), embeddings, and traditional machine learning techniques…