Tag: web scraping

  • Slashdot: AI Crawlers Haven’t Learned To Play Nice With Websites

    Source URL: https://slashdot.org/story/25/03/19/1027251/ai-crawlers-havent-learned-to-play-nice-with-websites?utm_source=rss1.0mainlinkanon&utm_medium=feed Source: Slashdot Title: AI Crawlers Haven’t Learned To Play Nice With Websites Feedly Summary: AI Summary and Description: Yes Summary: SourceHut is experiencing service disruptions due to aggressive web crawling by AI companies collecting data for training large language models (LLMs). They have implemented mitigations, including blocking certain cloud providers due to…

  • Hacker News: AI crawlers haven’t learned to play nice with websites

    Source URL: https://www.theregister.com/2025/03/18/ai_crawlers_sourcehut/ Source: Hacker News Title: AI crawlers haven’t learned to play nice with websites Feedly Summary: Comments AI Summary and Description: Yes Summary: SourceHut reports that excessive crawling by AI companies’ web crawlers is disrupting its services. These crawlers, primarily for training large language models (LLMs), have compelled SourceHut to implement several mitigations,…

  • Slashdot: BlueSky Proposes ‘New Standard’ for When Scraping Data for AI Training

    Source URL: https://tech.slashdot.org/story/25/03/17/0434237/bluesky-proposes-new-standard-for-when-scraping-data-for-ai-training?utm_source=rss1.0mainlinkanon&utm_medium=feed Source: Slashdot Title: BlueSky Proposes ‘New Standard’ for When Scraping Data for AI Training Feedly Summary: AI Summary and Description: Yes Summary: The article discusses Bluesky’s proposal for user data consent regarding scraping for generative AI training and archiving. This initiative signifies a potential shift in how user data privacy is managed…

  • The Register: We did not have Brave clashing with Rupert Murdoch on our 2025 bingo card, but there it is

    Source URL: https://www.theregister.com/2025/03/13/brave_news_corp_content/ Source: The Register Title: We did not have Brave clashing with Rupert Murdoch on our 2025 bingo card, but there it is Feedly Summary: Indie browser maker asks judge for legal shield against copyright threats over AI summaries Brave has gone to court to head off potential legal action from News Corp…

  • Simon Willison’s Weblog: What’s new in the world of LLMs, for NICAR 2025

    Source URL: https://simonwillison.net/2025/Mar/8/nicar-llms/ Source: Simon Willison’s Weblog Title: What’s new in the world of LLMs, for NICAR 2025 Feedly Summary: I presented two sessions at the NICAR 2025 data journalism conference this year. The first was this one based on my review of LLMs in 2024, extended by several months to cover everything that’s happened…

  • Simon Willison’s Weblog: Cutting-edge web scraping techniques at NICAR

    Source URL: https://simonwillison.net/2025/Mar/8/cutting-edge-web-scraping/#atom-everything Source: Simon Willison’s Weblog Title: Cutting-edge web scraping techniques at NICAR Feedly Summary: Here’s the handout for a workshop I presented this morning at NICAR 2025 on web scraping, focusing on lesser-known tips and tricks that became possible only with recent developments in LLMs. For…

  • Simon Willison’s Weblog: Structured data extraction from unstructured content using LLM schemas

    Source URL: https://simonwillison.net/2025/Feb/28/llm-schemas/#atom-everything Source: Simon Willison’s Weblog Title: Structured data extraction from unstructured content using LLM schemas Feedly Summary: LLM 0.23 is out today, and the signature feature is support for schemas – a new way of providing structured output from a model that matches a specification provided by the user. I’ve also upgraded both…

  • Hacker News: AI haters build tarpits to trap and trick AI scrapers that ignore robots.txt

    Source URL: https://arstechnica.com/tech-policy/2025/01/ai-haters-build-tarpits-to-trap-and-trick-ai-scrapers-that-ignore-robots-txt/ Source: Hacker News Title: AI haters build tarpits to trap and trick AI scrapers that ignore robots.txt Feedly Summary: Comments AI Summary and Description: Yes Summary: The text discusses the creation of a new piece of malware named Nepenthes, designed by a software developer to combat AI web crawlers that ignore “no scraping” directives…

  • Slashdot: Developer Creates Infinite Maze That Traps AI Training Bots

    Source URL: https://slashdot.org/story/25/01/23/2135205/developer-creates-infinite-maze-that-traps-ai-training-bots?utm_source=rss1.0mainlinkanon&utm_medium=feed Source: Slashdot Title: Developer Creates Infinite Maze That Traps AI Training Bots Feedly Summary: AI Summary and Description: Yes Summary: The text discusses the development of an open-source program called Nepenthes, designed to trap AI web crawlers in an endless loop of link generation, effectively wasting their resources. This innovative approach provides…

  • Hacker News: Thoughts on a Month with Devin

    Source URL: https://www.answer.ai/posts/2025-01-08-devin.html Source: Hacker News Title: Thoughts on a Month with Devin Feedly Summary: Comments AI Summary and Description: Yes Summary: The text provides an in-depth analysis of an AI-driven programming assistant named Devin, highlighting both its potential and failures in software development tasks. The initial successes in API interactions and documentation are contrasted…