Source URL: https://www.theregister.com/2025/03/18/ai_crawlers_sourcehut/
Source: The Register
Title: AI crawlers haven’t learned to play nice with websites
Feedly Summary: SourceHut says it’s getting DDoSed by LLM bots
SourceHut, an open source git-hosting service, says web crawlers for AI companies are slowing down services through their excessive demands for data.…
AI Summary and Description: Yes
Summary: The text discusses the significant impact of aggressive AI web crawlers, particularly those used for training large language models (LLMs), on open-source services like SourceHut. It highlights ongoing issues related to excessive bot traffic, measures taken by companies to mitigate these effects, and the broader implications for developers and service providers.
Detailed Description:
The article examines the negative consequences of AI web crawlers on open-source platforms, primarily through the lens of SourceHut’s experiences. Key points include:
– **Service Disruptions**: SourceHut reports that aggressive crawlers, mainly from AI companies, are causing service slowdowns. This is a recurring issue that has pushed SourceHut to implement various mitigation strategies.
– **Mitigation Strategies** (two illustrative sketches follow this list):
  – Deployment of Nepenthes, a “tar pit” designed to trap misbehaving crawlers in an endless maze of generated pages and waste their resources.
  – Unilateral blocking of several cloud providers (e.g., GCP, Microsoft Azure) because of the volume of bot traffic originating from their networks.
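To make the “tar pit” idea concrete, here is a minimal self-contained sketch in the spirit of such tools; it illustrates the technique only and is not Nepenthes’ actual code. Every page is generated on the fly, served slowly, and links only to more generated pages, so a crawler that ignores `robots.txt` burns time and requests on worthless content:

```python
import http.server
import random
import time

# Illustrative tar-pit sketch (not Nepenthes itself): each response is
# deliberately slow and contains only links deeper into the maze.
class TarpitHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        time.sleep(2)  # throttle: make every request expensive for the bot
        links = "".join(
            f'<a href="/maze/{random.getrandbits(32):x}">more</a> '
            for _ in range(10)
        )
        body = f"<html><body>{links}</body></html>".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet

if __name__ == "__main__":
    http.server.HTTPServer(("", 8080), TarpitHandler).serve_forever()
```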
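And a sketch of what provider-level blocking can look like at the web-server layer, assuming nginx; the CIDR ranges below are documentation placeholders, not SourceHut’s actual blocklist:

```nginx
# Placed at the http level: classify requests by User-Agent.
map $http_user_agent $ai_crawler {
    default      0;
    ~*GPTBot     1;   # OpenAI
    ~*ClaudeBot  1;   # Anthropic
    ~*CCBot      1;   # Common Crawl
}

server {
    listen 80;
    server_name example.org;

    # Placeholder ranges standing in for cloud-provider networks.
    deny 192.0.2.0/24;
    deny 198.51.100.0/24;

    # Refuse requests from self-identified AI crawlers.
    if ($ai_crawler) {
        return 403;
    }

    location / {
        root /var/www/html;
    }
}
```

Note that User-Agent filtering only stops crawlers that identify themselves honestly, which is exactly the limitation the spoofing reports below describe.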
– **Historical Context**: The text references past incidents of high traffic due to crawlers and likens the overload to denial-of-service attacks. This pattern is consistent across other open-source projects, underlining a growing issue caused by LLM training bots.
– **Recent Trends**:
  – Bot activity has increased markedly over the last two years, correlating with the rise of generative AI technologies.
  – In 2023, certain AI companies, such as OpenAI, publicly committed to respecting `robots.txt` files, yet reports of ongoing crawler abuse persist (a minimal opt-out example follows this list).
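For reference, the opt-out mechanism those commitments refer to is a plain `robots.txt` file; the user-agent tokens below are the ones OpenAI and Anthropic document for their crawlers:

```
# robots.txt — ask AI training crawlers to skip the entire site
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
```

Compliance is voluntary: `robots.txt` is a convention, not an enforcement mechanism, which is why abuse reports persist despite such directives.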
– **Statistics**:
  – On Vercel’s network, OpenAI’s GPTBot and Anthropic’s ClaudeBot generated substantial traffic, comprising a significant portion of overall web requests.
– **Spoofing Concerns**: Developers have reported user-agent string spoofing, which complicates traffic analysis and mitigation; a common countermeasure, verifying a claimed crawler identity via DNS, is sketched below.
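A minimal sketch of that verification, using Python’s standard library and the reverse-then-forward DNS check Google documents for Googlebot (other operators, such as OpenAI for GPTBot, instead publish IP ranges to match against):

```python
import socket

def is_genuine_googlebot(ip: str) -> bool:
    """Return True if `ip` passes the reverse-then-forward DNS check."""
    try:
        # Reverse (PTR) lookup: genuine Googlebot IPs resolve to a
        # hostname under googlebot.com or google.com.
        host, _, _ = socket.gethostbyaddr(ip)
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        # Forward lookup: the hostname must resolve back to the same
        # IP; a PTR record alone can be forged by whoever controls
        # the address's reverse zone.
        forward_ips = {info[4][0] for info in socket.getaddrinfo(host, None)}
        return ip in forward_ips
    except (socket.herror, socket.gaierror):
        return False

# Hypothetical usage with a placeholder address:
# if not is_genuine_googlebot("198.51.100.7"):
#     ...  # treat the "Googlebot" request as spoofed
```

The forward lookup matters: without it, anyone who controls an address block’s reverse zone could make their IPs masquerade as a legitimate crawler.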
– **Industry Impact**: Incidents of crawling abuse have led to increased invalid traffic in advertising metrics, with AI crawlers becoming a notable factor.
– **Technical Developments**: Google introduced a new robots.txt user-agent token, Google-Extended, that lets site owners keep their content out of AI training datasets without sacrificing normal search indexing (example below).
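A minimal `robots.txt` using that token: Googlebot continues to crawl for search, while the Google-Extended token opts the site out of use in Google’s AI models:

```
# Keep normal search indexing
User-agent: Googlebot
Allow: /

# Opt out of use for training Google's AI models
User-agent: Google-Extended
Disallow: /
```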
Professionals in AI, cloud, and information security should take note of:
– The need for stringent access controls and compliance measures around AI data-scraping practices.
– Closer scrutiny of inbound traffic to identify abusive scrapers reliably.
– Proactive engagement with cloud providers on crawler abuse, which could lead to collaboration on better-behaved crawling technology or shared compliance frameworks.
The situation underscores the need to balance AI innovation against the operational integrity of the web services it depends on.