Source URL: https://www.theregister.com/2025/03/18/ai_crawlers_sourcehut/
Source: The Register
Title: AI crawlers haven’t learned to play nice with websites
Feedly Summary: SourceHut says it’s getting DDoSed by LLM bots
SourceHut, an open source git-hosting service, says web crawlers for AI companies are slowing down services through their excessive demands for data.…
AI Summary and Description: Yes
Summary: The text discusses the significant impact of aggressive AI web crawlers, particularly those used for training large language models (LLMs), on open-source services like SourceHut. It highlights ongoing issues related to excessive bot traffic, measures taken by companies to mitigate these effects, and the broader implications for developers and service providers.
Detailed Description:
The article examines the negative consequences of AI web crawlers on open-source platforms, primarily through the lens of SourceHut’s experiences. Key points include:
– **Service Disruptions**: SourceHut reports that aggressive crawlers, mainly from AI companies, are causing service slowdowns. This is a recurring issue that has pushed SourceHut to implement various mitigation strategies.
– **Mitigation Strategies** (two illustrative sketches follow this list):
  – Deployment of Nepenthes, a “tar pit” designed to trap misbehaving crawlers in an endless maze of generated pages and waste their resources.
  – Unilateral blocking of several cloud providers (e.g., GCP, Microsoft Azure) because of the volume of bot traffic originating from their networks.
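To make the “tar pit” idea concrete, here is a minimal self-contained sketch in the spirit of such tools; it illustrates the technique only and is not Nepenthes’ actual code. Every page is generated on the fly, served slowly, and links only to more generated pages, so a crawler that ignores `robots.txt` burns time and requests on worthless content:

```python
import http.server
import random
import time

# Illustrative tar-pit sketch (not Nepenthes itself): each response is
# deliberately slow and contains only links deeper into the maze.
class TarpitHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        time.sleep(2)  # throttle: make every request expensive for the bot
        links = "".join(
            f'<a href="/maze/{random.getrandbits(32):x}">more</a> '
            for _ in range(10)
        )
        body = f"<html><body>{links}</body></html>".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet

if __name__ == "__main__":
    http.server.HTTPServer(("", 8080), TarpitHandler).serve_forever()
```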
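And a sketch of what provider-level blocking can look like at the web-server layer, assuming nginx; the CIDR ranges below are documentation placeholders, not SourceHut’s actual blocklist:

```nginx
# Placed at the http level: classify requests by User-Agent.
map $http_user_agent $ai_crawler {
    default      0;
    ~*GPTBot     1;   # OpenAI
    ~*ClaudeBot  1;   # Anthropic
    ~*CCBot      1;   # Common Crawl
}

server {
    listen 80;
    server_name example.org;

    # Placeholder ranges standing in for cloud-provider networks.
    deny 192.0.2.0/24;
    deny 198.51.100.0/24;

    # Refuse requests from self-identified AI crawlers.
    if ($ai_crawler) {
        return 403;
    }

    location / {
        root /var/www/html;
    }
}
```

Note that User-Agent filtering only stops crawlers that identify themselves honestly, which is exactly the limitation the spoofing reports below describe.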
– **Historical Context**: The text references past incidents of high traffic due to crawlers and likens the overload to denial-of-service attacks. This pattern is consistent across other open-source projects, underlining a growing issue caused by LLM training bots.
– **Recent Trends**:
  – Bot activity has increased markedly over the last two years, correlating with the rise of generative AI technologies.
  – In 2023, certain AI companies, such as OpenAI, publicly committed to respecting `robots.txt` files, yet reports of ongoing crawler abuse persist (a minimal opt-out example follows this list).
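For reference, the opt-out mechanism those commitments refer to is a plain `robots.txt` file; the user-agent tokens below are the ones OpenAI and Anthropic document for their crawlers:

```
# robots.txt — ask AI training crawlers to skip the entire site
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
```

Compliance is voluntary: `robots.txt` is a convention, not an enforcement mechanism, which is why abuse reports persist despite such directives.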
– **Statistics**:
  – On Vercel’s network, OpenAI’s GPTBot and Anthropic’s ClaudeBot generated substantial traffic, comprising a significant portion of overall web requests.
– **Spoofing Concerns**: Developers have reported user-agent string spoofing, which complicates traffic analysis and mitigation; a common countermeasure, verifying a claimed crawler identity via DNS, is sketched below.
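A minimal sketch of that verification, using Python’s standard library and the reverse-then-forward DNS check Google documents for Googlebot (other operators, such as OpenAI for GPTBot, instead publish IP ranges to match against):

```python
import socket

def is_genuine_googlebot(ip: str) -> bool:
    """Return True if `ip` passes the reverse-then-forward DNS check."""
    try:
        # Reverse (PTR) lookup: genuine Googlebot IPs resolve to a
        # hostname under googlebot.com or google.com.
        host, _, _ = socket.gethostbyaddr(ip)
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        # Forward lookup: the hostname must resolve back to the same
        # IP; a PTR record alone can be forged by whoever controls
        # the address's reverse zone.
        forward_ips = {info[4][0] for info in socket.getaddrinfo(host, None)}
        return ip in forward_ips
    except (socket.herror, socket.gaierror):
        return False

# Hypothetical usage with a placeholder address:
# if not is_genuine_googlebot("198.51.100.7"):
#     ...  # treat the "Googlebot" request as spoofed
```

The forward lookup matters: without it, anyone who controls an address block’s reverse zone could make their IPs masquerade as a legitimate crawler.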
– **Industry Impact**: Incidents of crawling abuse have led to increased invalid traffic in advertising metrics, with AI crawlers becoming a notable factor.
– **Technical Developments**: Google introduced a new robots.txt user-agent token, Google-Extended, that lets site owners keep their content out of AI training datasets without sacrificing normal search indexing (example below).
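A minimal `robots.txt` using that token: Googlebot continues to crawl for search, while the Google-Extended token opts the site out of use in Google’s AI models:

```
# Keep normal search indexing
User-agent: Googlebot
Allow: /

# Opt out of use for training Google's AI models
User-agent: Google-Extended
Disallow: /
```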
Professionals in AI, cloud, and information security should take note of:
– The need for stringent access controls and compliance measures around AI data-scraping practices.
– Closer scrutiny of inbound traffic to identify abusive scrapers reliably.
– Proactive engagement with cloud providers on crawler abuse, which could lead to collaboration on better-behaved crawling technology or shared compliance frameworks.
The situation underscores the need to balance AI innovation against the operational integrity of the web services it depends on.