Source URL: https://www.theregister.com/2025/03/18/ai_crawlers_sourcehut/
Source: Hacker News
Title: AI crawlers haven’t learned to play nice with websites
Feedly Summary: Comments
AI Summary and Description: Yes
Summary: SourceHut reports that excessive crawling by AI companies’ web crawlers is disrupting its services. These crawlers, which primarily gather data for training large language models (LLMs), have compelled SourceHut to deploy several mitigations, including blocking certain cloud providers. The growing scale of AI bot traffic is an internet-wide problem, producing bandwidth overloads akin to denial-of-service attacks and raising concerns about infrastructure security for web hosts.
Detailed Description:
– **Excessive Demand**: SourceHut highlights ongoing disruptions caused by aggressive crawling for AI training data, which strains bandwidth and degrades overall service delivery. This underscores the challenges cloud-based services and hosting platforms face from AI scraping activity.
– **Mitigation Strategies**: In response, SourceHut has implemented several measures (a blocking sketch appears after this list), including:
  – Deployment of **Nepenthes**, a tar pit designed to trap misbehaving crawlers.
  – Unilateral blocking of major cloud providers, including Google Cloud and Microsoft Azure, because of the volume of bot traffic originating from them.
– **Historical Context**: The company recalls previous experiences with similar issues arising from web scraping, indicating a persistent trend.
– **Generative AI Impact**: The surge in usage of generative AI has exacerbated the problem, highlighting the need for controlled crawling practices within the AI community.
– **AI Providers’ Commitments**: Some AI firms (e.g., OpenAI) have committed to respecting **robots.txt** directives, the standard by which sites tell crawlers which paths they may fetch (a robots.txt check is sketched after this list).
– **Traffic Analysis**: Data from various services shows a staggering number of requests from recognized AI bots (a log-tallying sketch follows the list):
  – OpenAI’s **GPTBot** generated 569 million requests, significantly impacting hosting bandwidth.
  – Across hosting providers, AI bots now account for a notable share of total requests.
– **Spoofing and Abuse**: Spoofed user-agent strings complicate the identification of legitimate crawlers, making web traffic harder to manage (a reverse-DNS verification sketch follows the list).
– **Ad Metrics Concerns**: DoubleVerify reports a rise in general invalid traffic (GIVT), driven in part by AI crawlers, which distorts ad metrics.
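
For the cloud-provider blocking mentioned under Mitigation Strategies, a minimal sketch of the general technique follows, using Python's standard `ipaddress` module. The CIDR ranges shown are placeholders (TEST-NET addresses), not the providers' actual published ranges.

```python
from ipaddress import ip_address, ip_network

# Placeholder ranges standing in for published cloud-provider CIDRs;
# a real deployment would load the providers' current address lists.
BLOCKED_NETWORKS = [
    ip_network("203.0.113.0/24"),   # placeholder "cloud provider A" range
    ip_network("198.51.100.0/24"),  # placeholder "cloud provider B" range
]

def is_blocked(client_ip: str) -> bool:
    """Return True if the client address falls inside any blocked network."""
    addr = ip_address(client_ip)
    return any(addr in net for net in BLOCKED_NETWORKS)

if __name__ == "__main__":
    for ip in ("203.0.113.45", "192.0.2.10"):
        print(ip, "blocked" if is_blocked(ip) else "allowed")
```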
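
On the robots.txt point, a crawler that honors the standard checks permissions before fetching. Below is a minimal sketch using Python's built-in `urllib.robotparser`; the target site is illustrative, and "GPTBot" is OpenAI's documented user-agent token.

```python
from urllib.robotparser import RobotFileParser

# Illustrative site; GPTBot is the user-agent token OpenAI documents.
robots_url = "https://example.com/robots.txt"
user_agent = "GPTBot"

rp = RobotFileParser()
rp.set_url(robots_url)
rp.read()  # fetch and parse the site's robots.txt

target = "https://example.com/some/page"
if rp.can_fetch(user_agent, target):
    print(f"{user_agent} may fetch {target}")
else:
    print(f"{user_agent} is disallowed from {target}")
```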
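
The traffic figures above come from providers' own measurements; a rough way to gauge the AI-bot share of one's own traffic is to tally user-agents in an access log. A sketch assuming a combined-format log at a hypothetical path and an illustrative, incomplete list of bot substrings:

```python
import re
from collections import Counter

# Assumptions: combined log format, hypothetical path, illustrative bot tokens.
LOG_PATH = "access.log"
AI_BOT_TOKENS = ("GPTBot", "ClaudeBot", "Bytespider", "CCBot")

# The last quoted field in the combined log format is the user-agent string.
ua_pattern = re.compile(r'"([^"]*)"\s*$')

counts = Counter()
total = 0
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        total += 1
        match = ua_pattern.search(line)
        ua = match.group(1) if match else "-"
        for token in AI_BOT_TOKENS:
            if token in ua:
                counts[token] += 1
                break

print(f"total requests: {total}")
for token, n in counts.most_common():
    share = 100 * n / total if total else 0
    print(f"{token}: {n} ({share:.1f}%)")
```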
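
Because user-agent strings can be spoofed, operators commonly confirm a claimed crawler by reverse-resolving the client IP and checking that the hostname forward-resolves back to the same address. A sketch using Python's `socket` module; the hostname suffix used here is an assumption for illustration, not a published value.

```python
import socket

def verify_crawler(client_ip: str, expected_suffixes: tuple[str, ...]) -> bool:
    """Reverse-resolve the IP, check the hostname suffix, then confirm the
    hostname resolves forward to the same address (reverse/forward DNS check)."""
    try:
        hostname, _, _ = socket.gethostbyaddr(client_ip)
    except socket.herror:
        return False  # no reverse DNS record
    if not hostname.endswith(expected_suffixes):
        return False  # hostname does not belong to the claimed operator
    try:
        forward_ip = socket.gethostbyname(hostname)
    except socket.gaierror:
        return False
    return forward_ip == client_ip

if __name__ == "__main__":
    # ".openai.com" is an assumed suffix for illustration; consult the
    # operator's published guidance for the real verification domains.
    print(verify_crawler("198.51.100.7", (".openai.com",)))
```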
Key Points:
– The rise of AI crawlers poses risks both to the performance of cloud services and the integrity of data collection frameworks.
– Service providers must strike a balance between enabling access for legitimate crawlers and safeguarding against excessive demands that can degrade service quality.
– Regulatory discussion of ethical data scraping in the AI domain is increasingly needed so that innovation does not compromise the infrastructure and integrity of existing services.
This situation underscores vulnerabilities within cloud infrastructure and raises questions about compliance, governance, and operational security in an increasingly AI-driven ecosystem.