Source URL: https://www.theregister.com/2025/03/18/ai_crawlers_sourcehut/
Source: Hacker News
Title: AI crawlers haven’t learned to play nice with websites
Feedly Summary: Comments
AI Summary and Description: Yes
Summary: SourceHut reports that excessive crawling by AI companies’ web crawlers is disrupting its services. These crawlers, which primarily gather data for training large language models (LLMs), have compelled SourceHut to deploy several mitigations, including blocking certain cloud providers. The growing scale of AI bot traffic is an internet-wide problem, producing bandwidth overloads akin to denial-of-service attacks and raising concerns about infrastructure security for web hosts.
Detailed Description:
– **Excessive Demand**: SourceHut highlights ongoing disruptions caused by aggressive crawling for AI training data, which strains bandwidth and degrades overall service delivery. This underscores the challenges cloud-based services and hosting platforms face from AI scraping activity.
– **Mitigation Strategies**: In response, SourceHut has implemented several measures (a blocking sketch appears after this list), including:
  – Deployment of **Nepenthes**, a tar pit designed to trap misbehaving crawlers.
  – Unilateral blocking of major cloud providers, including Google Cloud and Microsoft Azure, because of the volume of bot traffic originating from them.
– **Historical Context**: The company recalls previous experiences with similar issues arising from web scraping, indicating a persistent trend.
– **Generative AI Impact**: The surge in usage of generative AI has exacerbated the problem, highlighting the need for controlled crawling practices within the AI community.
– **AI Providers’ Commitments**: Some AI firms (e.g., OpenAI) have committed to respecting **robots.txt** directives, the standard by which sites tell crawlers which paths they may fetch (a robots.txt check is sketched after this list).
– **Traffic Analysis**: Data from various services shows a staggering number of requests from recognized AI bots (a log-tallying sketch follows the list):
  – OpenAI’s **GPTBot** generated 569 million requests, significantly impacting hosting bandwidth.
  – Across hosting providers, AI bots now account for a notable share of total requests.
– **Spoofing and Abuse**: Spoofed user-agent strings complicate the identification of legitimate crawlers, making web traffic harder to manage (a reverse-DNS verification sketch follows the list).
– **Ad Metrics Concerns**: DoubleVerify reports a rise in general invalid traffic (GIVT), driven in part by AI crawlers, which distorts ad metrics.
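
For the cloud-provider blocking mentioned under Mitigation Strategies, a minimal sketch of the general technique follows, using Python's standard `ipaddress` module. The CIDR ranges shown are placeholders (TEST-NET addresses), not the providers' actual published ranges.

```python
from ipaddress import ip_address, ip_network

# Placeholder ranges standing in for published cloud-provider CIDRs;
# a real deployment would load the providers' current address lists.
BLOCKED_NETWORKS = [
    ip_network("203.0.113.0/24"),   # placeholder "cloud provider A" range
    ip_network("198.51.100.0/24"),  # placeholder "cloud provider B" range
]

def is_blocked(client_ip: str) -> bool:
    """Return True if the client address falls inside any blocked network."""
    addr = ip_address(client_ip)
    return any(addr in net for net in BLOCKED_NETWORKS)

if __name__ == "__main__":
    for ip in ("203.0.113.45", "192.0.2.10"):
        print(ip, "blocked" if is_blocked(ip) else "allowed")
```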
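
On the robots.txt point, a crawler that honors the standard checks permissions before fetching. Below is a minimal sketch using Python's built-in `urllib.robotparser`; the target site is illustrative, and "GPTBot" is OpenAI's documented user-agent token.

```python
from urllib.robotparser import RobotFileParser

# Illustrative site; GPTBot is the user-agent token OpenAI documents.
robots_url = "https://example.com/robots.txt"
user_agent = "GPTBot"

rp = RobotFileParser()
rp.set_url(robots_url)
rp.read()  # fetch and parse the site's robots.txt

target = "https://example.com/some/page"
if rp.can_fetch(user_agent, target):
    print(f"{user_agent} may fetch {target}")
else:
    print(f"{user_agent} is disallowed from {target}")
```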
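
The traffic figures above come from providers' own measurements; a rough way to gauge the AI-bot share of one's own traffic is to tally user-agents in an access log. A sketch assuming a combined-format log at a hypothetical path and an illustrative, incomplete list of bot substrings:

```python
import re
from collections import Counter

# Assumptions: combined log format, hypothetical path, illustrative bot tokens.
LOG_PATH = "access.log"
AI_BOT_TOKENS = ("GPTBot", "ClaudeBot", "Bytespider", "CCBot")

# The last quoted field in the combined log format is the user-agent string.
ua_pattern = re.compile(r'"([^"]*)"\s*$')

counts = Counter()
total = 0
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        total += 1
        match = ua_pattern.search(line)
        ua = match.group(1) if match else "-"
        for token in AI_BOT_TOKENS:
            if token in ua:
                counts[token] += 1
                break

print(f"total requests: {total}")
for token, n in counts.most_common():
    share = 100 * n / total if total else 0
    print(f"{token}: {n} ({share:.1f}%)")
```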
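
Because user-agent strings can be spoofed, operators commonly confirm a claimed crawler by reverse-resolving the client IP and checking that the hostname forward-resolves back to the same address. A sketch using Python's `socket` module; the hostname suffix used here is an assumption for illustration, not a published value.

```python
import socket

def verify_crawler(client_ip: str, expected_suffixes: tuple[str, ...]) -> bool:
    """Reverse-resolve the IP, check the hostname suffix, then confirm the
    hostname resolves forward to the same address (reverse/forward DNS check)."""
    try:
        hostname, _, _ = socket.gethostbyaddr(client_ip)
    except socket.herror:
        return False  # no reverse DNS record
    if not hostname.endswith(expected_suffixes):
        return False  # hostname does not belong to the claimed operator
    try:
        forward_ip = socket.gethostbyname(hostname)
    except socket.gaierror:
        return False
    return forward_ip == client_ip

if __name__ == "__main__":
    # ".openai.com" is an assumed suffix for illustration; consult the
    # operator's published guidance for the real verification domains.
    print(verify_crawler("198.51.100.7", (".openai.com",)))
```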
Key Points:
– The rise of AI crawlers poses risks both to the performance of cloud services and the integrity of data collection frameworks.
– Service providers must strike a balance between enabling access for legitimate crawlers and safeguarding against excessive demands that can degrade service quality.
– Regulatory discussion of ethical data scraping in the AI domain is increasingly needed so that innovation does not compromise the infrastructure and integrity of existing services.
This situation underscores vulnerabilities within cloud infrastructure and raises questions about compliance, governance, and operational security in an increasingly AI-driven ecosystem.