Slashdot: Web-Scraping AI Bots Cause Disruption For Scientific Databases and Journals

Source URL: https://science.slashdot.org/story/25/06/02/172202/web-scraping-ai-bots-cause-disruption-for-scientific-databases-and-journals?utm_source=rss1.0mainlinkanon&utm_medium=feed
Source: Slashdot
Title: Web-Scraping AI Bots Cause Disruption For Scientific Databases and Journals

Summary: Automated web-scraping bots, driven by the demand for AI training data, are disrupting scientific databases and academic journals. Many repositories report traffic heavy enough to degrade or interrupt service for their users.

Detailed Description: The rise of automated web-scraping bots is raising serious concerns in the fields of AI, cloud computing security, and information security. The influx of bot traffic is largely attributed to the demand for training data to develop increasingly sophisticated AI models. Key points include:

– **Traffic Overload**: Sites like DiscoverLife, which hosts a vast number of species photographs, are experiencing millions of daily hits, effectively rendering them unusable for legitimate users.
– **Recent Developments**: The release of DeepSeek, a Chinese large language model, showed that effective AI models can be built with less computational power, and this has sparked an unprecedented demand for data scraping.
– **Impact on Repositories**: According to the Confederation of Open Access Repositories, more than 90% of surveyed repositories reported AI bot scraping, and nearly two-thirds experienced service disruptions as a result.
– **Disruption of Legitimate Access**: Publishers like BMJ are witnessing bot traffic exceed that of genuine users, leading to server overloads and interruptions in customer services.

This situation underscores the urgent need for effective security measures to manage bot traffic and protect valuable datasets, particularly in domains critical for scientific research and AI development. As these disruptions grow more prevalent, compliance and governance around AI training data and web scraping will also become increasingly relevant for institutions and service providers.

– **Key Implications for Security Professionals**:
  – Implementing better traffic management and anomaly detection systems.
  – Developing policies to differentiate between legitimate requests and bot traffic.
  – Collaborating with AI researchers to find sustainable models of data sharing that do not compromise the integrity of scientific databases.
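
As a rough illustration of the traffic management and anomaly detection point, the sketch below flags clients whose request rate exceeds a threshold within a sliding time window. It is a minimal example under stated assumptions, not a description of what any repository named in the story actually runs: the class and function names, the threshold of 100 requests, and the 60-second window are all hypothetical, and the client identifier is assumed to be something like an IP address or API key taken from server logs.

```python
from collections import defaultdict, deque


class SlidingWindowDetector:
    """Flag clients whose request count inside a time window exceeds a limit.

    Hypothetical helper for illustration; thresholds are not from the source.
    """

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Maps each client id to the timestamps still inside its window.
        self._hits = defaultdict(deque)

    def record(self, client_id: str, timestamp: float) -> bool:
        """Record one request; return True if the client is over the limit."""
        hits = self._hits[client_id]
        hits.append(timestamp)
        # Discard timestamps that have aged out of the window.
        cutoff = timestamp - self.window_seconds
        while hits and hits[0] <= cutoff:
            hits.popleft()
        return len(hits) > self.max_requests


def flag_heavy_clients(events, max_requests=100, window_seconds=60.0):
    """Return the set of client ids that exceeded the limit at least once.

    `events` is an iterable of (client_id, timestamp) pairs, assumed to be
    in ascending timestamp order, as in a typical access log.
    """
    detector = SlidingWindowDetector(max_requests, window_seconds)
    flagged = set()
    for client_id, timestamp in events:
        if detector.record(client_id, timestamp):
            flagged.add(client_id)
    return flagged
```

Calling `flag_heavy_clients` on parsed access-log entries yields the identifiers that warrant throttling or closer inspection. A rate threshold on its own is a blunt instrument: scrapers that rotate addresses stay under any per-client limit, and a busy institutional proxy can exceed it legitimately, so in practice it would likely be one signal among several.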