Slashdot: Web-Scraping AI Bots Cause Disruption For Scientific Databases and Journals

Jun 2, 2025

—

Source URL: https://science.slashdot.org/story/25/06/02/172202/web-scraping-ai-bots-cause-disruption-for-scientific-databases-and-journals?utm_source=rss1.0mainlinkanon&utm_medium=feed
Source: Slashdot
Title: Web-Scraping AI Bots Cause Disruption For Scientific Databases and Journals

AI Summary and Description: Yes

Summary: The text highlights the impact of automated web-scraping bots on scientific databases and academic journals, driven by the demand for training data for AI models. This has led to significant service disruptions, with many repositories experiencing overwhelming traffic that affects their usability.

Detailed Description: The rise of automated web-scraping bots is raising serious concerns in the fields of AI, cloud computing security, and information security. The influx of bot traffic is largely attributed to the demand for training data to develop increasingly sophisticated AI models. Key points include:

– **Traffic Overload**: Sites such as DiscoverLife, which hosts a vast number of species photographs, are receiving millions of hits per day, leaving them effectively unusable for legitimate users.
– **Recent Developments**: The release of DeepSeek, a Chinese large language model, has spurred a surge in data scraping, because it showed that effective AI models can be built with less computational power than previously assumed.
– **Impact on Repositories**: According to the Confederation of Open Access Repositories, a significant majority (over 90%) of surveyed repositories reported instances of AI bot scraping, with nearly two-thirds facing service disruptions.
– **Disruption of Legitimate Access**: Publishers like BMJ are witnessing bot traffic exceed that of genuine users, leading to server overloads and interruptions in customer services.

This situation underscores the urgent need for effective security measures to manage bot traffic and protect valuable datasets, particularly in domains critical for scientific research and AI development. As these disruptions grow more prevalent, compliance and governance around AI training data and web scraping will also become increasingly relevant for institutions and service providers.

**Key Implications for Security Professionals**:

– The need to implement better traffic management and anomaly detection systems.
– The importance of developing policies to differentiate between legitimate requests and bot traffic.
– Collaboration with AI researchers to find sustainable models of data sharing that do not compromise the integrity of scientific databases.
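
As an illustration of the traffic-management point, the following is a minimal sketch of a per-client sliding-window check that flags clients sending more requests than a threshold within a time window. It is not a method described in the source: the class name `SlidingWindowDetector`, its parameters, and the thresholds are hypothetical, and it assumes each request can be attributed to a stable client identifier such as an IP address, which scrapers that spread requests across many addresses may evade.

```python
from collections import defaultdict, deque


class SlidingWindowDetector:
    """Hypothetical helper: flag clients exceeding max_requests per window."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Maps each client identifier to the timestamps of its recent requests.
        self._hits = defaultdict(deque)

    def observe(self, client_id: str, timestamp: float) -> bool:
        """Record one request and return True if the client is over the limit."""
        hits = self._hits[client_id]
        hits.append(timestamp)
        # Discard timestamps that have fallen out of the window.
        while hits and hits[0] <= timestamp - self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_requests


# Example: allow at most 3 requests per 10 seconds for each client.
detector = SlidingWindowDetector(max_requests=3, window_seconds=10.0)
flags = [detector.observe("203.0.113.7", t) for t in (0.0, 1.0, 2.0, 3.0, 20.0)]
```

In this example the fourth request is flagged because it is the fourth within ten seconds, while the request at `t = 20.0` is not, since the earlier timestamps have expired by then. A real deployment would likely combine such a check with other signals rather than rely on request rate alone.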
