web crawling – Experimental News Clipping Site

The Cloudflare Blog: A deeper look at AI crawlers: breaking down traffic by purpose and industry

Aug 28, 2025

Source URL: https://blog.cloudflare.com/ai-crawler-traffic-by-purpose-and-industry/
Source: The Cloudflare Blog
Title: A deeper look at AI crawlers: breaking down traffic by purpose and industry
Feedly Summary: We are extending AI-related insights on Cloudflare Radar with new industry-focused data and a breakdown of bot traffic by purpose, such as training or user action.
AI Summary and Description: Yes
Summary:…

The Register: Perplexity vexed by Cloudflare’s claims its bots are bad

Aug 5, 2025, by Kurt Seifried in Uncategorized

Source URL: https://www.theregister.com/2025/08/05/perplexity_vexed_by_cloudflares_claims/
Source: The Register
Title: Perplexity vexed by Cloudflare’s claims its bots are bad
Feedly Summary: AI search biz insists its content capture and summarization is okay because someone asked for it. AI search biz Perplexity claims that Cloudflare has mischaracterized its site crawlers as malicious bots and that the content delivery network…

Simon Willison’s Weblog: ChatGPT agent triggers crawls from Bingbot and Yandex

Aug 4, 2025, by Kurt Seifried in Uncategorized

Source URL: https://simonwillison.net/2025/Aug/4/chatgpt-agents-agent/#atom-everything
Source: Simon Willison’s Weblog
Title: ChatGPT agent triggers crawls from Bingbot and Yandex
Feedly Summary: ChatGPT agent is the recently released (and confusingly named) ChatGPT feature that provides browser automation combined with terminal access as a feature of ChatGPT – replacing their previous Operator research preview which is scheduled for deprecation on…

The Cloudflare Blog: Perplexity is using stealth, undeclared crawlers to evade website no-crawl directives

Aug 4, 2025, by Kurt Seifried in Uncategorized

Source URL: https://blog.cloudflare.com/perplexity-is-using-stealth-undeclared-crawlers-to-evade-website-no-crawl-directives/
Source: The Cloudflare Blog
Title: Perplexity is using stealth, undeclared crawlers to evade website no-crawl directives
Feedly Summary: Perplexity is repeatedly modifying their user agent and changing IPs and ASNs to hide their crawling activity, in direct conflict with explicit no-crawl preferences expressed by websites.
AI Summary and Description: Yes
Summary: The…

Slashdot: Perplexity is Using Stealth, Undeclared Crawlers To Evade Website No-Crawl Directives, Cloudflare Says

Aug 4, 2025, by Kurt Seifried in Uncategorized

Source URL: https://tech.slashdot.org/story/25/08/04/1459240/perplexity-is-using-stealth-undeclared-crawlers-to-evade-website-no-crawl-directives-cloudflare-says?utm_source=rss1.0mainlinkanon&utm_medium=feed
Source: Slashdot
Title: Perplexity is Using Stealth, Undeclared Crawlers To Evade Website No-Crawl Directives, Cloudflare Says
Feedly Summary:
AI Summary and Description: Yes
Summary: The report highlights ethical concerns regarding the web crawling practices of the AI startup Perplexity. By using undetected methods to bypass website restrictions on automated access, this behavior…

The Register: Anubis guards gates against hordes of LLM bot crawlers

Jul 9, 2025, by Kurt Seifried in Uncategorized

Source URL: https://www.theregister.com/2025/07/09/anubis_fighting_the_llm_hordes/
Source: The Register
Title: Anubis guards gates against hordes of LLM bot crawlers
Feedly Summary: Using proof of work to block the web-crawlers of ‘AI’ companies. Anubis is a sort of CAPTCHA test, but flipped: instead of checking visitors are human, it aims to make web crawling prohibitively expensive for companies trying…

The Cloudflare Blog: From Googlebot to GPTBot: who’s crawling your site in 2025

Jul 1, 2025, by Kurt Seifried in Uncategorized

Source URL: https://blog.cloudflare.com/from-googlebot-to-gptbot-whos-crawling-your-site-in-2025/
Source: The Cloudflare Blog
Title: From Googlebot to GPTBot: who’s crawling your site in 2025
Feedly Summary: From May 2024 to May 2025, crawler traffic rose 18%, with GPTBot growing 305% and Googlebot 96%.
AI Summary and Description: Yes
Summary: The text discusses the evolution of web crawlers, particularly focusing on the…

The Cloudflare Blog: Control content use for AI training with Cloudflare’s managed robots.txt and blocking for monetized content

Jul 1, 2025, by Kurt Seifried in Uncategorized

Source URL: https://blog.cloudflare.com/control-content-use-for-ai-training/
Source: The Cloudflare Blog
Title: Control content use for AI training with Cloudflare’s managed robots.txt and blocking for monetized content
Feedly Summary: Cloudflare is making it easier for publishers and content creators of all sizes to prevent their content from being scraped for AI training by managing robots.txt on their behalf.
AI…

Slashdot: Microsoft’s Plan To Fix the Web: Letting Every Website Run AI Search for Cheap

May 19, 2025, by Kurt Seifried in Uncategorized

Source URL: https://tech.slashdot.org/story/25/05/19/1729259/microsofts-plan-to-fix-the-web-letting-every-website-run-ai-search-for-cheap
Source: Slashdot
Title: Microsoft’s Plan To Fix the Web: Letting Every Website Run AI Search for Cheap
Feedly Summary:
AI Summary and Description: Yes
Summary: Microsoft has introduced NLWeb, an innovative open protocol aimed at enhancing AI-driven search features for websites and applications, allowing for natural language queries to be processed efficiently.…

Slashdot: AI Crawlers Haven’t Learned To Play Nice With Websites

Mar 19, 2025, by Kurt Seifried in Uncategorized

Source URL: https://slashdot.org/story/25/03/19/1027251/ai-crawlers-havent-learned-to-play-nice-with-websites?utm_source=rss1.0mainlinkanon&utm_medium=feed
Source: Slashdot
Title: AI Crawlers Haven’t Learned To Play Nice With Websites
Feedly Summary:
AI Summary and Description: Yes
Summary: SourceHut is experiencing service disruptions due to aggressive web crawling by AI companies collecting data for training large language models (LLMs). They have implemented mitigations, including blocking certain cloud providers due to…

Tag: web crawling