Source URL: https://blog.cloudflare.com/perplexity-is-using-stealth-undeclared-crawlers-to-evade-website-no-crawl-directives/
Source: The Cloudflare Blog
Title: Perplexity is using stealth, undeclared crawlers to evade website no-crawl directives
Feedly Summary: Perplexity is repeatedly modifying their user agent and changing IPs and ASNs to hide their crawling activity, in direct conflict with explicit no-crawl preferences expressed by websites.
AI Summary and Description: Yes
Summary: The text discusses the stealth crawling behavior of Perplexity, an AI-powered answer engine that attempts to evade website crawling rules. It highlights issues of trust in web crawling practices and contrasts Perplexity’s behavior with that of responsible crawlers like OpenAI’s ChatGPT. The implications of this behavior on web content protection and bot management systems are also addressed, providing insights for professionals in AI security, information security, and cloud computing.
Detailed Description: The text outlines critical observations regarding Perplexity’s crawling practices, shedding light on ethical concerns within the realm of AI-based web crawling. Here are the key points:
- **Stealth Crawling Behavior**:
  - Perplexity disguises its crawling activity by modifying its user agent and rotating IPs and ASNs (Autonomous System Numbers) to avoid website blocks.
  - It frequently ignores robots.txt directives, the standard mechanism by which sites express their crawling preferences.
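As a concrete illustration, a site expressing the no-crawl preference discussed in the post might publish a robots.txt like the one below. The bot names are Perplexity's publicly declared user agents; whether a crawler actually honors the file is, as the post shows, a separate question.

```
# Disallow Perplexity's declared crawlers site-wide
User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /
```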
- **Customer Complaints**:
  - Customers reported that Perplexity's crawling persisted despite explicit disallow rules in their robots.txt files and firewall rules blocking its declared bots.
  - Tests using newly created, unindexed domains confirmed that Perplexity could still return detailed content from those sites, indicating effective evasion strategies.
- **Obfuscation Techniques**:
  - Perplexity uses both its declared user agents and generic browser identifiers to bypass restrictions, demonstrating a lack of adherence to web crawling norms.
  - Crawling activity was observed across numerous domains, using a rotating set of IPs to further circumvent security measures like those implemented by Cloudflare.
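The rotating-IP pattern described above leaves a statistical fingerprint in server logs: one user agent string appearing from an unusually large pool of source addresses. A minimal detection sketch (the function name, thresholds, and record format are illustrative assumptions, not anything from the post) might look like:

```python
from collections import defaultdict

def flag_rotating_crawlers(records, min_ips=5, min_requests=20):
    """Flag user agents whose traffic arrives from many distinct IPs.

    `records` is an iterable of (ip, user_agent) pairs, e.g. parsed
    from access logs. A single UA string fanned out across many source
    addresses is one heuristic signal of a crawler rotating IPs to
    evade per-IP rate limits or blocks. This is a rough first-pass
    filter, not a substitute for a real bot-management system.
    """
    by_ua = defaultdict(lambda: {"ips": set(), "count": 0})
    for ip, ua in records:
        by_ua[ua]["ips"].add(ip)
        by_ua[ua]["count"] += 1
    return [
        ua for ua, stats in by_ua.items()
        if len(stats["ips"]) >= min_ips and stats["count"] >= min_requests
    ]
```

In practice this heuristic would be combined with other signals (request timing, TLS fingerprints, ASN reputation), since legitimate traffic behind large NATs can also span many IPs.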
- **Best Practices for Crawlers**:
  - The Internet favors crawlers that act transparently, respect web directives, and serve clear purposes. The post cites OpenAI's ChatGPT crawlers as an example of these best practices: they declare themselves and handle website preferences responsibly.
- **Protection Measures**:
  - Recommendations for website operators include implementing bot management systems that challenge unauthorized crawling attempts.
  - Cloudflare has introduced managed rules and features allowing website owners to block or manage AI crawlers more effectively.
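For sites on Cloudflare, one building block for such management is a custom WAF rule written in Cloudflare's filter expression language. The sketch below is illustrative only (it matches Perplexity's declared user agents, which by definition cannot catch the stealth traffic the post describes; that requires the managed bot-detection features):

```
# Cloudflare custom rule expression (illustrative), with action "Block":
(http.user_agent contains "PerplexityBot") or (http.user_agent contains "Perplexity-User")
```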
- **Looking Ahead**:
  - The text emphasizes that crawling behaviors continue to evolve, requiring adaptive countermeasures.
  - Efforts to establish clear principles and standards for ethical crawling practices are underway, with organizations such as the IETF playing a key role.
This analysis provides vital insight into the challenges posed by AI crawlers and the evolving strategies for protecting content from unauthorized scraping, reinforcing the importance of robust security measures for website owners and security professionals.