Source URL: https://blog.cloudflare.com/perplexity-is-using-stealth-undeclared-crawlers-to-evade-website-no-crawl-directives/
Source: The Cloudflare Blog
Title: Perplexity is using stealth, undeclared crawlers to evade website no-crawl directives
Feedly Summary: Perplexity is repeatedly modifying their user agent and changing IPs and ASNs to hide their crawling activity, in direct conflict with explicit no-crawl preferences expressed by websites.
AI Summary and Description: Yes
Summary: The text discusses the stealth crawling behavior of Perplexity, an AI-powered answer engine that attempts to evade website crawling rules. It highlights issues of trust in web crawling practices and contrasts Perplexity’s behavior with that of responsible crawlers like OpenAI’s ChatGPT. The implications of this behavior on web content protection and bot management systems are also addressed, providing insights for professionals in AI security, information security, and cloud computing.
Detailed Description: The text outlines critical observations regarding Perplexity’s crawling practices, shedding light on ethical concerns within the realm of AI-based web crawling. Here are the key points:
- **Stealth Crawling Behavior**:
  - Perplexity disguises its crawling activity by modifying its user agent and rotating IPs and ASNs (Autonomous System Numbers) to avoid website blocks.
  - It frequently ignores robots.txt directives, the standard mechanism by which sites express their crawling preferences.
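As a concrete illustration, a site expressing the no-crawl preference discussed in the post might publish a robots.txt like the one below. The bot names are Perplexity's publicly declared user agents; whether a crawler actually honors the file is, as the post shows, a separate question.

```
# Disallow Perplexity's declared crawlers site-wide
User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /
```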
- **Customer Complaints**:
  - Customers reported that Perplexity's crawling persisted despite explicit disallow rules in their robots.txt files and firewall rules blocking its declared bots.
  - Tests using newly created, unindexed domains confirmed that Perplexity could still return detailed content from those sites, indicating effective evasion strategies.
- **Obfuscation Techniques**:
  - Perplexity uses both its declared user agents and generic browser identifiers to bypass restrictions, demonstrating a lack of adherence to web crawling norms.
  - Crawling activity was observed across numerous domains, using a rotating set of IPs to further circumvent security measures like those implemented by Cloudflare.
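The rotating-IP pattern described above leaves a statistical fingerprint in server logs: one user agent string appearing from an unusually large pool of source addresses. A minimal detection sketch (the function name, thresholds, and record format are illustrative assumptions, not anything from the post) might look like:

```python
from collections import defaultdict

def flag_rotating_crawlers(records, min_ips=5, min_requests=20):
    """Flag user agents whose traffic arrives from many distinct IPs.

    `records` is an iterable of (ip, user_agent) pairs, e.g. parsed
    from access logs. A single UA string fanned out across many source
    addresses is one heuristic signal of a crawler rotating IPs to
    evade per-IP rate limits or blocks. This is a rough first-pass
    filter, not a substitute for a real bot-management system.
    """
    by_ua = defaultdict(lambda: {"ips": set(), "count": 0})
    for ip, ua in records:
        by_ua[ua]["ips"].add(ip)
        by_ua[ua]["count"] += 1
    return [
        ua for ua, stats in by_ua.items()
        if len(stats["ips"]) >= min_ips and stats["count"] >= min_requests
    ]
```

In practice this heuristic would be combined with other signals (request timing, TLS fingerprints, ASN reputation), since legitimate traffic behind large NATs can also span many IPs.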
- **Best Practices for Crawlers**:
  - The Internet favors crawlers that act transparently, respect web directives, and serve clear purposes. The post cites OpenAI's ChatGPT crawlers as an example of these best practices: they declare themselves and handle website preferences responsibly.
- **Protection Measures**:
  - Recommendations for website operators include implementing bot management systems that challenge unauthorized crawling attempts.
  - Cloudflare has introduced managed rules and features allowing website owners to block or manage AI crawlers more effectively.
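For sites on Cloudflare, one building block for such management is a custom WAF rule written in Cloudflare's filter expression language. The sketch below is illustrative only (it matches Perplexity's declared user agents, which by definition cannot catch the stealth traffic the post describes; that requires the managed bot-detection features):

```
# Cloudflare custom rule expression (illustrative), with action "Block":
(http.user_agent contains "PerplexityBot") or (http.user_agent contains "Perplexity-User")
```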
- **Looking Ahead**:
  - The text emphasizes that crawling behaviors continue to evolve, requiring adaptive countermeasures.
  - Efforts to establish clear principles and standards for ethical crawling practices are underway, with organizations such as the IETF playing a key role.
This analysis provides vital insight into the challenges posed by AI crawlers and the evolving strategies for protecting content from unauthorized scraping, reinforcing the importance of robust security measures for website owners and security professionals.