Perplexity Caught Bypassing Web Scraping Restrictions

April 20, 2026 · 1 min read

TL;DR

Cloudflare research claims the AI startup ignored site access rules, raising fresh questions about how AI companies collect training data.

Cloudflare has published research accusing AI startup Perplexity of systematically scraping content from websites that explicitly blocked such activity. The internet infrastructure provider claims Perplexity used technical methods to obscure its identity and bypass restrictions set by website owners.

According to Cloudflare's findings, Perplexity allegedly changed its bots' user agents and source autonomous system numbers (ASNs) to circumvent blocks. Cloudflare says it detected the activity, which reportedly spanned tens of thousands of domains and millions of requests per day, using machine learning and network signal analysis.

Perplexity spokesperson Jesse Dwyer dismissed the allegations as a "sales pitch" and claimed the identified bot wasn't theirs. In email correspondence with TechCrunch, Dwyer asserted that Cloudflare's screenshots showed no content was actually accessed during the alleged scraping attempts.

The conflict emerged after Cloudflare customers complained that scraping continued even though they had added robots.txt rules and specific blocks targeting Perplexity's bots. Cloudflare says its own tests confirmed the circumvention attempts: when Perplexity's declared crawler was blocked, requests arrived under a generic browser user agent mimicking Google Chrome.
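
To illustrate why a changed user agent matters, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not Cloudflare's test setup or Perplexity's code. The robots.txt contents, the crawler name `PerplexityBot`, the Chrome-style user agent string, the example URL, and the helper `is_allowed` are all assumptions made for illustration. The sketch shows that a robots.txt rule is matched against whatever name the client declares, so a rule aimed at a named crawler does not apply to a request that presents itself as an ordinary browser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""

# Illustrative user agent strings (assumptions, not taken from the report).
DECLARED_CRAWLER_UA = "PerplexityBot"
GENERIC_BROWSER_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the robots.txt rules permit this user agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/article"
    for ua in (DECLARED_CRAWLER_UA, GENERIC_BROWSER_UA):
        print(f"{ua[:20]!r}: allowed={is_allowed(ua, url)}")
```

Because robots.txt is advisory and relies on clients identifying themselves honestly, operators who want enforcement typically pair it with network-level blocking, which is the kind of measure Cloudflare describes.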

This isn't the first time Perplexity has faced scraping allegations. In 2024, news outlets including Wired accused the company of plagiarizing content. At TechCrunch's Disrupt 2024 conference, CEO Aravind Srinivas struggled to define plagiarism when questioned about the company's practices.

Cloudflare has taken concrete action, delisting Perplexity's bots from its verified list and implementing new blocking techniques. The company has recently positioned itself against aggressive AI scraping, launching tools and marketplaces to help website owners control and monetize AI data access.

The timing coincides with growing industry tension around AI training data sourcing. As AI companies increasingly rely on web scraping for model development, infrastructure providers and content creators are pushing back against what they see as unauthorized data collection.

Cloudflare CEO Matthew Prince previously warned that "AI is breaking the business model of the internet," particularly affecting publishers. The company's recent initiatives reflect broader industry efforts to establish clearer rules for AI data collection practices.