Cloudflare vs. the AI Crawlers: A Last Stand for the Open Web

Cloudflare recently announced a new feature that blocks AI crawling bots on behalf of its customers. If you're unfamiliar with a crawler, it's essentially software that analyzes web pages and categorizes the information it finds.

In the age of search, crawlers helped search engines like Google build the massive indexes they use to return relevant search results. Now, generative AI chat tools need to do the same thing to return appropriate responses to users. 

Blocking Bots

Publishers have always had the option to block Google's crawlers or other bots, but doing so is akin to financial suicide for any publisher that relies on advertising revenue. Nobody with advertising as their primary business model would ever block a Google bot, since that would limit the chances of their pages surfacing in search results, which bring traffic and eyeballs to their ads. 

In the age of AI, when users receive the information they need directly in a response, whether that be a Google AI Overview, ChatGPT, Perplexity, or another AI tool, they have less reason to visit a site with the same information. Publishers can instruct bots not to crawl their website by updating the "robots.txt" file on their root domain and disallowing the bots' user agent. 

If you check out the New York Times' robots.txt file, you can see entries that block user agents associated with both ChatGPT and Anthropic:

User-agent: ChatGPT-User
Disallow: /

User-agent: anthropic-ai
Disallow: /

These blocks are not surprising given the New York Times copyright infringement lawsuit against OpenAI. 

Companies are not required to obey robots.txt and can choose to ignore it; the file itself doesn't technically block a bot. But tools like those offered by Cloudflare do technically block the bot before it can hit a web server hosting content. 
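To make the voluntary nature of robots.txt concrete, here is a minimal sketch of how a well-behaved crawler might consult the rules shown above before fetching a page. It uses Python's standard urllib.robotparser module; the is_allowed helper and the example.com URLs are illustrative assumptions, not anything the New York Times or an AI vendor publishes.

```python
from urllib.robotparser import RobotFileParser

# The two entries quoted above, embedded as a string so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: ChatGPT-User
Disallow: /

User-agent: anthropic-ai
Disallow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ChatGPT-User", "anthropic-ai", "Googlebot"):
        print(agent, is_allowed(agent, "https://example.com/article"))
```

Nothing forces a crawler to run a check like this, which is why the file works as a request rather than a barrier.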

Unsurprisingly, some AI scrapers may be acting nefariously to circumvent restrictions. It’s rumored that some players (Perplexity, for example) might be spoofing user agents to get around blocking mechanisms. Generative AI search tools are in fierce competition, and they may seek any means to gain an edge, despite ethical or legal concerns. 
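To see why a filter based only on the user agent is easy to evade, consider this small sketch. It is an illustration of the weakness, not a description of how any particular vendor blocks bots. The blocked tokens come from the robots.txt entries above; the function name and both User-Agent strings are made up for the example.

```python
# Tokens taken from the robots.txt entries above.
BLOCKED_TOKENS = ("chatgpt-user", "anthropic-ai")


def naive_user_agent_block(user_agent: str) -> bool:
    """Return True if the self-reported User-Agent contains a blocked token."""
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_TOKENS)


# Illustrative header values: one bot that declares itself honestly, and the
# same hypothetical bot presenting a browser-like string instead.
declared = "Mozilla/5.0 (compatible; ChatGPT-User/1.0)"
spoofed = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/124.0 Safari/537.36"

if __name__ == "__main__":
    print(naive_user_agent_block(declared))  # blocked
    print(naive_user_agent_block(spoofed))   # slips through
```

Because the header is self-reported, the filter only stops bots that choose to identify themselves.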

But Cloudflare recognizes these shady practices and uses advanced analysis to detect spoofing and other evasion mechanisms. The company now asks all new clients if they wish to block any bot it finds (while providing the same tools to existing customers). 

Additionally, Cloudflare is looking to create or facilitate a "pay per crawl" market that would compensate content creators for the content they produce by giving publishers the capability to signal to AI agents that they have to pay to crawl a site. 

Cloudflare wants to do this using existing HTTP status codes:

Pay per crawl integrates with existing web infrastructure, leveraging HTTP status codes and established authentication mechanisms to create a framework for paid content access. 

Each time an AI crawler requests content, they either present payment intent via request headers for successful access (HTTP response code 200), or receive a 402 Payment Required response with pricing. Cloudflare acts as the Merchant of Record for pay per crawl and also provides the underlying technical infrastructure.

So Cloudflare could allow or deny access to content using HTTP status codes, and act as the broker to collect payment:

An important mechanism here is that even if a crawler doesn’t have a billing relationship with Cloudflare, and thus couldn’t be charged for access, a publisher can still choose to ‘charge’ them. This is the functional equivalent of a network level block (an HTTP 403 Forbidden response where no content is returned) — but with the added benefit of telling the crawler there could be a relationship in the future.
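The flow described above can be sketched as a small decision function. This is a rough illustration under stated assumptions, not Cloudflare's implementation: the header names (x-crawl-max-price, x-crawl-price, x-crawl-charged), the Response type, and the handle_crawl function are all hypothetical, and real billing would happen elsewhere.

```python
from dataclasses import dataclass, field
from decimal import Decimal, InvalidOperation


@dataclass
class Response:
    status: int
    headers: dict = field(default_factory=dict)
    body: str = ""


def handle_crawl(request_headers: dict, price: Decimal,
                 has_billing_relationship: bool, content: str) -> Response:
    """Return 200 with content if the crawler can and will pay, else 402 with the price."""
    offered_raw = request_headers.get("x-crawl-max-price")  # hypothetical header
    try:
        offered = Decimal(offered_raw) if offered_raw is not None else None
    except InvalidOperation:
        offered = None  # treat an unparseable offer as no payment intent

    if has_billing_relationship and offered is not None and offered >= price:
        # Payment intent accepted: serve the content and report the charge.
        return Response(200, {"x-crawl-charged": str(price)}, content)

    # No intent, too low an offer, or no way to bill: advertise the price, return no content.
    return Response(402, {"x-crawl-price": str(price)})


if __name__ == "__main__":
    price = Decimal("0.01")
    print(handle_crawl({"x-crawl-max-price": "0.05"}, price, True, "article text").status)
    print(handle_crawl({}, price, False, "article text").status)
```

Note that a crawler with no billing relationship lands in the same 402 branch as one that simply did not offer to pay, which mirrors the point that "charging" such a crawler works like a block while still advertising a price.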

It's a cool idea and an admirable pursuit, but it also feels like a last-ditch effort by a company that, with its usage-based pricing, is sure to see a monetary impact if the web contracts in any way. After all, Cloudflare offers web infrastructure services to site owners, so if fewer people are navigating the web, then there is less need for Cloudflare's services. Much like ad tech companies and publishers that rely on open web advertising, the rise of generative AI chat and search directly threatens Cloudflare's business.

Meanwhile, Perplexity offered a glimpse of how a different (more closed) model to compensate publishers could work. As you might imagine, there are some notable differences between Cloudflare’s and Perplexity’s approaches: 

  • The Cloudflare model compensates publishers for any crawl for AI training, whereas Perplexity only shares revenue when it uses specific publisher content to formulate a generative AI response. 

  • Additionally, Cloudflare's approach aims to establish an open standard for all companies to adopt, whereas Perplexity's model is a direct arrangement between a publisher and Perplexity.

In either case, we are inexorably marching toward a future where AI relegates any content creator to a mere input mechanism for insatiable AI models. Google once controlled the front door to most content, but now that entryway is closing in favor of an always-on digital information siphon.

Each publisher that adopts Cloudflare's pay-per-crawl model, inks an AI content licensing deal, or joins Perplexity's revenue-sharing program is attempting to extract value from a seismic shift occurring in the economics of open web content creation. However, they might also be signing their own death certificates, as they give more credence to the very models that will obviate the need to visit any publisher's property.

The future of the web

So what's to become of content creation on the web? If you project forward trends that are already happening, I think this is where you end up:

  1. Web traffic is declining due to generative AI chat and search.

  2. Web publishers are losing advertising revenue due to fewer visits to their properties.

  3. Advertising revenue is shifting to destinations not impacted by this phenomenon. 

  4. Remaining web publishers and content publishers are pivoting their business models to capitalize on these trends.

I don't believe that all web publishing will die, but a majority will. Gone are websites that rely on search traffic, such as SEO parasites that cater to simple queries like "where can I stream tonight's NFL game?" or recipe sites. 

Would you rather scroll this ad-riddled hellscape to find a chicken marinade or have it presented to you in perfect formatting with the ability to ask follow-up questions? The answer is clear. 

Ad-riddled hellscape or beautifully formatted perfection? You choose.

Cloudflare's pay-per-crawl model reminds me of the doomed micro-transaction model, where users would pay a minimal fee to access content. But everyone quickly realized that paying for content from unknown entities made little to no sense, as users were too accustomed to free access to general content. Users, however, were willing to pay media brands and individual content creators they trust.

Twitch proved this model in video for individual creators, and it has since transitioned to the written word with platforms like Substack. Investors may understand that unique human thoughts amid a sea of AI-generated content will stand out and hold value in the future, as Substack recently raised $100 million at a $1.1 billion valuation.

Forming a direct connection with users via subscriptions remains a sustainable business model in our AI-driven future. And perhaps it's not all doom and gloom for advertising on the web.

If the web contracts and web advertising supply along with it, then naturally, one would expect prices to rise on the remaining inventory. If publishers can endure the initial reduction in traffic and adopt alternative business models (such as subscriptions or AI content licensing), then the remaining web inventory might see some price improvement due to scarcity.

The more likely scenario is that we will see an acceleration of advertising spend flowing to alternative sources. The bad news for most publishers is that the most likely landing spot for this revenue is the very tools displacing the entire web, along with the social platforms pumping out AI-generated content that will keep the masses entertained in a dopamine-fueled endless scroll of blissful satisfaction. 

The good news for streaming publishers is that generative AI is less likely to displace video content in the short term, and could make this inventory more valuable as supply from the web dries up. CTV becomes increasingly valuable as advertisers seek to expand their reach beyond the behemoth platforms, with nowhere else to turn.

I weep for the future (humans)

Despite the best efforts of companies like Cloudflare, it's hard not to be pessimistic about the future of web publishing that relies on advertising.

As open web advertising economics decline, more content will shift to walled gardens. In these gardens, content creators might be required to allow the use of their content to train models (e.g., Gemini in the case of Google, or Llama in the case of Meta) if they want to publish on YouTube or Instagram. 

To take the most pessimistic view possible, most content creation will likely move beyond human control. Impossibly beautiful influencers (because they aren't real) are gaining follower counts rapidly. Google's NotebookLM generates podcasts on demand about any topic. And virtual content creators are now emerging in YouTube videos. 

There will always be a craving for authentic, artisanally crafted human content, but it could eventually represent a small minority of all content on the internet. Cultural backlash against AI-generated content is real, but I think most humans will gravitate toward any content that entertains or delights them.
