September 23, 2025

From SEO to AIO: Why Your Content Needs to Exist in AI Training Data

The era of traditional search engine optimization is rapidly evolving into "AIO" (AI optimization): to remain discoverable as users increasingly turn to AI assistants for answers, businesses must ensure their content exists in AI training datasets. That shift is already driving real business impact, making presence in AI training data as strategically vital as traditional search rankings once were.
Stephen Burns
Stephen Burns is Web Intelligence Lead at the Common Crawl Foundation.

I've been working in search for over two decades, from the Open Directory Project days through building search engines at Blekko to my current role leading web intelligence at Common Crawl. But it wasn't until a customer found my motorcycle repair side hustle through ChatGPT that the magnitude of what's happening really hit me.

A few years back, I restored an old BMW and started fixing bikes out of my garage in Redwood City. Being obsessive about local SEO, I made sure the shop ranked well locally. Then something unexpected happened: customers started showing up saying an AI assistant told them about me. They'd asked ChatGPT where to get their motorcycle fixed, and it sent them to my garage.

That's when it clicked: being visible in AI systems isn't just about future-proofing anymore. It's already driving real business today.

The Fundamental Shift in User Behavior

For decades, we optimized for short, typed queries. "Buy shoes." "Hotels Bangkok." "Pizza near me." Two or three words into a search box, ten blue links back. The SEO playbook was straightforward: get indexed, then fight for rankings.

But watch how people interact with ChatGPT, Claude, or Perplexity now. They don't type "hotels Bangkok." They say: "I'm planning a trip to Bangkok for November, want something boutique near the old city with a pool, not too touristy, under $200 a night, what are my options?"

That's a twenty-word question, often spoken aloud. Users no longer want ten links to evaluate; they want the single synthesized answer, the filtered recommendation, the complete plan. They expect the system to read, compare, decide, and deliver.

This shift from two typed words to twenty spoken ones represents the most disruptive change search has ever seen. And it's why discovery itself is being re-architected from the ground up.

When Invisibility Becomes Life-Threatening

The stakes couldn't be higher. Children's Hospital of Los Angeles is one of the top pediatric cancer centers in the United States. Yet when parents search within Gemini or ChatGPT for "where should I take my child with leukemia in LA?" they don't see CHLA.

Why? The hospital's site sits behind Cloudflare, whose default settings inject a robots.txt that blocks AI crawlers, including our CCBot at Common Crawl. In the world of AI-powered search, this premier hospital effectively doesn't exist.

This goes beyond lost web traffic. Families may be unable to find potentially life-saving care for their children, not because the hospital deliberately opted out of AI systems, but because a default setting at its content delivery provider blocked AI crawlers without the hospital realizing it.
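
To make the failure mode concrete: a CDN-managed robots.txt that blocks AI crawlers often amounts to just a few lines. The snippet below is a simplified, hypothetical illustration (not CHLA's actual file), using the crawler names discussed in this post:

```
# Hypothetical robots.txt injected by a CDN's "block AI bots" default
User-agent: CCBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

# Traditional search crawlers remain allowed
User-agent: *
Allow: /
```

To an AI crawler, those two lines per bot are the difference between existing and not existing in the next training snapshot.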

Understanding the New Discovery Pipeline

To grasp why this matters, you need to understand how large language models actually work. LLMs aren't real-time systems. They're trained on static snapshots of the web, a process that takes weeks or months. Once training concludes, the model's knowledge is frozen. That's why ChatGPT will tell you "my knowledge cutoff is April 2023" or similar.

Retrieval-augmented generation (RAG) bridges this gap. When you ask a question, these systems can fetch fresh web pages and blend that current information with the model's frozen knowledge base. That's how Perplexity can tell you about yesterday's news despite its underlying model knowing nothing about it.
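
In code terms, the flow is simple. Here is a minimal sketch of the RAG pattern, with `search_web` and `call_llm` as hypothetical stand-ins for whatever live index and model a given assistant actually uses:

```python
# Minimal sketch of retrieval-augmented generation (RAG).
# search_web and call_llm are hypothetical stand-ins, not any
# particular vendor's API.

def search_web(query: str, top_k: int = 3) -> list[str]:
    # Stand-in for a live web index; a real system would call a
    # search API and return page snippets here.
    return [f"(snippet {i} retrieved for: {query})" for i in range(top_k)]

def call_llm(prompt: str) -> str:
    # Stand-in for the frozen model; a real system would call a
    # chat-completion API here.
    return f"(answer grounded in the {prompt.count('snippet')} snippets above)"

def answer_with_rag(question: str) -> str:
    # 1. Retrieve: fetch fresh pages the model was never trained on.
    snippets = search_web(question)
    # 2. Augment: splice the retrieved text into the prompt.
    context = "\n\n".join(snippets)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    # 3. Generate: frozen knowledge plus fresh context.
    return call_llm(prompt)

print(answer_with_rag("What happened in yesterday's news?"))
```

The model's frozen weights still decide how the question is interpreted and which sources look relevant; the retrieval step only supplies the fresh text.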

Here's the critical nuance: Technically, you don't need to be in the training data to appear in RAG results. If your page is crawlable and indexed by whatever live source the system uses, you could theoretically be pulled in.

But practically, if your brand, product, or entity is absent from the training data, the model doesn't know you exist. It won't expand queries to include you. It won't recognize you as relevant. Your retrieval chances plummet.

Being in the training corpus doesn't guarantee retrieval, but being absent from it dramatically lowers your odds of ever being surfaced. Visibility at the crawl layer has become as strategically vital as backlinks once were.

The Opt-Out Crisis We're Witnessing

At Common Crawl, we're experiencing this shift firsthand. After the New York Times blocked AI crawlers, we saw a wave of copycats. Publishers of all sizes, some polite, others threatening legal action, demanded removal from our dataset.

We created an Opt-Out Registry. When a site requests exclusion or threatens us legally, we don't just remove them from Common Crawl. We flag them for the entire ecosystem: OpenAI, Meta, Amazon, researchers, everyone.

For these publishers, exclusion isn't temporary. It's essentially permanent.

Here's the irony: AI models don't actually need these sites to answer user questions. The knowledge surfaces anyway through reviews, forums, citations, and other user-generated content. When a brand opts out, the conversation about them continues. What disappears is their authoritative voice in that conversation.

The Language Divide

Another critical factor is language representation. The vast majority of training data is English. Smaller languages like Estonian, Catalan, and even Thai are massively underrepresented in these models.

For businesses operating in smaller language markets, this is existential. Publish only in Catalan, and your content may be invisible in AI-driven answers because the model lacks sufficient Catalan material to generalize from.

The practical strategy? Publish in English alongside your local language. English acts as the gateway into the model. You're not abandoning your local audience; you're ensuring legibility to the systems that increasingly serve as the first point of discovery.

When Infrastructure Becomes a Ranking Factor

The infrastructure constraints are staggering. Training GPT-4 is estimated to have cost between $78 million and $100 million in compute alone, with the total reaching hundreds of millions once hardware is included. The energy footprint is enormous: some training runs burn through more electricity than a large hydroelectric dam generates in a minute.

These constraints directly shape what gets crawled and processed. Microsoft is reviving Three Mile Island for AI power. Google is investing in small modular nuclear reactors. When computation is this expensive, crawlers must prioritize what they consider high-value content.

Infrastructure is no longer invisible; it's effectively become a ranking factor. Not every site gets equal treatment. If you're deemed lower priority, you might not make it into the next training run.

The New Visibility Funnel

Think of the new discovery funnel this way:

Training Data → LLM → RAG → Your Site → Conversion

At the top, if you're absent from training data, you're missing critical baseline awareness. In the middle, if you're blocked at the crawl layer, you're excluded entirely. At the bottom, if you're invisible to retrieval systems, you don't get surfaced, summarized, or converted.

This is the new reality of online visibility.

What You Need to Do Now

First, audit your crawl accessibility immediately. Check whether your site is being blocked at the CDN level. If you're on Cloudflare, verify that AI crawlers are actually allowed. Don't assume; check your logs. If you don't see CCBot, GPTBot, or ClaudeBot, you may already be invisible.
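
One quick way to check, assuming standard web server access logs (the log path and bot list below are assumptions to adapt to your setup):

```python
# Rough sketch: count AI-crawler requests in an access log.
# LOG_PATH is a hypothetical default; point it at your own logs.
from collections import Counter

AI_BOTS = ["CCBot", "GPTBot", "ClaudeBot"]
LOG_PATH = "/var/log/nginx/access.log"

hits = Counter()
with open(LOG_PATH, encoding="utf-8", errors="ignore") as log:
    for line in log:
        for bot in AI_BOTS:
            if bot in line:
                hits[bot] += 1

for bot in AI_BOTS:
    print(f"{bot}: {hits[bot]} requests")

if not hits:
    print("No AI crawler traffic found; you may already be invisible.")
```

Pair this with a look at your robots.txt as served through the CDN, not just the file you deployed, since that's where injected defaults show up.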

Second, publish in English and your local language. English remains the gateway language into these models.

Third, syndicate and distribute your content. Don't rely on a single domain. Just as backlinks once created resilience, syndication today increases the odds that your content survives preprocessing and makes it into training datasets.

Fourth, monitor the evolving landscape. Defaults change. New players emerge. Not everyone follows the stated rules.

Finally, educate your stakeholders. Many executives still think SEO is about title tags and keyword density. They don't realize their site may already be invisible in the fastest-growing discovery systems on earth.

The Bottom Line

SEO has always been about visibility. That hasn't changed. What's changed is the mechanism.

The old world was about index and rank. The new world is about training and retrieval.

Training data has become the new link graph. The strategic asset isn't your PageRank anymore; it's your presence in the corpus.

If you're not in the crawl, you're not in the model, and if you're not in the model, you may not be in the market.

The choice is yours, but make it consciously. Because right now decisions about your AI visibility might be getting made by your CDN, your legal department, or your trade association without you even knowing it.


Stephen Burns is the Web Intelligence Lead at Common Crawl Foundation, the nonprofit that provides open web data to AI researchers and companies worldwide. He also works in enterprise SEO at U.S. Bank and operates a motorcycle repair shop that customers keep finding through ChatGPT.
