August 11, 2025

AI Optimization Is Here: Are You Ready for Search 2.0?

Note: this post has been marked as obsolete.
Common Crawl Foundation
Common Crawl builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone.

Publishers and brands are shifting from search engine optimization (SEO) to AI optimization (AIO). Many SEOs unknowingly block their sites from AI search by restricting CCBot in robots.txt. As Search 2.0 transforms discovery, ensuring content can train AI models becomes as crucial as traditional SEO.

As an SEO professional, you’ve spent years mastering the art of visibility in Google’s ecosystem. You understand crawlers, you optimize for algorithms, and you’ve built strategies around how search engines discover and rank content.

Now, the landscape is shifting quickly. AI-powered search and answer engines are fundamentally changing how people find information, and many SEOs are discovering that their traditional optimization strategies leave gaps in this new environment.

[Image: a conversation with “Alice and Bob’s Travel Chatbot” in which the user asks “Where’s the best beach hotel in Hawaii?”]
Alice and Bob’s Travel Chatbot could be how your brand’s information is surfaced.

The Evolution from Search 1.0 to Search 2.0

Search 1.0 was dominated by a single player with a clear business model: Google controlled the index, you optimized for their algorithm, and everyone understood the rules. The SEO industry grew to $80 billion, and search ads became a $300 billion market.

Search 2.0 is different. Multiple AI providers are creating search experiences of different kinds, and large language models power everything from ChatGPT’s web search to Google’s AI Overviews to specialized answer engines. Optimization now means ensuring your content can inform, and appear in, those AI-driven responses. Google’s Web Guide experiment in Search Labs is a good illustration of this.

Of course, search algorithms have always involved ML and tuned parameters. Even classical information retrieval (IR) algorithms like BM25 required parameter tuning for optimal results. Modern AI-powered search goes far beyond traditional ranking algorithms, incorporating complex language understanding and generation capabilities that fundamentally change how content gets discovered and presented.
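
To make the “tuned parameters” point concrete, here is a minimal Python sketch of the BM25 term score, using made-up example numbers, that shows where the classic k1 and b parameters sit:

import math

def bm25_term_score(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    # k1 controls term-frequency saturation; b controls length normalization.
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    norm = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm

# Example: a term appearing 3 times in a 120-word page, in a corpus of
# 10,000 documents (average length 180 words) where 50 documents contain it.
print(bm25_term_score(tf=3, df=50, n_docs=10_000, doc_len=120, avg_doc_len=180))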

How AI Models Actually Work (And Where Your Content Fits)

Understanding AIO requires knowing how these systems actually function.

Foundational Model Training

Large models are trained on massive web datasets to understand language, concepts, and factual relationships. The foundation model knows that “United” is both an ordinary dictionary word and the name of an airline.

Common Crawl has played a critical role in AI development by providing open, transparent access to web data since 2007. Major models, including GPT and BERT, as well as countless research projects, have used Common Crawl data as foundational training material.

If your brand exists on the web, it’s likely to already be in these models, but the information might be outdated or incomplete.

Fine-Tuning

Models are then specialized for specific tasks using curated Q&A datasets. A travel chatbot gets fine-tuned on travel-specific conversations, while a customer service bot trains on support interactions.

Bonus: it’s worth looking at Google’s FAQPage structured data documentation and schema.org’s documentation on FAQPage, Question, and Answer to understand more about how your FAQs are processed and surfaced to users searching for answers.
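
As a rough illustration of the markup those docs describe, here is a minimal FAQPage object assembled in Python; the question and answer text are placeholders rather than a recommended template:

import json

# A minimal schema.org FAQPage object; the question and answer text are
# placeholders for your own FAQ content.
faq_jsonld = {
    "@context": "https://schema.org",
    "@type": "FAQPage",
    "mainEntity": [
        {
            "@type": "Question",
            "name": "Do you offer beachfront rooms in Hawaii?",
            "acceptedAnswer": {
                "@type": "Answer",
                "text": "Yes, most of our Hawaii properties have beachfront rooms available.",
            },
        }
    ],
}

# Embed the output in your page inside a <script type="application/ld+json"> tag.
print(json.dumps(faq_jsonld, indent=2))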

Real-Time Retrieval (or Retrieval-Augmented Generation, known as “RAG”)

Models access live information through web searches, databases, and specialized knowledge bases to provide current answers.
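
The retrieval step is easy to picture with a toy sketch. The Python snippet below scores a couple of hypothetical pages (placeholder URLs and text) against a question by simple word overlap and pastes the winner into a prompt; production systems use web search or vector indexes instead, but the flow is the same:

# Two hypothetical indexed pages (the URLs and text are placeholders).
pages = {
    "https://example.com/hawaii-beach-hotels": "Our beachfront hotels in Hawaii include several oceanfront resorts.",
    "https://example.com/airport-shuttles": "Shuttle schedules for major airports, updated weekly.",
}

def retrieve(question):
    # Pick the page with the largest word overlap with the question.
    q_words = set(question.lower().split())
    return max(pages.values(), key=lambda text: len(q_words & set(text.lower().split())))

question = "Where's the best beach hotel in Hawaii?"
context = retrieve(question)
prompt = f"Answer using this page:\n{context}\n\nQuestion: {question}"
print(prompt)
# If your pages are blocked from crawling or retrieval, they never appear
# as context here, and the model answers from whatever else it can find.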

Your content needs to be accessible at all three levels to maximize visibility.

The CCBot Reality Check

Here’s where many SEOs discover an unexpected gap in their strategy: their robots.txt files.

A significant number of websites currently block CCBot (Common Crawl’s web crawler), often without realizing its role in the ML and research ecosystems. Common Crawl publishes monthly web datasets which serve as foundational training data for major AI models and research initiatives.
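
A quick way to see whether your pages actually appear in a recent crawl is to query the Common Crawl URL index. The sketch below uses only Python’s standard library; the crawl ID shown is just one example and the domain is a placeholder, so check index.commoncrawl.org for the list of current indexes:

import json
import urllib.parse
import urllib.request

# CRAWL_ID names one monthly crawl (this one is just an example); the
# current list is published at https://index.commoncrawl.org/.
CRAWL_ID = "CC-MAIN-2025-13"
DOMAIN = "example.com"  # replace with your own domain

query = urllib.parse.urlencode({"url": f"{DOMAIN}/*", "output": "json"})
index_url = f"https://index.commoncrawl.org/{CRAWL_ID}-index?{query}"

with urllib.request.urlopen(index_url) as resp:
    # Each line of the response is one JSON record describing a captured page.
    for line in resp.read().decode("utf-8").splitlines()[:5]:
        record = json.loads(line)
        print(record.get("url"), record.get("status"))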

As SEO professional Ash Nallawalla (author of The Accidental SEO Manager) wrote:

A manager asked me why our leading brand was not mentioned by an AI platform, which mentioned obscure competitors instead. I found that we had been blocking ccBot for some years, because some sites were scraping our content indirectly. After some discussion, we felt that allowing LLM crawlers was more beneficial than the risk of being scraped, so we revised our exclusion list.

If CCBot can’t crawl your site, your content is absent from one of the key datasets on which AI models are trained, potentially making your brand less visible in AI-powered search results.

[Image: network diagram showing a website connected to multiple search engines and AI platforms.]
Modern search visibility requires connections across traditional search engines and AI-powered platforms.

What This Means for Your Strategy

Check your robots.txt file. Look for entries that disallow CCBot or use overly broad blocking rules.
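
A quick way to check is Python’s built-in robots.txt parser; the hostname below is a placeholder for your own site:

from urllib.robotparser import RobotFileParser

# Check whether your live robots.txt allows CCBot to fetch your homepage.
parser = RobotFileParser("https://www.example.com/robots.txt")
parser.read()

for agent in ("CCBot", "Googlebot", "*"):
    allowed = parser.can_fetch(agent, "https://www.example.com/")
    print(agent, "allowed" if allowed else "blocked")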

Consider tradeoffs. Yes, there are legitimate concerns about content usage and AI training. The industry is actively working on standards to address these issues, with organizations like the IETF developing mechanisms for better handling of AI training consent and data usage (see the AIPREF Working Group’s mailing list for the discussions so far).

While those standards are being developed, blocking CCBot means potentially excluding yourself from how your audience increasingly discovers information.

Think strategically. AI optimization isn’t just about allowing crawlers. It’s about creating content that works well in AI contexts, developing structured data strategies, and potentially creating specialized datasets for AI training.

Simple Technical Steps

By default, CCBot interprets the absence of blocking rules as permission to crawl. If you decide inclusion makes sense for your strategy, you simply need to ensure that CCBot isn’t explicitly blocked.

You can also specifically allow it:

User-agent: CCBot
Allow: /

CCBot identifies itself clearly in server logs, respects standardised Robots Exclusion Protocol directives (as well as the non-standard Crawl-delay directive), and we are careful to operate it with full transparency.
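
Because the user-agent string contains “CCBot”, spotting its visits in an access log is straightforward. A small sketch, assuming a standard combined log format and a placeholder log path:

from collections import Counter

# Count the paths CCBot has requested from your server.
paths = Counter()
with open("/var/log/nginx/access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        if "CCBot" not in line or '"' not in line:
            continue
        request = line.split('"')[1]        # e.g. 'GET /pricing HTTP/1.1'
        parts = request.split()
        if len(parts) >= 2:
            paths[parts[1]] += 1

print("Pages CCBot fetched most often:", paths.most_common(5))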

Following the Crawler Best Practices draft by Gary Illyes, Common Crawl has implemented a ccbot.json file at https://index.commoncrawl.org/ccbot.json that provides information about the crawler’s IP ranges. As the draft recommends, you can also find <link rel="help" href="https://index.commoncrawl.org/ccbot.json"> tags on the appropriate Common Crawl web pages, demonstrating our commitment to crawler transparency standards. You can also refer to our CCBot page for more tips on how to verify CCBot, such as via reverse DNS (rDNS).
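
A rough verification sketch along those lines is below: it fetches the published metadata and does a reverse-DNS lookup on an IP taken from your logs. The IP shown is a placeholder, and the structure of ccbot.json should be inspected rather than assumed:

import json
import socket
import urllib.request

# Peek at the published crawler metadata; inspect the JSON rather than
# assuming a fixed field layout.
with urllib.request.urlopen("https://index.commoncrawl.org/ccbot.json") as resp:
    metadata = json.load(resp)
print(json.dumps(metadata, indent=2)[:500])

# Reverse-DNS lookup for an IP taken from your access logs (placeholder IP).
client_ip = "192.0.2.10"
try:
    hostname, _, _ = socket.gethostbyaddr(client_ip)
    print("Reverse DNS:", hostname)
except socket.herror:
    print("No reverse DNS record for", client_ip)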

MERJ’s blog post explains why sites often unintentionally block valuable bots. Their live JSON endpoint tracks bots, including CCBot, and provides IP data in JSON form along with when it was last updated.

The Bigger Picture

AIO is becoming as crucial as traditional SEO. The businesses that figure this out early (understanding how to be visible, how to optimize for AI-powered search, and how to create content that works well in conversational interfaces) will have significant advantages.

The question isn’t whether Search 2.0 is coming, but whether you’ll be ready when your competitors are getting mentioned in AI responses and you’re not.


What's your experience been with AI-powered search and your brand visibility? We'd love to hear how you're thinking about these changes. Get in touch via our Contact Form, join our Google Group, or join our Discord server.

