June 1, 2026

Introducing the AI Visibility Audit

A free guide for SEOs and GEOs on how to check whether AI systems can actually reach a site, and how to stay visible in the crawl that trains them.

Stephen Burns

Stephen Burns is Web Intelligence Lead at the Common Crawl Foundation.

Over the past year I have been travelling and speaking to SEOs at conferences around the world, and the same question keeps coming up: why is a page that ranks well in Google still invisible to ChatGPT, Gemini, Claude, and Perplexity? I wrote this guide to answer it.

Today Common Crawl is publishing The AI Visibility Audit, a free field guide built for the SEOs and GEOs who are already doing this work and want a concrete framework rather than theory. It explains how AI systems actually discover content, why training-data inclusion behaves like a ranking factor, and how to run a repeatable, five-check audit using only free tools in about 90 minutes.

The reason a high-ranking page can go missing sits one layer upstream of everything we as SEOs usually audit. Before on-page work, before technical SEO, before link building, a site has to be reachable by the crawlers that feed AI training data. If it is not, the model never learns it exists.
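As a rough illustration of that upstream check, the sketch below tests whether a robots.txt file lets a given crawler fetch a page. It uses Python's standard `urllib.robotparser`. The sample robots.txt, the `crawler_allowed` helper, and the example.com URL are assumptions made for illustration and are not taken from the guide.

```python
from urllib.robotparser import RobotFileParser


def crawler_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt that blocks CCBot site-wide but allows everyone else.
SAMPLE_ROBOTS = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

if __name__ == "__main__":
    page = "https://example.com/guide"  # placeholder URL
    for agent in ("CCBot", "Googlebot"):
        print(agent, crawler_allowed(SAMPLE_ROBOTS, agent, page))
```

With this sample file, CCBot is refused while Googlebot is allowed, which is the "ranks in Google, missing from the crawl" pattern in miniature. The sketch only reads robots.txt rules; a block applied at the CDN or WAF layer would not show up in it.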

The guide covers four topics: how CCBot crawls the open web and publishes the archive that helps train modern LLMs; how harmonic centrality in the Common Crawl Web Graph sets crawl priority; why CDN and WAF defaults now silently block AI and training-data crawlers; and why AI still leans toward English, which accounts for roughly 41 percent of the latest crawl.
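For readers who have not met harmonic centrality before, here is a toy sketch of the standard definition: a node's score is the sum of 1/d over every other node that can reach it, where d is the shortest number of links to follow. The four-host graph and the `harmonic_centrality` helper are hypothetical, and this is not Common Crawl's actual pipeline, which runs at web scale.

```python
from collections import deque


def harmonic_centrality(links: dict[str, list[str]]) -> dict[str, float]:
    """Harmonic centrality of each node in a directed link graph.

    links maps a host to the hosts it links to.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    # Reverse the edges so a BFS from a node walks back to everything that reaches it.
    incoming: dict[str, list[str]] = {n: [] for n in nodes}
    for source, targets in links.items():
        for target in targets:
            incoming[target].append(source)

    scores: dict[str, float] = {}
    for node in nodes:
        dist = {node: 0}
        queue = deque([node])
        while queue:
            current = queue.popleft()
            for prev in incoming[current]:
                if prev not in dist:
                    dist[prev] = dist[current] + 1
                    queue.append(prev)
        # Unreachable nodes contribute nothing (1/infinity is treated as 0).
        scores[node] = sum((1 / d for n, d in dist.items() if n != node), 0.0)
    return scores


# Hypothetical graph: two hosts link to the hub, and one more links to a.example.
TOY_LINKS = {
    "a.example": ["hub.example"],
    "b.example": ["hub.example"],
    "c.example": ["a.example"],
}

if __name__ == "__main__":
    for host, score in sorted(harmonic_centrality(TOY_LINKS).items()):
        print(host, score)
```

In this toy graph hub.example scores highest because every other host reaches it in one or two hops, while hosts that nothing links to score zero.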

The five checks run from the most decisive to the most strategic, and the results roll up into a one-page scorecard that most agencies do not yet offer.

The old world was index and rank. The new world is train and retrieve. If you are not in the crawl, you are not in the model.

Read the guide, run the checks, and open the door.


