Blog

The latest news, interviews, technologies, and resources.

Measuring Crawled Coverage of a Website in Common Crawl

How can we measure how many pages we’ve crawled from a particular website? The answer is a lot more complicated than you might think.

Greg Lindahl

Greg is Chief Technology Officer at the Common Crawl Foundation.

Analysis

Is one vantage point enough? IPv6 across the top million web hosts

We probed the top 1,000,000 web hosts for IPv6 from five vantage points on three continents. 31.7% work from everywhere, one vantage point turns out to be enough for the headline rate, and 5,530 hosts reveal why it isn't enough for the rest of the story.

Measuring Crawled Coverage of a Website in Common Crawl

Is one vantage point enough? IPv6 across the top million web hosts

Turning 30,000 Arabic Domains Into a Better Crawl

Common Crawl Foundation at LREC 2026

13th Web-as-Corpus Workshop @ EMNLP 2026

Host- and Domain-Level Web Graphs April, May, and June 2026

June 2026 Crawl Archive Now Available

CommonLID Update: New Tools, Growing Impact

Common Crawl Foundation at IIPC-WAC 2026

The Columnar Index Is Now the URL Index

Introducing the AI Visibility Audit

Host- and Domain-Level Web Graphs March, April, and May 2026

May 2026 Crawl Archive Now Available

April 2026 Crawl Archive Now Available in a Hugging Face Storage Bucket

You can now build directly on Common Crawl from the browser

Host- and Domain-Level Web Graphs February, March, and April 2026

April 2026 Crawl Archive Now Available

April 2026 Common Crawl Newsletter

Announcing a Change to Common Crawl Dataset Size Reporting

Host- and Domain-Level Web Graphs January, February, and March 2026

March 2026 Crawl Archive Now Available

IPv6 Adoption Across the Top 100K Web Hosts

Web Graph Statistics Gets a Proper Upgrade

Measuring Web Accessibility from Crawl Archives

Announcing the Whirlwind Tour of Common Crawl's Datasets Using Java

Host- and Domain-Level Web Graphs December 2025 and January/February 2026

Introducing the New Examples & Resources Browser

February 2026 Crawl Archive Now Available

AI Plumbers at FOSDEM’26

CC-Citations: A Visualization of Research Papers Referencing Common Crawl

CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data

Host- and Domain-Level Web Graphs November/December 2025 and January 2026

January 2026 Crawl Archive Now Available

Web Archives for Social Sciences Datathon, Bristol

How SEOs Are Using Common Crawl's Web Graph Data for AI Ranking Signals

GneissWeb Annotations Examples

Common Crawl at the Mozilla Festival 2025

Host- and Domain-Level Web Graphs October, November, December 2025

December 2025 Crawl Archive Now Available

A Sampling of 2025 Research Referencing Common Crawl

Host- and Domain-Level Web Graphs September, October, and November 2025

November 2025 Crawl Archive Now Available

Common Crawl Celebrates World Digital Preservation Day

Setting the Record Straight: Common Crawl’s Commitment to Transparency, Fair Use, and the Public Good

October/November 2025 Newsletter

Common Crawl Foundation at Stanford HAI

Host- and Domain-Level Web Graphs August, September, and October 2025

October 2025 Crawl Archive Now Available

Common Crawl Foundation at COLM 2025

Announcing GneissWeb Annotations

Web Languages Needing Review by Native Speakers

Host- and Domain-Level Web Graphs July, August, and September 2025

From SEO to AIO: Why Your Content Needs to Exist in AI Training Data

September 2025 Crawl Archive Now Available

Common Crawl Foundation Opt-Out Registry

Trip Report: AI_dev (Linux Foundation) August 2025

Common Crawl Foundation at Stanford HAI: A Shared Legacy of Data and Innovation

July/August 2025 Newsletter

Host- and Domain-Level Web Graphs June, July, and August 2025

August 2025 Crawl Archive Now Available

Common Crawl Foundation at ACL 2025

AI Optimization Is Here: Are You Ready for Search 2.0?

IETF 123 Report

Host- and Domain-Level Web Graphs May, June, and July 2025

July 2025 Crawl Archive Now Available

WMDQS Shared Task on Language Identification

The First WMDQS-Masakhane LangID Hackathon

Host- and Domain-Level Web Graphs April, May, and June 2025

Common Crawl at the United Nations Open Source Week, June 2025

June 2025 Crawl Archive Now Available

May/June 2025 Newsletter

Announcing the Whirlwind Tour of Common Crawl's Datasets using Python

Host- and Domain-Level Web Graphs March, April, and May 2025

May 2025 Crawl Archive Now Available

Announcing the First Workshop on Multilingual Data Quality Signals

Host- and Domain-Level Web Graphs February, March, and April 2025

April 2025 Crawl Archive Now Available

Introducing the Host Index

IIPC General Assembly & Web Archiving Conference 2025