Blog

The latest news, interviews, technologies, and resources.

Common Crawl Blog

The Columnar Index Is Now the URL Index

June 3, 2026

We have renamed the Columnar Index to the URL Index, to be clearer about its purpose and to pave the way for more datasets in a columnar format.

Introducing the AI Visibility Audit

June 1, 2026

A free guide for SEOs and GEOs on how to check whether AI systems can actually reach a site, and how to stay visible in the crawl that trains them.

Host- and Domain-Level Web Graphs March, April, and May 2026

May 29, 2026

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of March, April, and May 2026. The graphs consist of 262.4 million nodes and 8.1 billion edges at the host level, and 118.8 million nodes and 4.3 billion edges at the domain level.

May 2026 Crawl Archive Now Available

May 25, 2026

We are happy to announce the release of the May 2026 crawl archive, consisting of 2.16 billion web pages, or 365.56 TiB of uncompressed content.

April 2026 Crawl Archive Now Available in a Hugging Face Storage Bucket

May 20, 2026

As an early experiment in distributing Common Crawl data through another channel, the April 2026 crawl archive is now available in a Hugging Face Storage Bucket, alongside its existing home on AWS S3.

You can now build directly on Common Crawl from the browser

May 6, 2026

Browsers can now fetch Common Crawl data directly, with no backend needed. Build SQL explorers, snapshot viewers, and diff tools as static pages.

Host- and Domain-Level Web Graphs February, March, and April 2026

April 30, 2026

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of February, March, and April 2026. The graphs consist of 269.0 million nodes and 9.4 billion edges at the host level, and 124.6 million nodes and 4.8 billion edges at the domain level.

April 2026 Crawl Archive Now Available

April 28, 2026

We are pleased to announce that the crawl archive for April 2026 is now available, containing 2.19 billion web pages or 379.2 TiB of uncompressed content.

April 2026 Common Crawl Newsletter

April 6, 2026

Check out our newsletter for April 2026, with updates on what we've been up to.

Announcing a Change to Common Crawl Dataset Size Reporting

April 1, 2026

Common Crawl is switching to reporting dataset sizes in nibbles. As an organisation dedicated to data preservation, we feel it would be remiss to allow this underrepresented unit to fall out of use. Our latest crawl now exceeds 689 tebibbles.
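The arithmetic behind that figure can be sketched in a few lines. This assumes that a "tebibble" means 2^40 nibbles and that the "latest crawl" is the March 2026 archive listed on this page at 344.64 TiB; both are readings of the teaser, not statements from the post, and the function name is illustrative.

```python
# A nibble is 4 bits, so every 8-bit byte holds exactly two of them.
NIBBLES_PER_BYTE = 2


def tib_to_tebinibbles(tib: float) -> float:
    """Convert tebibytes to tebinibbles (assumed: 2**40 nibbles each)."""
    return tib * NIBBLES_PER_BYTE


# Assumed input: the March 2026 crawl size reported elsewhere on this page.
print(round(tib_to_tebinibbles(344.64), 2))  # 689.28
```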

Host- and Domain-Level Web Graphs January, February, and March 2026

March 24, 2026

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of January, February, and March 2026. The graphs consist of 270.2 million nodes and 9 billion edges at the host level, and 120 million nodes and 4.4 billion edges at the domain level.

March 2026 Crawl Archive Now Available

March 19, 2026

We are pleased to announce the release of the March 2026 crawl, containing 1.97 billion web pages, or 344.64 TiB of uncompressed content. We also observed a dramatic increase in fetches over IPv6, explained by the enabling of Happy Eyeballs in the OkHttp library.

IPv6 Adoption Across the Top 100K Web Hosts

March 16, 2026

We probed the 100,000 most-linked web hosts for IPv6 support using the Common Crawl Web Graph. Only 36.9% are fully reachable over IPv6, with adoption ranging from 71% among the top 100 to 32% in the long tail.

Web Graph Statistics Gets a Proper Upgrade

March 6, 2026

Our Web Graph Statistics site has been updated with interactive charts, a domain lookup tool for tracking harmonic centrality and PageRank over time, mobile improvements, unified rank tables with OR filtering, and merged degree plots.

Measuring Web Accessibility from Crawl Archives

March 2, 2026

A WCAG colour contrast audit of 240 top domains using Common Crawl's February 2026 archive finds that four in ten colour pairings fall short of accessibility thresholds. Only one in five sites is fully compliant.
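For readers unfamiliar with how a colour pairing is scored, here is a minimal sketch of the WCAG 2.x contrast-ratio calculation. The teaser does not describe the audit's own pipeline, so this is only the standard formula; the function names are illustrative, and the 4.5:1 and 3:1 cut-offs are the WCAG AA thresholds for normal and large text.

```python
def _linearize(channel: int) -> float:
    """Convert one 8-bit sRGB channel to linear light (WCAG 2.x definition)."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb: tuple[int, int, int]) -> float:
    """Relative luminance of an (R, G, B) colour, from 0.0 (black) to 1.0 (white)."""
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(fg: tuple[int, int, int], bg: tuple[int, int, int]) -> float:
    """Contrast ratio between two colours, from 1.0 (identical) up to 21.0."""
    l1, l2 = relative_luminance(fg), relative_luminance(bg)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa(fg, bg, large_text: bool = False) -> bool:
    """Check a pairing against the WCAG AA threshold (4.5:1, or 3:1 for large text)."""
    return contrast_ratio(fg, bg) >= (3.0 if large_text else 4.5)
```

Black text on a white background scores the maximum ratio of 21:1, while a mid-grey such as (119, 119, 119) on white falls just under the 4.5:1 threshold for normal text.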

Announcing the Whirlwind Tour of Common Crawl's Datasets Using Java

February 26, 2026

Introducing the second installment in our Whirlwind Tour series, covering crawl structure, index access, and content extraction, giving developers a practical foundation for building Java-based data workflows.

Host- and Domain-Level Web Graphs December 2025 and January/February 2026

February 24, 2026

We're happy to announce the release of the Web Graphs for December 2025 and January/February 2026, consisting of 288.6 million nodes and 12.4 billion edges at the host level, and 134.2 million nodes and 5.4 billion edges at the domain level.

Introducing the New Examples & Resources Browser

February 23, 2026

We've replaced our old Examples and Use Cases pages with a single searchable, filterable browser. 119 resources from 115 contributors, all in one place. Search, filter by type or language, sort, and share links. We welcome community submissions.

February 2026 Crawl Archive Now Available

February 22, 2026

We are pleased to announce the release of the February 2026 crawl, consisting of 2.1 billion web pages (or 363 TiB of uncompressed content). Captures are from 45.5 million hosts or 37.1 million registered domains.

AI Plumbers at FOSDEM’26

February 16, 2026

Common Crawl was invited to the AI Plumbers unconference held at FOSDEM this year. The contrast between the 100 people at the unconference and the 10,000 at the main event could not have been bigger.

CC-Citations: A Visualization of Research Papers Referencing Common Crawl

February 11, 2026

We are proud to release an interactive visualization of thousands of research papers using or citing Common Crawl data.

CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data

February 10, 2026

We are excited to announce the release of CommonLID, a language identification benchmark for the web, covering 109 languages. CommonLID was developed in collaboration with multiple open-source organizations and language community groups.

Host- and Domain-Level Web Graphs November/December 2025 and January 2026

January 30, 2026

The latest Web Graphs from the November and December 2025 and January 2026 crawls are now available, comprising 279.4 million host-level nodes with 13.4 billion edges, and 122.3 million domain-level nodes with 6.1 billion edges.

January 2026 Crawl Archive Now Available

January 28, 2026

We are pleased to announce the release of the January 2026 crawl archive, containing 2.3 billion web pages, or 398 TiB of uncompressed content.

Web Archives for Social Sciences Datathon, Bristol

January 26, 2026

Recently, a two-day Bristol datathon used Common Crawl web archives to analyse UK industries and policy, strengthening social science research through hands-on, team-based work.
