June 25, 2024

May/June 2024 Newsletter

We’re pleased to share our newsletter for May/June 2024, featuring the latest updates, events, and highlights from our community.

Greg Lindahl

Greg is Chief Technology Officer at the Common Crawl Foundation.

Common Crawl Celebrates Our 100th Crawl since 2008

Our latest crawl, May 2024, marks a milestone for Common Crawl – our 100th crawl since we began crawling in 2008! Many people have been involved in making this happen over the years, and we’d like to thank all of the emeritus members of our team: Ahad Rahna, Lisa Green, Allison Domicone, Jordan Mendelson, Stephen Merity, Julien Nioche, Sara Crouse, and Alex Xue. Thank you from all of the current members of our team!

AI and the Right to Learn on an Open Internet

Panel on the Risks to the Open Internet, moderated by Mike Masnick. Left to right: Michael Weinberg, Cara Gagliano, Richard Gingras, Michael Brawer, and Mike Masnick. Photo credit: Quinn Kowitt

On April 30th, the Common Crawl Foundation hosted an event in New York for a select group of leaders in AI, technology, media, and content. The conference, co-hosted with Professor Jeff Jarvis, was intended to foster an open dialogue between stakeholders about how to achieve a common goal of supporting a right to learn on an open Internet. The one-day event, held at the Craig Newmark Graduate School of Journalism at CUNY, featured opening remarks, firestarter mini-sessions, panel discussions, demo time, and networking opportunities. Topics of discussion ranged from the risks to the Open Internet, fair use, and large language model training to smart uses of AI in journalism, business models, and solutions. The conference was sponsored by Kearney, Tola Capital, and CCIA.

Recent Research Using Common Crawl Data

Hyperlink Hijacking: Exploiting Erroneous URL Links to Phantom Domains - WWW ’24 Conference
When Online Content Disappears - Pew Research
Misinformation Resilient Search Rankings with Webgraph-based Interventions - CMU
Harmony in the Australian Domain Space - UT Sydney
WebGraph: The Next Generation (Is in Rust) - Inria
mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus
Common Crawl's BibTeX index (recently updated)
Google Scholar search for Common Crawl since 2024

Updates to Our Data Products – Help Wanted!

Our summer intern, Ford Heilizer, has been hard at work making a tool that transforms our usual WARC/WAT/WET data into a table. If the first thing that you do when you download our data is to stick everything in a table, please contact us at info@commoncrawl.org. We'd love any advice you have to offer!

We are also thinking about a project to define a round-trippable format that maps a WARC 1:1 to files in a ZIP archive, with the WARC metadata saved in spreadsheets. We hope this new format will be useful for users who want to process just a couple of WARCs' worth of data on a laptop. If this interests you, please contact us!

We made a couple of small updates to our existing interfaces.

If you use the Web Graph, https://index.commoncrawl.org/graphinfo.json now contains the list of crawls in each Web Graph. If you use the CDX index, https://index.commoncrawl.org/collinfo.json now has two new fields, “from” and “to”, giving the exact dates when crawling started and ended.
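As a rough illustration of how the new fields might be read, here is a minimal Python sketch. It assumes that collinfo.json is a JSON array of objects, each with an "id" key alongside the new "from" and "to" keys. The function names are hypothetical, and the date values are kept as plain strings because their exact format is not described here.

```python
import json
from urllib.request import urlopen

# URL taken from the newsletter text above.
COLLINFO_URL = "https://index.commoncrawl.org/collinfo.json"


def crawl_date_ranges(collinfo_text):
    """Map each crawl id to its ("from", "to") pair.

    Assumption: the document is a JSON array of objects with an "id" key.
    Entries missing any of the three keys are skipped rather than guessed at.
    """
    ranges = {}
    for entry in json.loads(collinfo_text):
        crawl_id = entry.get("id")
        if crawl_id is None or "from" not in entry or "to" not in entry:
            continue
        ranges[crawl_id] = (entry["from"], entry["to"])
    return ranges


def fetch_collinfo(url=COLLINFO_URL):
    """Download the raw collinfo.json text (requires network access)."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8")
```

Calling `crawl_date_ranges(fetch_collinfo())` should then give a dictionary keyed by crawl id. Parsing is kept separate from fetching so that the parsing step can be tried on a saved copy of the file.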

Volunteer for Common Crawl!

Volunteers have made significant contributions to Common Crawl over the years: technologists who love the data, people who have used the data and want to contribute some code in return, and researchers who have written a paper and open-sourced their code.

We also have a list of relatively simple tasks to get you started. Please contact us at info@commoncrawl.org if interested.

Erratum:

Content is truncated

Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, this limit has been increased to 5 MiB.
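The two thresholds above can be encoded in a small helper. This is only a sketch: the function name is hypothetical, and it assumes crawl identifiers of the form CC-MAIN-YYYY-WW, so that the year and week number can be compared against CC-MAIN-2025-13. Identifiers in any other form are rejected rather than guessed at.

```python
import re

MIB = 1024 * 1024

# Assumed identifier shape: CC-MAIN-<4-digit year>-<2-digit week>.
_CRAWL_ID = re.compile(r"CC-MAIN-(\d{4})-(\d{2})")


def truncation_limit_bytes(crawl_id):
    """Return the truncation threshold in bytes stated for a given crawl.

    1 MiB for crawls before CC-MAIN-2025-13, 5 MiB from that crawl onwards.
    """
    match = _CRAWL_ID.fullmatch(crawl_id)
    if match is None:
        raise ValueError(f"unrecognized crawl id: {crawl_id!r}")
    year, week = int(match.group(1)), int(match.group(2))
    # Tuples compare element by element: year first, then week.
    return 5 * MIB if (year, week) >= (2025, 13) else 1 * MIB
```

A record whose payload length is at or near the returned value is a candidate for having been cut off, though the helper itself only reports the limit and does not inspect any data.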
