November 3, 2025

October/November 2025 Newsletter

Check out our newsletter for October/November 2025, with updates on what we've been up to.

Common Crawl Foundation

Common Crawl builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone.

Event Highlights

Web Languages

GneissWeb Annotations

SEO to AIO

Common Crawl Opt-out Registry

IETF 124 Montréal

Event Highlights

Common Crawl recently presented a seminar at the Stanford Institute for Human-Centered Artificial Intelligence (HAI) entitled “Preserving Humanity's Knowledge and Making it Accessible: Addressing Challenges of Public Web Data”. For highlights and slides, see our blog post.

The WMDQS team at COLM: Sebastian Nagel, Pedro Ortiz Suarez, Laurie Burchell, Thom Vaughan, and Malte Ostendorff

The Common Crawl team attended the 2nd Conference on Language Modeling (COLM) in Montréal, organizing the first Workshop on Multilingual Data Quality Signals (WMDQS), giving invited talks, and strengthening links with the research community. For more details and papers featuring Common Crawl, see our blog post.

Left-to-right: Thom Vaughan, Colin Ho, Sammy Sidhu, and Pedro Ortiz Suarez, at AI_dev in Amsterdam.

In late August, our team attended the Linux Foundation’s AI_dev event in Amsterdam. We caught up with many familiar faces, made plenty of new friends, and heard from people working in the AI world who use our data regularly. For more on this event, see our trip report.

Web Languages

Web Languages, our public GitHub repository containing Markdown files for languages, asks native or proficient speakers to add URLs in the following categories: news, culture and history, government, political parties, other, and informative links in English. Thanks to community contributions, Web Languages now has 5,535 URLs in 193 languages. We are currently looking for native speakers to review contributions for 42 languages, and we welcome contributions for other languages.

GneissWeb Annotations

Common Crawl has added IBM’s GneissWeb quality and category annotations to its web dataset, enabling users to filter high-quality content and explore topics like medicine, education, and technology. For more details, see our blog post.

SEO to AIO

We have published a new post titled From SEO to AIO: Why Your Content Needs to Exist in AI Training Data, which is a follow-up to our post earlier this year on AI Optimization.

Common Crawl Opt-out Registry

Publishers have been sending Common Crawl legal opt-out requests. In the interest of transparency and to better serve our ecosystem, we are publishing the full opt-out list for every legal request we have received. For more details, please see our announcement post.

IETF 124 Montréal

This week Common Crawl will be represented at IETF 124 in Montréal, covering the AI Preferences and Web Bot Auth working groups, and presenting at the Measurement and Analysis for Protocols research group. We are excited to contribute to these conversations that shape the open standards which govern the web, and the future of access to online content. Our mission to provide open web data at large scale is aligned with the IETF’s goals of transparency, interoperability, and public benefit.

If you’re attending IETF 124 and would like to chat about web-scale data, open measurement, or responsible crawling, please get in touch. We’d love to meet.

Erratum:

Content is truncated

Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, this limit has been increased to 5 MiB.
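
As a rough illustration of the thresholds described above, the following Python sketch maps a crawl identifier to the fetch size limit that would have applied. The function name `truncation_limit_bytes` is hypothetical, and the sketch assumes that CC-MAIN-2025-13 is the March 2025 crawl, that is, the first crawl with the 5 MiB limit, and that crawl identifiers follow the `CC-MAIN-YYYY-WW` pattern.

```python
import re

MIB = 1024 * 1024

# Assumption: CC-MAIN-2025-13 is the March 2025 crawl, i.e. the first
# crawl that used the larger 5 MiB limit.
FIRST_5_MIB_CRAWL = (2025, 13)


def truncation_limit_bytes(crawl_id: str) -> int:
    """Return the fetch size limit (in bytes) assumed for a crawl ID.

    Hypothetical helper: it only understands IDs of the form
    CC-MAIN-YYYY-WW and raises ValueError for anything else.
    """
    match = re.fullmatch(r"CC-MAIN-(\d{4})-(\d{2})", crawl_id)
    if match is None:
        raise ValueError(f"unrecognized crawl ID: {crawl_id!r}")
    year, week = int(match.group(1)), int(match.group(2))
    # Tuples compare element by element: year first, then week.
    if (year, week) >= FIRST_5_MIB_CRAWL:
        return 5 * MIB
    return 1 * MIB


if __name__ == "__main__":
    for crawl in ("CC-MAIN-2024-51", "CC-MAIN-2025-13"):
        print(crawl, truncation_limit_bytes(crawl))
```

A payload whose stored length reaches the limit for its crawl is a candidate for having been truncated, so downstream processing may want to treat such records with care.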
