Common Crawl - Blog - July/August 2025 Newsletter

Stanford HAI Seminar in October

Common Crawl Foundation is thrilled to present at an upcoming Stanford Institute for Human-Centered Artificial Intelligence (HAI) Seminar entitled Preserving Humanity's Knowledge and Making it Accessible: Addressing Challenges of Public Web Data.

Learn about Common Crawl's insights from a recent data product and informed solutions for the future of public web data. The seminar takes place on Wednesday, October 22, from noon to 1:15 pm. For registration (in person and virtual) and more details, please see the event listing. The Common Crawl team (including several of our engineers!) will be around for a few hours after the talk for follow-up chats.

Summer Event Highlights

The Common Crawl team attended the 63rd Annual Meeting of the Association for Computational Linguistics (ACL) in Vienna, presenting recently published work and strengthening links with the research community. More details about the event and links to papers with and about Common Crawl can be found in our recent blog post.

In July we had the happy opportunity to attend IETF 123, held at the Meliã Castilla in Madrid. As ever, the event was packed full of discussions, new draft proposals, and connections from the Internet protocol community. More details in our blog post.

Back in June, the Common Crawl Foundation team was in New York City for the United Nations Open Source Week and several industry side events. Over the course of the week we engaged with developers, researchers, and policymakers on all things related to Open Source and AI. For highlights from the week, see our blog post.

The First WMDQS-Masakhane LangID Hackathon

In June 2025 the Common Crawl Foundation, MLCommons, and EleutherAI had the pleasure of hosting a virtual hackathon in partnership with Masakhane in order to collect language identification annotations for African languages. For more about the hackathon as well as the Shared Task on Improving Language Identification for Web Text (to be held at COLM in October) see our blog post.

SEO to AIO, Search 1.0 to 2.0

Publishers and brands are shifting from SEO to AIO. Many SEOs unknowingly block their sites from AI search by restricting CCBot in robots.txt. As Search 2.0 transforms discovery, ensuring content can train AI models becomes as crucial as traditional SEO. Read our recent in-depth post on this topic.
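
For site owners who want a quick check of whether their robots.txt restricts CCBot, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The helper name `ccbot_allowed`, the sample robots.txt contents, and the `example.com` URLs are illustrative assumptions, and this is not a description of how CCBot itself evaluates rules.

```python
from urllib.robotparser import RobotFileParser


def ccbot_allowed(robots_txt: str, url: str) -> bool:
    """Return True if the given robots.txt text lets the CCBot user agent fetch url.

    Hypothetical helper for illustration; it relies on the standard-library
    parser's interpretation of the rules.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("CCBot", url)


# Sample robots.txt that blocks CCBot from the whole site (assumed content).
BLOCKING = """User-agent: CCBot
Disallow: /
"""

# Sample robots.txt that only restricts one directory for all agents (assumed content).
PERMISSIVE = """User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    print(ccbot_allowed(BLOCKING, "https://example.com/"))
    print(ccbot_allowed(PERMISSIVE, "https://example.com/"))
```

To test a live site, the same parser can be pointed at a real robots.txt with `set_url()` and `read()` instead of `parse()`.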

Erratum: Content is truncated

Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, this limit has been increased to 5 MiB.
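
As a rough sketch of the thresholds described above, the following helper maps a crawl ID to the truncation limit that applied to it. It assumes crawl IDs of the form `CC-MAIN-YYYY-WW` (older crawls may use other naming) and reads CC-MAIN-2025-13 as the first crawl with the 5 MiB limit; the function name `truncation_limit_bytes` is hypothetical.

```python
import re

MIB = 1024 * 1024

# Assumed crawl ID pattern: CC-MAIN-<4-digit year>-<2-digit week>.
_CRAWL_ID = re.compile(r"^CC-MAIN-(\d{4})-(\d{2})$")


def truncation_limit_bytes(crawl_id: str) -> int:
    """Return the fetch size limit in bytes for a crawl ID (illustrative only)."""
    match = _CRAWL_ID.match(crawl_id)
    if match is None:
        raise ValueError(f"unrecognized crawl ID format: {crawl_id!r}")
    year, week = int(match.group(1)), int(match.group(2))
    # Crawls from CC-MAIN-2025-13 onwards use 5 MiB; earlier crawls used 1 MiB.
    return 5 * MIB if (year, week) >= (2025, 13) else 1 * MIB


if __name__ == "__main__":
    for crawl in ("CC-MAIN-2024-51", "CC-MAIN-2025-13"):
        print(crawl, truncation_limit_bytes(crawl))
```

A payload whose stored length equals the limit for its crawl is a candidate for having been cut off, though length alone does not prove truncation.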
