Common Crawl - Blog - March/April 2025 Newsletter

Event Updates

We have been busy participating in events this winter and spring. In February, we presented at the HPLT Winter School, which this year focused on Pre-training Data Quality and Multilingual LLM Evaluation. Also in February, we attended the AI Action Summit (see separate post below).

Valyu x Common Crawl x UCL: AI Agents, Crawling and the Future of the Web was a co-hosted event in London this February to discuss AI-driven retrieval, web crawling for AI agents, AI preference signaling, opt-in/opt-out models, and related topics.

In March, we attended SXSW in Austin, where we hosted a networking social, attended others’ socials and events, and met up with many partners and friends of Common Crawl.

Left-to-right: Benjamin Diggles, Chris Tolles, Erik Bethel, Aidan Clifford, speaking at the Digital Chamber Summit in Washington DC in 2025

Also in March, we participated in a panel discussion on AI and blockchain with our partner Constellation Network at the DC Blockchain Summit. Watch the complete panel discussion here, and learn more about Constellation Network’s launch of its Digital Evidence product at the summit in our blog post.

In April, we attended the IIPC Web Archiving Conference. Stay tuned for a full report in a separate blog post coming soon!

AI Action Summit + ROOST Launch

In February, Common Crawl attended the AI Action Summit in Paris, which saw the launch of several projects, standards, and partnerships. A coalition of major technology companies and foundations announced the launch of ROOST: Robust Online Open Safety Tools (https://roost.tools). ROOST makes critical data and tools for online safety openly accessible to benefit everyone, a mission that closely aligns with ours at Common Crawl. To learn more about the ROOST launch, please see our blog post.

Submission to UK Copyright and AI Consultation

The frontispiece of Ted Nelson's Computer Lib/Dream Machines (1974). Original image by John R. Neill for L. Frank Baum's Tik-tok of Oz (1914).

Common Crawl made a submission to the UK Copyright and AI Consultation supporting a legal exception for text and data mining (TDM) while respecting creators’ rights. Read the full submission in our blog post.

Common Crawl AI Agent by ReadyAI

Two logos side-by-side, ReadyAI and Common Crawl

Announcing the launch of an experimental AI Agent, developed by our friends at ReadyAI

We recently announced the launch of an experimental AI Agent, developed by our friends at ReadyAI. The agent offers a conversational interface designed to help users explore Common Crawl’s data, use cases, and community initiatives. Learn more about the agent in our blog post, and try it out here.

Language Updates

At the end of last year, we introduced two new language initiatives, LangID and web-languages. LangID, our annotation campaign for language identification in collaboration with MLCommons, now has over 600 contributions. Learn more and contribute to the LangID task here. In our web-languages project, we are asking speakers of Languages Other Than English (LOTE) to contribute URLs of websites that they know and that contain content written in their language. The project has had 21 pull requests merged since our last newsletter. Our web-languages GitHub repo has more details on contributing to the project.

A chart showing modified language files in our web-languages repository

Erratum:

Content is truncated

Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, this limit has been increased to 5 MiB.
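As a rough illustration of the thresholds described above, the following Python sketch maps a crawl ID to the fetch size limit that applied to it. The function names are hypothetical and not part of any Common Crawl tooling, and the sketch assumes crawl IDs of the form CC-MAIN-YYYY-WW. Treating a payload that reaches the limit as possibly truncated is a heuristic assumed here for illustration, not a documented rule.

```python
import re

MIB = 1024 * 1024

# First crawl with the larger limit, taken from the erratum text
# (CC-MAIN-2025-13, the March 2025 crawl).
FIRST_5_MIB_CRAWL = (2025, 13)


def truncation_limit_bytes(crawl_id: str) -> int:
    """Return the fetch size limit in bytes for a crawl ID (hypothetical helper)."""
    match = re.fullmatch(r"CC-MAIN-(\d{4})-(\d{2})", crawl_id)
    if match is None:
        raise ValueError(f"unrecognized crawl ID: {crawl_id!r}")
    year, week = int(match.group(1)), int(match.group(2))
    # Tuple comparison orders crawls by year first, then by week number.
    if (year, week) >= FIRST_5_MIB_CRAWL:
        return 5 * MIB
    return 1 * MIB


def may_be_truncated(payload_length: int, crawl_id: str) -> bool:
    """Heuristic (an assumption): a payload at or above the limit may be cut off."""
    return payload_length >= truncation_limit_bytes(crawl_id)
```

For example, a 1 MiB payload would be flagged for a crawl before CC-MAIN-2025-13 but not for a later one, since it falls well under the 5 MiB limit.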
