Table of Contents
Event Updates
AI Action Summit + ROOST Launch
Submission to UK Copyright and AI Consultation
Common Crawl AI Agent by Ready AI
Language Updates
Event Updates
We have been busy participating in events this winter and spring. In February, we presented at the HPLT Winter School, which focused this year on Pre-training Data Quality and Multilingual LLM Evaluation. Also in February, we attended the AI Action Summit (see separate post below).
Valyu x Common Crawl x UCL: AI Agents, Crawling and the Future of the Web was a co-hosted event in London this February to discuss AI-driven retrieval, web crawling for AI agents, AI preference signaling, opt-in/opt-out models, and related topics.
In March, we attended SXSW in Austin, where we hosted a networking social, joined others’ socials and events, and met up with many partners and friends of Common Crawl.

Also in March, we participated in a panel discussion on AI and blockchain with our partner Constellation Network at the DC Blockchain Summit. Watch the complete panel discussion here, and learn more about Constellation Network’s launch of its Digital Evidence product at the summit in our blog post.
In April, we attended the IIPC Web Archiving Conference. Stay tuned for a full report in a separate blog post coming soon!
AI Action Summit + ROOST Launch
In February, Common Crawl attended the AI Action Summit in Paris, which saw the launch of several projects, standards, and partnerships. A coalition of major technology companies and foundations announced the launch of ROOST: Robust Online Open Safety Tools (https://roost.tools). ROOST makes critical data and tools for online safety openly accessible to benefit everyone, a mission that closely aligns with ours at Common Crawl. To learn more about the ROOST launch, please see our blog post.
Submission to UK Copyright and AI Consultation

Common Crawl made a submission to the UK Copyright and AI Consultation supporting a legal exception for text and data mining (TDM) while respecting creators’ rights. Read the full submission in our blog post.
Common Crawl AI Agent by Ready AI

We recently announced the launch of an experimental AI Agent, developed by our friends at ReadyAI. The agent offers a conversational interface designed to help users explore Common Crawl’s data, use cases, and community initiatives. Learn more about the agent in our blog post, and try it out here.
Language Updates
At the end of last year, we introduced two new language initiatives, LangID and web-languages. LangID, our annotation campaign for language identification in collaboration with MLCommons, now has over 600 contributions. Learn more and contribute to the LangID task here. Our web-languages project, in which we ask speakers of Languages Other Than English (LOTE) to contribute URLs of websites that they know and that contain content written in their language, has had 21 pull requests merged since our last newsletter. Our web-languages GitHub repo has more details on contributing to the project.

Erratum:
Content is truncated
Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, this limit has been increased to 5 MiB.
For more details, see our truncation analysis notebook.
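For readers who filter records by size, the thresholds above can be expressed as a small lookup. This is a minimal sketch, not part of Common Crawl's tooling: the function names `truncation_limit_bytes` and `may_be_truncated` are hypothetical, and the code assumes that crawl IDs follow the `CC-MAIN-YYYY-WW` pattern and that CC-MAIN-2025-13 is the first crawl with the 5 MiB limit.

```python
import re

MIB = 1024 * 1024

# Assumption: CC-MAIN-2025-13 (March 2025) is the first crawl using the 5 MiB limit.
FIRST_5MIB_CRAWL = (2025, 13)

# Assumption: crawl IDs look like "CC-MAIN-YYYY-WW".
_CRAWL_ID = re.compile(r"^CC-MAIN-(\d{4})-(\d{2})$")


def truncation_limit_bytes(crawl_id: str) -> int:
    """Return the fetch size limit (in bytes) assumed for the given crawl."""
    match = _CRAWL_ID.match(crawl_id)
    if match is None:
        raise ValueError(f"unrecognized crawl ID: {crawl_id!r}")
    year, week = int(match.group(1)), int(match.group(2))
    # Tuple comparison orders crawls by year first, then by week number.
    if (year, week) >= FIRST_5MIB_CRAWL:
        return 5 * MIB
    return 1 * MIB


def may_be_truncated(crawl_id: str, payload_length: int) -> bool:
    """Heuristic: a payload at or above the limit was possibly cut off."""
    return payload_length >= truncation_limit_bytes(crawl_id)


if __name__ == "__main__":
    for cid in ("CC-MAIN-2024-51", "CC-MAIN-2025-13"):
        print(cid, truncation_limit_bytes(cid))
```

The size comparison is only a heuristic. The WARC format defines a `WARC-Truncated` record header, so checking for that header where it is present should be more reliable than inferring truncation from payload length alone.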