Five members of our research and engineering team (Laurie, Pedro, Sebastian, Thom, and Malte) attended the 2nd Conference on Language Modeling (COLM 2025) in Montréal. COLM is a relatively new conference series, now in its second edition, dedicated to language models and related topics. In contrast to other major AI and NLP conferences, it is still rather small, with approximately 1,500 participants (double the attendance of the first edition), and features a single track of talks and poster sessions.
The main conference program ran from October 7 to 9, followed by 16 diverse workshops on October 10. The Common Crawl team contributed invited talks and organized one of the workshops.
Contributions by the Common Crawl Team
Pedro, Laurie, and Thom from the Common Crawl team organized the 1st Workshop on Multilingual Data Quality Signals (WMDQS 🦆) together with colleagues from MLCommons, Johns Hopkins University, EleutherAI, and Factored. The workshop featured three keynotes by Julia Kreutzer (Cohere Labs), David Ifeoluwa Adelani (McGill University and Mila), and our own Sebastian, who gave an overview of Common Crawl and our latest projects.

The workshop accepted 15 research papers for publication. The paper "Enhancing Multilingual LLM Pre-training with Model-Based Data Selection" by Messmer et al. received the best paper award. Alongside the accepted papers, the workshop presented the results of its shared task on Improving Language Identification for Web Text.
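To make the problem setting concrete, here is a minimal example of off-the-shelf language identification for web text using fastText's pretrained lid.176 model. This is only an illustration; it is not the shared task's baseline, model, or evaluation setup.

```python
# Illustration only: off-the-shelf language identification for web text with
# fastText's pretrained lid.176 model. This is NOT the shared task's baseline,
# model, or evaluation setup.
import fasttext

# Download once from:
# https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
model = fasttext.load_model("lid.176.bin")

# fastText expects single-line input (no newlines).
labels, probs = model.predict("Die Konferenz fand in Montréal statt.", k=1)
print(labels[0], float(probs[0]))  # e.g. "__label__de" and its confidence
```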

Furthermore, Pedro gave an invited talk on “Expanding the Language and Cultural Coverage of Common Crawl” at the workshop on Multilingual and Equitable Language Technologies (MELT).
Submitted Works Featuring Common Crawl
Common Crawl was featured in various papers at COLM, especially in the context of LLM datasets. Below are research papers directly or indirectly relevant to our work:
- Can Performant LLMs Be Ethical? Quantifying the Impact of Web Crawling Opt-Outs (Fan et al., 2025): The paper studies how respecting web crawling opt-outs (robots.txt) affects LLM performance by introducing the Data Compliance Gap (DCG), which quantifies the performance difference between models trained on datasets that comply with web crawling opt-outs and those that do not (a small illustrative sketch of such a gap metric follows this list). Their experiments with 1.5B-parameter models show that, as of January 2025, compliance with web data opt-outs does not degrade general knowledge acquisition (close to 0% DCG). However, in specialized domains such as biomedical research, excluding major publishers leads to performance declines. These findings suggest that while general-purpose LLMs can be trained to perform equally well using fully open data, performance in specialized domains may benefit from access to high-quality copyrighted sources later in training.
- 2 OLMo 2 Furious (COLM’s Version) (Walsh et al., 2025): The paper introduces AllenAI’s OLMo 2, a family of fully open 7B, 13B, and 32B models that achieve competitive performance at lower computational cost while providing transparency through released training data, code, checkpoints, and more. The updated pretraining data mixture introduces a new, specialized data mix called Dolmino Mix 1124, which relies on Common Crawl via DCLM (Li et al., 2024) and other open data sources such as Wikipedia and StarCoder.
- FineWeb2: One Pipeline to Scale Them All — Adapting Pre-Training Data Processing to Every Language (Penedo et al., 2025): FineWeb2 is the multilingual continuation of FineWeb (Penedo et al., 2024), a data curation pipeline that filters Common Crawl data for LLM pretraining. The curation pipeline has five key steps: text extraction from WARC files, language identification, deduplication, heuristic-based filtering, and rehydration (selective upsampling based on duplicates); a toy sketch of these stages follows this list.
- Rethinking Multilingual Continual Pre-training: Data Mixing for Adapting LLMs Across Languages and Resources (Li et al., 2025): This study evaluates 36 continual pretraining configurations across three multilingual LLMs and 30+ languages, analyzing their effects across resource levels and language behavior categories. In the experiments, Common Crawl is used as the training data via MADLAD-400 (Kudugunta et al., 2024).
- MegaMath: Pushing the Limits of Open Math Corpora (Zhou et al., 2025): The paper introduces an open, high-quality mathematical corpus as an LLM training dataset. For web data, the authors use Common Crawl as their source: they extract text and MathML from Common Crawl’s WARC files using modified versions of Trafilatura and Resiliparse that support MathML extraction (code available on GitHub); a minimal WARC-extraction sketch follows this list. The web data is complemented by math-related code and synthetic data.
- EuroBERT: Scaling Multilingual Encoders for European Languages (Boizard et al., 2025): The paper presents a family of multilingual encoder models based on the BERT architecture for European and other widely spoken global languages. Common Crawl is used through FineWeb (Penedo et al., 2024) as training data for EuroBERT.
- Sherkala-Chat: Building a State-of-the-Art LLM for Kazakh in a Moderately Resourced Setting (Koto et al., 2025): Sherkala-Chat is a state-of-the-art instruction-tuned generative language model designed for Kazakh, aiming to enhance the inclusivity of LLM advancements for Kazakh speakers. Adapted from LLaMA-3.1-8B, the model is trained on 45.3B tokens across Kazakh, English, Russian, and Turkish, drawn from Common Crawl and other data sources.
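As a rough illustration of the gap metric mentioned for Fan et al. (2025) above, the sketch below computes a relative performance gap between a non-compliant and a compliant model. The function name, the example numbers, and the exact formula are our own illustrative assumptions, not necessarily the paper's definition.

```python
# Illustration only: one plausible way to express a "Data Compliance Gap" as
# the relative performance difference between a model trained without
# respecting opt-outs and one trained on compliant data. The exact definition
# used by Fan et al. (2025) may differ.
def data_compliance_gap(score_noncompliant: float, score_compliant: float) -> float:
    """Relative gap in percent; values near 0 mean compliance costs little."""
    return 100.0 * (score_noncompliant - score_compliant) / score_noncompliant

# Hypothetical benchmark accuracies: 41.2 vs. 40.9 -> a gap of roughly 0.7%
print(round(data_compliance_gap(41.2, 40.9), 2))
```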
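The five FineWeb2 stages listed above can be pictured as a small pipeline. The following toy sketch is self-contained but deliberately simplistic: every function body is a stand-in (the real pipeline uses trained language-identification models, MinHash-style deduplication, and a much richer set of heuristics), and the initial WARC text-extraction step is shown separately in the next sketch.

```python
# Toy sketch of the five FineWeb2 curation stages. All function bodies are
# simplistic stand-ins, not the actual FineWeb2 implementation.
from collections import Counter

def identify_language(text: str) -> str:
    # Stand-in: real pipelines use a trained language-ID classifier.
    return "en" if " the " in f" {text.lower()} " else "unknown"

def deduplicate(docs: list[str]) -> tuple[list[str], Counter]:
    # Stand-in: exact-match dedup instead of MinHash-style near-dedup.
    counts = Counter(docs)
    return list(counts.keys()), counts

def passes_heuristics(text: str) -> bool:
    # Stand-in for FineWeb-style quality heuristics.
    return len(text.split()) >= 5

def rehydrate(docs: list[str], counts: Counter, cap: int = 3) -> list[str]:
    # Selectively upsample documents that had many duplicates, up to a cap.
    return [copy for doc in docs for copy in [doc] * min(counts[doc], cap)]

def curate(raw_docs: list[str], target_language: str = "en") -> list[str]:
    # Step 1 (text extraction from WARC files) happens upstream; see next sketch.
    docs = [d for d in raw_docs if identify_language(d) == target_language]  # 2. language ID
    docs, counts = deduplicate(docs)                                         # 3. deduplication
    docs = [d for d in docs if passes_heuristics(d)]                         # 4. heuristic filtering
    return rehydrate(docs, counts)                                           # 5. rehydration
```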
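For MegaMath-style extraction from WARC files, here is a minimal sketch using warcio and stock trafilatura. Note that the paper relies on modified Trafilatura and Resiliparse builds that also preserve MathML; plain trafilatura, as used here, extracts main text only.

```python
# Minimal sketch: pull page text out of a Common Crawl WARC file using
# warcio and stock trafilatura. MegaMath uses modified Trafilatura/Resiliparse
# builds that also keep MathML; plain trafilatura extracts main text only.
import trafilatura
from warcio.archiveiterator import ArchiveIterator

def extract_from_warc(warc_path: str):
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            html = record.content_stream().read().decode("utf-8", errors="replace")
            text = trafilatura.extract(html)
            if text:
                yield url, text

# Example: iterate over any locally downloaded Common Crawl WARC segment.
# for url, text in extract_from_warc("CC-MAIN-...-00000.warc.gz"):
#     print(url, len(text))
```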
We had a great time at COLM 2025, strengthening our connections within the research community and exploring the latest work in the field. We look forward to attending more conferences in the near future.

Erratum: Content is truncated
Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to the March 2025 crawl (CC-MAIN-2025-13), the truncation threshold was 1 MiB; from that crawl onwards, the limit is 5 MiB.
For more details, see our truncation analysis notebook.
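If you want to check how much of a downloaded WARC file is affected, the small sketch below counts truncated response records with warcio, assuming truncated payloads carry the standard WARC-Truncated header written by the crawler when the fetch limit is hit.

```python
# Sketch: count truncated response records in a WARC file with warcio.
# Assumes truncated payloads carry the standard "WARC-Truncated" header
# (reason "length"), as written by the crawler when the fetch limit is hit.
from warcio.archiveiterator import ArchiveIterator

def count_truncated(warc_path: str) -> tuple[int, int]:
    truncated = total = 0
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            total += 1
            if record.rec_headers.get_header("WARC-Truncated"):
                truncated += 1
    return truncated, total
```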