Common Crawl Foundation at LREC 2026

The Common Crawl team attended the 16th International Conference on Language Resources and Evaluation in Palma, Mallorca, co-organizing a tutorial, presenting recently published work, and strengthening links with the research community.

Contributions by the Common Crawl Team

The Common Crawl team co-organized one tutorial and co-authored two papers featured in the main programme.

Tutorial: Low-Resource, High-Impact — Building Corpora for Inclusive Language Technologies. Laurie Burchell and Pedro Ortiz Suarez from Common Crawl co-organized this tutorial together with Ekaterina Artemova, Daryna Dementieva, Shu Okabe, and Mariya Shmatova. The tutorial took participants through end-to-end NLP pipelines for underrepresented languages — from data collection and web crawling, through parallel sentence mining and machine translation, to downstream applications like text classification and multimodal reasoning. The materials cover more than ten languages from a range of language families and geopolitical contexts, with an emphasis on fair, reproducible, and community-informed practice. Tutorial paper: arXiv:2512.14576.

HPLT 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Co-authored by Laurie Burchell with collaborators across the HPLT consortium, HPLT 3.0 is the latest release of one of the largest openly licensed multilingual datasets built on top of Common Crawl and Internet Archive data. The 3.0 release covers around 200 languages and reaches 30 trillion tokens, with a full pipeline from raw web archives through language identification, deduplication, and quality filtering to monolingual and parallel corpora ready for LLM and MT training. The HPLT corpora have already become a reference resource for multilingual model builders in Europe and beyond. Paper: arXiv:2511.01066.

SciLaD: A Large-Scale, Transparent, Reproducible Dataset for Natural Scientific Language Processing. Co-authored by Luca Foppiano, Pedro Ortiz Suarez, and Malte Ostendorff from Common Crawl together with colleagues at DFKI and partner institutions, SciLaD is a fully open dataset of scientific publications, comprising a curated English split of over 10 million papers and a multilingual TEI XML split covering more than 35 million publications. The construction pipeline relies entirely on open-source tooling (including Grobid for PDF processing and Datatrove for large-scale curation) and is released alongside the dataset to support reproducibility. The team also pre-trained a RoBERTa-base model on SciLaD and showed performance comparable to other scientific language models of similar size on standard benchmarks. Paper: arXiv:2512.11192; datasets, models, and code are also available.

Submitted Works featuring Common Crawl

Papers that use Common Crawl data, or methodologies very close to ours, appeared throughout the LREC 2026 programme. A selection of those directly or indirectly relevant to our work:

GPT-NL Public Corpus: A Permissively Licensed, Dutch-First Dataset for LLM Pre-training (van Oort et al., 2026): The largest permissively licensed Dutch corpus to date, with 36 billion Dutch tokens, plus curated English, code, and German/Danish slices, much of it sourced from Common Crawl and Common Corpus. A concrete example of a national-scale LLM dataset built on top of our data, released on the Hugging Face Hub under CC-BY.

CLASSLA-web 2.0: A 17-Billion-Word Crawled Web Corpus of Seven South Slavic Languages (Kuzman Pungeršek et al., 2026): A new, large-scale web corpus for Bosnian, Bulgarian, Croatian, Macedonian, Montenegrin, Serbian, and Slovenian, built via TLD-level crawling. The paper also documents the growing presence of machine-generated content on top web domains — a finding directly relevant to ongoing work on data quality signals.

Dynaword: From One-shot to Continuously Developed Datasets (Enevoldsen et al., 2026): A framework for community-maintained, openly licensed datasets, instantiated as Danish Dynaword. The release contains over four times the tokens of comparable Danish corpora, is exclusively openly licensed, and has received contributions from companies, universities, and government institutions — a model very much in the spirit of Common Crawl's own open approach.

CoMMA: A Large-scale Corpus of Multilingual Medieval Archives (Clérice et al., 2026): A 2.5-billion-token corpus drawn from more than 23,000 digitized medieval manuscripts in Latin and Old French, harvested via IIIF. A web-scale harvest applied to cultural heritage rather than the contemporary web, with applications from corpus linguistics to historical-language model pretraining.

TestiMole-Conversational: A 30-Billion-Word Italian Discussion Board Corpus (1996–2024) (Rinaldi et al., 2026): A large-scale Italian web corpus drawn from discussion boards, motivated explicitly by the relatively low share of Italian in Common Crawl. Featured at the 12th Workshop on Challenges in the Management of Large Corpora (CMLC-12).

ManufactuBERT: Web Data for a Domain LLM (Armingaud and Besancon, 2026): A domain-adapted language model for the manufacturing sector trained on web data (Common Crawl via FineWeb), with a reported 33% training-time reduction from careful deduplication. It is a clean example of how deduplication affects training efficiency.
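For readers less familiar with deduplication, here is a minimal sketch of the simplest variant: dropping exact duplicates after light normalization. This is an illustration only, not the method used in the ManufactuBERT paper (which this post does not detail); the function names `normalize` and `deduplicate` and the sample documents are hypothetical.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(documents):
    """Drop exact duplicates (after normalization), keeping first occurrences."""
    seen = set()
    unique = []
    for doc in documents:
        # Hash the normalized text so only a fixed-size digest is stored per document.
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


# Hypothetical sample documents, invented for illustration.
docs = [
    "CNC machines require regular calibration.",
    "cnc machines   require regular calibration.",
    "Injection moulding uses heated polymer.",
]
unique_docs = deduplicate(docs)
print(f"kept {len(unique_docs)} of {len(docs)} documents")
```

Web-scale pipelines typically go further and also remove near-duplicates, but even this exact-match pass shows why less redundant data means fewer tokens to train on.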

Next Steps

LREC 2026 was a great opportunity to reconnect with the language-resources community, see how Common Crawl data is being put to use across a remarkable diversity of languages and domains, and surface new collaborations. We were especially glad to see how much work in the field is now openly licensed, openly documented, and openly shared — a trend that benefits everyone.

We look forward to attending more conferences in the coming months. One of the next stops is ACL 2026 in San Diego, where we will be presenting CommonLID, our open benchmark for language identification on web text — come find us there!
