From 27 July to 1 August 2025, Laurie, Pedro, and Malte from Common Crawl's engineering team attended the 63rd Annual Meeting of the Association for Computational Linguistics (ACL) in Vienna, Austria. ACL is one of the biggest and most prestigious conferences in the field of natural language processing (NLP), with over 6,000 attendees and more than 3,000 accepted papers!
The programme featured keynote talks, oral presentations, poster sessions and social events, plus tutorials on the Sunday before the conference and two days of workshops directly afterwards. It was a great opportunity to learn about the latest work in NLP as well as to develop partnerships with the research community.

Research With and By Common Crawl
Many of the papers featured at ACL 2025 made use of Common Crawl’s data products, either indirectly through large language models (LLMs) trained on our data, or directly as a key part of their research. To give just one example, in "Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset" (Su et al., 2025), the authors use our data as the basis for a high-quality English LLM training dataset, leveraging smart data filtering and synthetic rephrasing to improve downstream task performance. We also received plenty of positive informal feedback: many attendees told us how valuable Common Crawl’s open data had been in making their research happen.
We were also very pleased to have three papers by Common Crawl team members presented at ACL!

"An Expanded Massive Multilingual Dataset for High-Performance Language Technologies (HPLT)" (Burchell et al., 2025): presenting HPLT v2, a large-scale collection of high-quality multilingual monolingual and parallel corpora derived from Common Crawl and Internet Archive data.

"mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus" (Futeral et al., 2025): introducing the mOSCAR dataset: the first large-scale multilingual and multimodal document corpus crawled from the web.

"Table Understanding and (Multimodal) LLMs: A Cross-Domain Case Study on Scientific vs. Non-Scientific Data" (Borisova et al., 2025): investigating the effectiveness of both text-based and multimodal LLMs on table understanding tasks through a cross-domain and cross-modality evaluation. This paper was the runner up for the best paper award at the Fourth Table Representation Learning Workshop! 🏆
Next Steps
We had a great time at ACL 2025, strengthening our connections within the NLP community and exploring the latest work in the field. We look forward to attending more conferences: come find us at the First Workshop on Data Quality Signals, which we’re co-organising at COLM 2025!

Erratum
Content is truncated
Some archived content is truncated due to fetch size limits imposed during crawling. These limits are necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to the March 2025 crawl (CC-MAIN-2025-13), the truncation threshold was 1 MiB; from that crawl onwards, the limit is 5 MiB.
For more details, see our truncation analysis notebook.
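For those working directly with our WARC files, truncation is recorded in the WARC record headers, so it can be detected programmatically. Below is a minimal sketch, assuming the open-source warcio library and a locally downloaded WARC file (the filename is a placeholder), that counts truncated response records and the reasons reported:

```python
# Minimal sketch: count truncated response records in a Common Crawl WARC file.
# Assumes the warcio library (pip install warcio); the filename below is a
# placeholder for any WARC file downloaded from Common Crawl.
from collections import Counter

from warcio.archiveiterator import ArchiveIterator

truncation_reasons = Counter()
total_responses = 0

with open("CC-MAIN-example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        total_responses += 1
        # The standard WARC-Truncated header records why a capture was cut
        # short; for size-limited fetches the reason is "length".
        reason = record.rec_headers.get_header("WARC-Truncated")
        if reason:
            truncation_reasons[reason] += 1

print(f"responses: {total_responses}")
print(f"truncated: {dict(truncation_reasons)}")
```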