From 27 July to 1 August 2025, Laurie, Pedro, and Malte from Common Crawl's engineering team attended the 63rd Annual Meeting of the Association for Computational Linguistics (ACL) in Vienna, Austria. ACL is one of the biggest and most prestigious conferences in the field of natural language processing (NLP), with over 6,000 attendees and more than 3,000 accepted papers!
The programme featured keynote talks, oral presentations, poster sessions and social events, plus tutorials on the Sunday before the conference and two days of workshops directly afterwards. It was a great opportunity to learn about the latest work in NLP as well as to develop partnerships with the research community.

Research With and By Common Crawl
Many of the papers featured at ACL 2025 made use of Common Crawl’s data products, either indirectly through large language models (LLMs) trained on our data, or directly as a key part of their research. To give just one example, in "Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset" (Su et al., 2025), the authors use our data as the basis for a high-quality English LLM training dataset, leveraging smart data filtering and synthetic rephrasing to improve downstream task performance. We also received plenty of positive informal feedback: many attendees told us how valuable Common Crawl’s open data had been in making their research happen.
We were also very pleased to have three papers by Common Crawl team members presented at ACL!

"An Expanded Massive Multilingual Dataset for High-Performance Language Technologies (HPLT)" (Burchell et al., 2025): presenting HPLT v2, a large-scale collection of high-quality multilingual monolingual and parallel corpora derived from Common Crawl and Internet Archive data.

"mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus" (Futeral et al., 2025): introducing the mOSCAR dataset: the first large-scale multilingual and multimodal document corpus crawled from the web.

"Table Understanding and (Multimodal) LLMs: A Cross-Domain Case Study on Scientific vs. Non-Scientific Data" (Borisova et al., 2025): investigating the effectiveness of both text-based and multimodal LLMs on table understanding tasks through a cross-domain and cross-modality evaluation. This paper was the runner up for the best paper award at the Fourth Table Representation Learning Workshop! 🏆
Next Steps
We had a great time at ACL 2025, strengthening our connections within the NLP community and exploring the latest work in the field. We look forward to attending more conferences: come find us at the First Workshop on Data Quality Signals, which we’re co-organising at COLM 2025!

Erratum
Content is truncated
Some archived content is truncated due to fetch size limits imposed during crawling. These limits are necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to the March 2025 crawl (CC-MAIN-2025-13), the truncation threshold was 1 MiB; from that crawl onwards, the limit is 5 MiB.
For more details, see our truncation analysis notebook.
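For those working directly with our WARC files, truncation is recorded in the WARC record headers, so it can be detected programmatically. Below is a minimal sketch, assuming the open-source warcio library and a locally downloaded WARC file (the filename is a placeholder), that counts truncated response records and the reasons reported:

```python
# Minimal sketch: count truncated response records in a Common Crawl WARC file.
# Assumes the warcio library (pip install warcio); the filename below is a
# placeholder for any WARC file downloaded from Common Crawl.
from collections import Counter

from warcio.archiveiterator import ArchiveIterator

truncation_reasons = Counter()
total_responses = 0

with open("CC-MAIN-example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        total_responses += 1
        # The standard WARC-Truncated header records why a capture was cut
        # short; for size-limited fetches the reason is "length".
        reason = record.rec_headers.get_header("WARC-Truncated")
        if reason:
            truncation_reasons[reason] += 1

print(f"responses: {total_responses}")
print(f"truncated: {dict(truncation_reasons)}")
```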