
Recent research has shown that large language models (LLMs) not only need large quantities of data, but also data of sufficient quality. Ensuring data quality is even more important in a multilingual setting, where the amount of acceptable training data in many languages is limited. Indeed, for many languages even the fundamental step of language identification remains a challenge, leading to unreliable language labels and thus noisy datasets for under-served languages.
In response to these challenges, we are excited to announce the first Workshop on Multilingual Data Quality Signals (WMDQS), which will be co-located with COLM 2025. Common Crawl will be hosting this workshop in collaboration with MLCommons, EleutherAI, and Johns Hopkins University's Center for Language and Speech Processing.

For the WMDQS workshop, we invite the submission of long and short research papers related to the quality of multilingual data. Although most previous work on data quality has targeted LLM development, we believe that research in this area can benefit many other research communities as well. We therefore encourage participation from a diverse range of disciplines, such as web search, web archiving, corpus linguistics, digital humanities, political science, and beyond.
WMDQS will also include a shared task on language identification for web text. We invite participants to submit novel systems that address current problems in this area, and we will provide a training set of annotated documents sourced from Common Crawl to aid development.
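As an illustration only (not the official shared-task baseline), a common off-the-shelf starting point for language identification is fastText's pre-trained lid.176 model. The sketch below assumes the model file has been downloaded from fasttext.cc and shows how a single document might be labeled; the shortcomings of such models on web text (short snippets, code-switching, under-served languages) are exactly the kinds of problems the shared task is meant to address.

```python
# Illustrative baseline only; not the official WMDQS shared-task system.
# Assumes the pre-trained fastText language-ID model (lid.176.bin) has been
# downloaded from https://fasttext.cc/docs/en/language-identification.html
import fasttext

model = fasttext.load_model("lid.176.bin")

def identify_language(text: str, k: int = 3):
    """Return the top-k predicted language labels with their probabilities."""
    # fastText expects a single line of input, so strip newlines first.
    labels, probs = model.predict(text.replace("\n", " "), k=k)
    return [(label.replace("__label__", ""), float(p))
            for label, p in zip(labels, probs)]

print(identify_language("Dies ist ein kurzer Beispielsatz."))
# e.g. [('de', 0.99), ...]
```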
The deadline for submissions is June 23, 2025 (AoE), and the workshop itself will take place on October 10, 2025. To learn more, please take a look at our Call for Papers!


Erratum: Content is truncated
Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, this limit has been increased to 5 MiB.
For more details, see our truncation analysis notebook.
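For readers who want to check whether a particular capture was affected, truncated WARC records carry a WARC-Truncated header indicating the reason. Below is a minimal sketch using the warcio library (the filename is a placeholder) that lists truncated responses in a WARC file; it is only an illustration, not the analysis from the notebook above.

```python
# Minimal sketch: list truncated responses in a Common Crawl WARC file.
# Assumes the warcio library is installed; the filename below is a placeholder.
from warcio.archiveiterator import ArchiveIterator

with open("CC-MAIN-example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        # Truncated captures carry a WARC-Truncated header (e.g. "length").
        reason = record.rec_headers.get_header("WARC-Truncated")
        if reason:
            url = record.rec_headers.get_header("WARC-Target-URI")
            print(f"{url}\ttruncated ({reason})")
```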