
Recent research has shown that large language models (LLMs) not only need large quantities of data, but also data of sufficient quality. Ensuring data quality is even more important in a multilingual setting, where the amount of acceptable training data in many languages is limited. Indeed, for many languages even the fundamental step of language identification remains a challenge, leading to unreliable language labels and thus noisy datasets for under-served languages.
In response to these challenges, we are excited to announce the first Workshop on Multilingual Data Quality Signals (WMDQS), which will be co-located with COLM 2025. Common Crawl will be hosting this workshop in collaboration with MLCommons, EleutherAI, and Johns Hopkins University's Center for Language and Speech Processing.

For the WMDQS workshop, we invite the submission of long and short research papers related to the quality of multilingual data. Although most previous work on data quality has targeted LLM development, we believe that research in this area can benefit many other research communities as well. We therefore encourage participation from a diverse range of disciplines, such as web search, web archiving, corpus linguistics, digital humanities, political science, and beyond.
WMDQS will also include a shared task on language identification for web text. We invite participants to submit novel systems that address current problems in this area, and we will provide a training set of annotated documents sourced from Common Crawl to aid development.
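As an illustration only (not the official shared-task baseline), a common off-the-shelf starting point for language identification is fastText's pre-trained lid.176 model. The sketch below assumes the model file has been downloaded from fasttext.cc and shows how a single document might be labeled; the shortcomings of such models on web text (short snippets, code-switching, under-served languages) are exactly the kinds of problems the shared task is meant to address.

```python
# Illustrative baseline only; not the official WMDQS shared-task system.
# Assumes the pre-trained fastText language-ID model (lid.176.bin) has been
# downloaded from https://fasttext.cc/docs/en/language-identification.html
import fasttext

model = fasttext.load_model("lid.176.bin")

def identify_language(text: str, k: int = 3):
    """Return the top-k predicted language labels with their probabilities."""
    # fastText expects a single line of input, so strip newlines first.
    labels, probs = model.predict(text.replace("\n", " "), k=k)
    return [(label.replace("__label__", ""), float(p))
            for label, p in zip(labels, probs)]

print(identify_language("Dies ist ein kurzer Beispielsatz."))
# e.g. [('de', 0.99), ...]
```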
The deadline for submissions is June 23, 2025 (AoE), and the workshop itself will take place on October 10, 2025. To learn more, please take a look at our Call for Papers!


Erratum: Content is truncated
Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, this limit has been increased to 5 MiB.
For more details, see our truncation analysis notebook.
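For readers who want to check whether a particular capture was affected, truncated WARC records carry a WARC-Truncated header indicating the reason. Below is a minimal sketch using the warcio library (the filename is a placeholder) that lists truncated responses in a WARC file; it is only an illustration, not the analysis from the notebook above.

```python
# Minimal sketch: list truncated responses in a Common Crawl WARC file.
# Assumes the warcio library is installed; the filename below is a placeholder.
from warcio.archiveiterator import ArchiveIterator

with open("CC-MAIN-example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "response":
            continue
        # Truncated captures carry a WARC-Truncated header (e.g. "length").
        reason = record.rec_headers.get_header("WARC-Truncated")
        if reason:
            url = record.rec_headers.get_header("WARC-Target-URI")
            print(f"{url}\ttruncated ({reason})")
```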