July 21, 2025

WMDQS Shared Task on Language Identification

Note: this post has been marked as obsolete.
The Common Crawl Foundation, MLCommons, EleutherAI, and Johns Hopkins' Center for Language and Speech Processing have the pleasure of inviting you to register for the 1st shared task on Language Identification for web data.
Pedro Ortiz Suarez
Pedro is a French-Colombian mathematician, computer scientist, and researcher. He holds a PhD in computer science and Natural Language Processing from Sorbonne Université.

As part of the First Workshop on Multilingual Data Quality Signals (WMDQS), the Common Crawl Foundation, MLCommons, EleutherAI, and Johns Hopkins' Center for Language and Speech Processing are hosting the first shared task on Language Identification (LangID) for web data.

For this shared task, participants are expected to submit LangID systems that work well on a wide variety of languages and on web data. We encourage participants to employ a range of approaches, including the development of new architectures and the curation of novel high-quality annotated datasets. To register, please submit a one-page document with a title, a list of authors, a list of provisional languages that you want to focus on, and a brief description of your approach. This document should be sent to wmdqs-pcs@googlegroups.com with the email subject "[Shared Task Abstract Submission]: Submission Title" before 11:59 PM on the 23rd of July (Anywhere on Earth).

With this shared task, our organizations aim to develop a new LangID system that is more robust, has better language coverage (especially for underrepresented languages), and is fast and efficient enough to annotate large data collections.
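To give a sense of the kind of system the task calls for, here is a minimal, purely illustrative character n-gram LangID sketch in Python. The language profiles, example sentences, and function names below are invented for this example; a real submission would train on large annotated multilingual corpora and cover far more languages:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Count overlapping character n-grams, with padding spaces at the edges."""
    text = f" {text.lower()} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a if g in b)
    norm = (sum(v * v for v in a.values()) ** 0.5) * \
           (sum(v * v for v in b.values()) ** 0.5)
    return dot / norm if norm else 0.0

# Tiny toy "training" profiles -- hypothetical stand-ins for profiles
# a real system would build from large annotated web corpora.
PROFILES = {
    "en": char_ngrams("the quick brown fox jumps over the lazy dog and the cat"),
    "es": char_ngrams("el rápido zorro marrón salta sobre el perro perezoso y el gato"),
}

def identify(text):
    """Return the language whose n-gram profile best matches the input text."""
    grams = char_ngrams(text)
    return max(PROFILES, key=lambda lang: cosine(grams, PROFILES[lang]))
```

This toy approach already hints at the trade-offs the task targets: n-gram profiles are fast enough to annotate web-scale collections, but robustness on short, noisy, or underrepresented-language text is exactly where better architectures and better training data are needed.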

The Common Crawl Foundation in particular expects to use and maintain the results of this shared task in order to improve the language and cultural coverage of our dataset. We also hope to maintain a new open-source LangID solution in the long term.

