June 16, 2026

CommonLID Update: New Tools, Growing Impact

CommonLID, a community-built language ID benchmark, has a new website and interactive leaderboard. Its paper was accepted to ACL 2026, with a poster session on 7 July. Source code, a PyPI package, and the dataset are now available.

Laurie Burchell

Laurie is a Principal Research Engineer at the Common Crawl Foundation.

Since our previous post announcing CommonLID, a new community-built language identification benchmark, the project has continued to grow. We’ve been focusing on two areas: improving usability and spreading the word.

A screenshot of the CommonLID website

On the usability side, we now have a dedicated website for CommonLID: commonlid.org. It gathers all the information and resources related to the project, and it’s where we’ll post any future updates. We’ve also made it easier to explore the state of the art for language identification through an interactive leaderboard, which lets you compare performance on CommonLID by model and by language on a variety of metrics, all in the browser. If you’re looking to run your own evaluations, we’ve provided the source code and a PyPI package, which replicates the analysis in the paper and can be extended to other models and datasets.

A screenshot of the CommonLID leaderboard featuring 9 different LID systems
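
To make "by language" comparison concrete, here is a minimal sketch of how per-language scores could be computed for a single LID system from gold and predicted labels. It is not the CommonLID package's API: the function names `per_language_f1` and `macro_f1` are hypothetical, the language codes are made-up sample data, and F1 is assumed here only as one plausible example of a metric.

```python
from collections import Counter


def per_language_f1(gold, predicted):
    """Return {language: (precision, recall, f1)} for two paired label lists."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    true_pos, pred_count, gold_count = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        gold_count[g] += 1
        pred_count[p] += 1
        if g == p:
            true_pos[g] += 1

    scores = {}
    # Cover every language seen in either the gold or the predicted labels.
    for lang in sorted(set(gold_count) | set(pred_count)):
        tp = true_pos[lang]
        precision = tp / pred_count[lang] if pred_count[lang] else 0.0
        recall = tp / gold_count[lang] if gold_count[lang] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[lang] = (precision, recall, f1)
    return scores


def macro_f1(scores):
    """Unweighted mean of per-language F1, so rare languages count equally."""
    if not scores:
        return 0.0
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


# Made-up sample data: four lines of text with gold and predicted language codes.
gold = ["eng", "eng", "fra", "deu"]
predicted = ["eng", "fra", "fra", "deu"]

scores = per_language_f1(gold, predicted)
for lang, (p, r, f1) in scores.items():
    print(f"{lang}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"macro F1: {macro_f1(scores):.2f}")
```

A macro average is used in this sketch because it weights every language equally, which matters when a benchmark covers many languages with very different amounts of data. For real evaluations, the published source code and PyPI package are the place to start.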

As for spreading the word about CommonLID, in April we found out that the paper had been accepted to the main conference at ACL 2026 in San Diego! ACL is one of the top conferences in natural language processing, and we’re looking forward to connecting with the community there. We’ll be presenting our work on the 7th of July in poster session G. Beyond the paper, Common Crawl team members have also given multiple talks featuring CommonLID. These include presentations for EleutherAI, Cohere Labs (on YouTube), the Mozilla Data Collective and others.

For more information, check out the CommonLID paper and dataset, now available both on Hugging Face and through the Mozilla Data Collective. We want to keep improving this resource, so if you’d like to contribute, please raise an issue on the Hugging Face repo or get in touch via Discord.
