December 11, 2024

Expanding the Language and Cultural Coverage of Common Crawl

We aim to enhance linguistic diversity in our dataset by inviting community contributions of non-English URLs and collaborating with MLCommons on a Language Identification campaign.

Pedro Ortiz Suarez

Pedro is a Principal Research Scientist at the Common Crawl Foundation.

At Common Crawl, our mission has always been to make Open Web Data easily accessible to our users, so that they can benefit from high-quality crawl data that was previously available only to large search engine corporations. However, our own statistics show that our data has always been biased towards English content, which makes our dataset difficult to use for individuals and organizations from smaller linguistic communities.

We have always wanted to make Common Crawl as representative as possible of the Open Web, so in recent months we have been working on projects that we hope will allow us to expand the language and cultural coverage of our crawls, making them more representative of the actual linguistic and cultural diversity found on the web.

These projects will require input from the community: our team is small and speaks only a handful of languages, and we believe that languages, and the content written in them, ultimately belong to their respective linguistic communities.

The first initiative that we’re introducing today is the Web Languages Project. With this, we are asking speakers of Languages Other Than English (LOTE) to contribute URLs of websites that they know and that contain content written in their language. We will then inject these URLs into our seed crawl, which we hope will allow us to discover more web content written in these languages. We will of course respect Robots Exclusion Protocol directives, ensuring that all the new linguistic content we discover is crawled as politely as we have always crawled. If you want to contribute to this project, please visit our GitHub Repository for more instructions.
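For contributors who want to check beforehand whether a site's robots.txt would permit crawling of a URL they plan to submit, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not Common Crawl's crawler code. It assumes the crawler identifies itself with the user-agent token `CCBot`, and the sample robots.txt, the `example.org` URLs, and the helper name `allowed_for_ccbot` are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: CCBot may fetch everything except /private/,
# while all other crawlers are disallowed entirely.
SAMPLE_ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /private/

User-agent: *
Disallow: /
"""


def allowed_for_ccbot(robots_txt: str, url: str) -> bool:
    """Return True if the given robots.txt text permits CCBot to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("CCBot", url)


if __name__ == "__main__":
    for candidate in ("https://example.org/news/", "https://example.org/private/notes"):
        print(candidate, allowed_for_ccbot(SAMPLE_ROBOTS_TXT, candidate))
```

In practice the robots.txt text would be downloaded from the site in question. The sample is inlined here only to keep the sketch self-contained.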

The second initiative that we’re introducing is an annotation campaign for Language Identification (LID or LangID) that we will conduct in collaboration with MLCommons. In this annotation campaign, we will ask participants to do simple LangID annotations on Common Crawl data. We would like to get as many annotations as possible and cover as many languages as possible, in order to create the first web-based LangID dataset. Our ultimate goal with this project is to train a small language classifier that would help us make better decisions at crawl time, ensuring that we crawl data for as many languages as possible, so that our dataset will hopefully better reflect the vast cultural and linguistic diversity of the web. If you want to contribute and participate in our annotation campaign, please visit MLCommons’ Dynabench Platform, where you can already start annotating data today.

Figure: Interface in Dynabench, showing the highlighting of multilingual text.

Finally, if you want to join the conversation about this project, please join our Discord. There you will be able to share your feedback and engage with other contributors and community members.

Erratum:

Content is truncated

Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, this limit has been increased to 5 MiB.
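As a rough illustration of the two thresholds described above, the following sketch maps a crawl ID to the fetch size limit that applied to it. It assumes crawl IDs of the form `CC-MAIN-YYYY-WW`, as in the `CC-MAIN-2025-13` example. The function name `truncation_limit_bytes` is hypothetical and not part of any Common Crawl tooling.

```python
import re

MIB = 1024 * 1024  # one mebibyte in bytes


def truncation_limit_bytes(crawl_id: str) -> int:
    """Return the fetch size limit in bytes for a crawl ID like 'CC-MAIN-2025-13'."""
    match = re.fullmatch(r"CC-MAIN-(\d{4})-(\d{2})", crawl_id)
    if match is None:
        raise ValueError(f"unrecognised crawl ID: {crawl_id!r}")
    year, week = int(match.group(1)), int(match.group(2))
    # Crawls from CC-MAIN-2025-13 onwards use the 5 MiB limit; earlier ones use 1 MiB.
    return 5 * MIB if (year, week) >= (2025, 13) else 1 * MIB


if __name__ == "__main__":
    for crawl in ("CC-MAIN-2024-51", "CC-MAIN-2025-13"):
        print(crawl, truncation_limit_bytes(crawl))
```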
