December 11, 2024

Expanding the Language and Cultural Coverage of Common Crawl

Note: this post has been marked as obsolete.
We aim to enhance linguistic diversity in our dataset by inviting community contributions of non-English URLs and collaborating with MLCommons on a Language Identification campaign.
Pedro Ortiz Suarez
Pedro is a French-Colombian mathematician, computer scientist and researcher. He holds a PhD in computer science and Natural Language Processing from Sorbonne Université.

At Common Crawl, our mission has always been to make Open Web Data easily accessible to our users, so that they can benefit from high-quality crawl data that was previously available only to large search engine corporations. However, our own statistics show that our data has always been biased towards English content, which makes our dataset difficult to use for individuals and organizations from smaller linguistic communities.

We have always wanted Common Crawl to be as representative of the Open Web as possible, so in recent months we have been working on projects that we hope will expand the language and cultural coverage of our crawls and better reflect the actual linguistic and cultural diversity found on the web.

These projects will require input from the community, both because our team is small and speaks only a handful of languages, and because we believe that languages, and the content written in them, ultimately belong to their respective linguistic communities.

The first initiative that we’re introducing today is the Web Languages Project. With this, we are asking speakers of Languages Other Than English (LOTE) to contribute URLs of websites that they know and that contain content written in their language. We will then inject these URLs into our seed crawl, which we hope will allow us to discover more web content written in these languages. We will of course respect Robots Exclusion Protocol directives, so that all the new linguistic content we discover is crawled as politely as we have always crawled. If you want to contribute to this project, please visit our GitHub Repository for more instructions.
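To make the idea of respecting Robots Exclusion Protocol directives concrete, here is a minimal sketch in Python of how a list of contributed URLs could be checked against a site's robots.txt rules before being fetched. It is only an illustration and not Common Crawl's actual pipeline: the `filter_allowed` helper, the `ExampleBot` user agent, the example URLs and the robots.txt content are all hypothetical.

```python
from urllib.robotparser import RobotFileParser


def filter_allowed(urls, robots_txt, user_agent="ExampleBot"):
    """Return only the URLs that the given robots.txt text allows.

    `user_agent` is a hypothetical crawler name used for illustration.
    """
    parser = RobotFileParser()
    # Parse the rules from a string so the sketch runs without network access.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Hypothetical robots.txt content for an imaginary site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# Hypothetical contributed URLs.
candidates = [
    "https://example.org/noticias/",
    "https://example.org/private/borrador",
]

allowed = filter_allowed(candidates, ROBOTS_TXT)
print(allowed)
```

In a real crawler the robots.txt file would be fetched from each host and cached, and crawl delays and rate limits would also be taken into account; the sketch only shows the allow/disallow check.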

The second initiative that we’re introducing is an annotation campaign for Language Identification (LID or LangID) that we will conduct in collaboration with MLCommons. In this campaign we will ask participants to make simple LangID annotations on Common Crawl data. We would like to collect as many annotations as possible and cover as many languages as possible, in order to create the first web-based LangID dataset. Our ultimate goal with this project is to train a small language classifier that would help us make better decisions at crawl time, ensuring that we crawl data for as many languages as possible, so that our dataset will hopefully better reflect the vast cultural and linguistic diversity of the web. If you want to contribute and participate in our annotation campaign, please visit MLCommons’ Dynabench Platform, where you can already start annotating data today.

Interface in Dynabench, showing highlighting of multilingual text

Finally, if you want to join the conversation about this project, please join our Discord. There you will be able to share your feedback and engage with other contributors and community members.
