Common Crawl - Blog - Web Languages Needing Review by Native Speakers

Since October of 2024, we’ve been gathering URLs in languages other than English (or “LOTE” for short) and adding them to our “seed crawl”, with the aim of improving coverage of languages, communities, and cultures in our crawls. We’re doing this via our Web Languages Project (introduced in a blog post in December of last year). So far we’ve had 266 contributions from 67 people, thanks to whom we’ve added over 4,700 LOTE URLs to our seed list.

Since August of 2018 we have used the Compact Language Detector 2 (CLD2) to annotate the language(s) in which a page is written. It can identify 160 different languages (up to three per document), and the annotations use ISO 639-3 language codes.
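As a rough illustration of how such an annotation might be consumed, the sketch below parses a comma-separated string of ISO 639-3 codes and keeps at most three, matching the per-document limit described above. The helper name `parse_languages` and the comma-separated input format are assumptions made for this example, not a description of how the crawler stores its data.

```python
MAX_LANGUAGES = 3  # CLD2 reports up to three languages per document


def parse_languages(annotation: str) -> list[str]:
    """Split a language annotation into a list of ISO 639-3 codes.

    Hypothetical helper: the comma-separated input (e.g. "jpn,eng")
    is an assumed format used only for illustration.
    """
    codes = []
    for raw in annotation.split(","):
        code = raw.strip().lower()
        # Keep only well-formed three-letter codes, skipping blanks and repeats
        if len(code) == 3 and code.isalpha() and code not in codes:
            codes.append(code)
    return codes[:MAX_LANGUAGES]
```

Calling `parse_languages("jpn,eng")` gives a list with the codes for Japanese and English, in the order they were listed.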

There are currently 42 files in the Web Languages repository which need review by a native speaker (we’re counting Latin here although, lamentably, there are no native speakers of Latin left). Of these, seven are for languages which CLD2 is not capable of recognising.

Language contributions which need review by a native speaker

Language	ISO 639-3 Code	Recognised by CLD2?	Coverage in CC-MAIN-2025-38 (%)	Link to contribute
Achinese	`ace`	no	n/a	living/achinese.md
Albanian	`sqi`	yes	0.0474	living/albanian.md
Basque	`eus`	yes	0.0306	living/basque.md
Bosnian	`bos`	yes	0.0557	living/bosnian.md
Buginese	`bug`	no	n/a	living/buginese.md
Catalan	`cat`	yes	0.1741	living/catalan.md
Chokwe	`cjk`	no	n/a	living/chokwe.md
Cornish	`cor`	no	n/a	living/cornish.md
Croatian	`hrv`	yes	0.2081	living/croatian.md
Danish	`dan`	yes	0.4432	living/danish.md
Estonian	`est`	yes	0.1170	living/estonian.md
Faroese	`fao`	yes	0.0045	living/faroese.md
Galician	`glg`	yes	0.0279	living/galician.md
Icelandic	`isl`	yes	0.0415	living/icelandic.md
Irish	`gle`	yes	0.0069	living/irish.md
Japanese	`jpn`	yes	5.2018	living/japanese.md
Kalaallisut	`kal`	yes	0.0009	living/kalaallisut.md
Korean	`kor`	yes	0.7754	living/korean.md
Latin	`lat`	yes	0.0983	historical/latin.md
Lithuanian	`lit`	yes	0.1601	living/lithuanian.md
Luxembourgish	`ltz`	yes	0.0040	living/luxembourgish.md
Macedonian	`mkd`	yes	0.0375	living/macedonian.md
Maltese	`mlt`	yes	0.0036	living/maltese.md
Mandarin Chinese	`cmn`	no	n/a	living/mandarin_chinese.md
Maori	`mri`	yes	0.0014	living/maori.md
Norwegian	`nor`	yes	0.3213	living/norwegian.md
Panjabi	`pan`	yes	0.0074	living/panjabi.md
Polish	`pol`	yes	1.6602	living/polish.md
Portuguese	`por`	yes	2.0696	living/portuguese.md
Romansh	`roh`	yes	0.0011	living/romansh.md
Russian	`rus`	yes	6.1083	living/russian.md
Sardinian	`srd`	no	n/a	living/sardinian.md
Scottish Gaelic	`gla`	yes	0.0014	living/scottish_gaelic.md
Serbian	`srp`	yes	0.2053	living/serbian.md
Slovak	`slk`	yes	0.3853	living/slovak.md
Slovenian	`slv`	yes	0.1264	living/slovenian.md
Thai	`tha`	yes	0.3842	living/thai.md
Uighur	`uig`	yes	0.0012	living/uighur.md
Walloon	`wln`	no	n/a	living/walloon.md
Welsh	`cym`	yes	0.0094	living/welsh.md
Western Frisian	`fry`	yes	0.0032	living/western_frisian.md
Yiddish	`yid`	yes	0.0019	living/yiddish.md
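For readers who want to work with the table programmatically, here is a minimal sketch that picks out the languages CLD2 cannot recognise and the measured languages with the lowest coverage. Only a handful of rows are copied from the table; the tuple layout and the function names `unrecognised` and `lowest_coverage` are assumptions made for this example.

```python
# A few rows copied from the table above:
# (language, ISO 639-3 code, recognised by CLD2?, coverage in CC-MAIN-2025-38)
ROWS = [
    ("Achinese", "ace", False, None),
    ("Albanian", "sqi", True, 0.0474),
    ("Cornish", "cor", False, None),
    ("Japanese", "jpn", True, 5.2018),
    ("Kalaallisut", "kal", True, 0.0009),
    ("Walloon", "wln", False, None),
]


def unrecognised(rows):
    """Return the names of languages that CLD2 does not recognise."""
    return [name for name, _code, recognised, _cov in rows if not recognised]


def lowest_coverage(rows, n):
    """Return the n rows with the smallest coverage, ignoring n/a entries."""
    measured = [row for row in rows if row[3] is not None]
    return sorted(measured, key=lambda row: row[3])[:n]


print(unrecognised(ROWS))
print(lowest_coverage(ROWS, 2))
```

Languages without a coverage figure are left out of the ranking rather than treated as zero, since "n/a" here means CLD2 cannot measure them, not that no pages exist.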

Among all of our contributors, we would like to thank Ethan Wenokur, Evan Pacini, Twan Goosen, and Swapnil Tripathi in particular. We’re very grateful for their substantial contributions to the Web Languages Project.
