Since October of 2024, we’ve been gathering URLs in languages other than English (or “LOTE” for short) and adding them to our “seed crawl”, with the aim of improving coverage of languages, communities, and cultures in our crawls. We’re doing this via our Web Languages Project (introduced in this blog post in December of last year). So far we’ve had 266 contributions from 67 people, thanks to whom we’ve added over 4,700 LOTE URLs to our seed list.
Since August of 2018, we have used the Compact Language Detector 2 (CLD2) to annotate the language(s) in which a page is written. It can identify 160 different languages (up to 3 languages per document) and reports each one using its ISO 639-3 language code.
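To make the annotation concrete, here is a minimal sketch of how such a value might be read. It assumes the detected languages arrive as a comma-separated string of ISO 639-3 codes (for example `eng,fra`); that storage format and the helper name `parse_languages` are assumptions for illustration, not something stated above.

```python
# Hypothetical helper: the comma-separated input format is an assumption.
MAX_LANGUAGES = 3  # CLD2 reports up to 3 languages per document


def parse_languages(annotation):
    """Split a language annotation into a list of ISO 639-3 codes."""
    if not annotation:
        return []  # no language was detected for this page
    codes = [part.strip().lower() for part in annotation.split(",") if part.strip()]
    for code in codes:
        # ISO 639-3 codes are exactly three ASCII letters
        if len(code) != 3 or not (code.isascii() and code.isalpha()):
            raise ValueError(f"not an ISO 639-3 code: {code!r}")
    return codes[:MAX_LANGUAGES]
```

For instance, `parse_languages("eng, zho")` yields the two codes `eng` and `zho`, while an empty or missing annotation yields an empty list.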
So far, there are 42 files in the Web Languages repository which need review by a native speaker (we’re counting Latin here, although lamentably there are no native speakers of Latin left). Among these are seven languages which CLD2 is not capable of recognising.
Language contributions which need a review by a native speaker
Out of all of the contributors, we would like to thank Ethan Wenokur, Evan Pacini, Twan Goosen, and Swapnil Tripathi in particular. We’re very grateful to these people for their substantial contributions to the Web Languages project.

Erratum:
Content is truncated
Some archived content is truncated due to fetch size limits imposed during crawling. This is necessary to handle infinite or exceptionally large data streams (e.g., radio streams). Prior to March 2025 (CC-MAIN-2025-13), the truncation threshold was 1 MiB. From the March 2025 crawl onwards, this limit has been increased to 5 MiB.
For more details, see our truncation analysis notebook.
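As a rough illustration of the thresholds described above, the sketch below maps a crawl ID to its fetch size limit. It assumes crawl IDs follow the `CC-MAIN-YYYY-WW` pattern seen in `CC-MAIN-2025-13`; the function names are hypothetical, and the length check is only a heuristic, not a definitive way to detect truncation.

```python
import re

MIB = 1024 * 1024

# Assumed ID pattern, inferred from "CC-MAIN-2025-13" in the text above.
_CRAWL_ID = re.compile(r"^CC-MAIN-(\d{4})-(\d{2})$")


def truncation_limit_bytes(crawl_id):
    """Return the fetch size limit (in bytes) that applied to a crawl."""
    match = _CRAWL_ID.match(crawl_id)
    if match is None:
        raise ValueError(f"unrecognised crawl ID: {crawl_id!r}")
    year, week = int(match.group(1)), int(match.group(2))
    # 5 MiB from CC-MAIN-2025-13 onwards, 1 MiB before that
    return 5 * MIB if (year, week) >= (2025, 13) else 1 * MIB


def may_be_truncated(payload_length, crawl_id):
    """Heuristic: a payload at or above the limit was possibly cut off."""
    return payload_length >= truncation_limit_bytes(crawl_id)
```

A payload of exactly 1 MiB from `CC-MAIN-2024-51` would be flagged by this check, whereas the same payload from `CC-MAIN-2025-13` would not, since the limit there is 5 MiB.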