Common Crawl maintains a free, open repository of web crawl data that can be used by anyone.

Common Crawl is a 501(c)(3) non-profit founded in 2007.

We make wholesale extraction, transformation and analysis of open web data accessible to researchers.

Over 300 billion pages spanning 15 years.

Free and open corpus since 2007.

Cited in over 10,000 research papers.

3–5 billion new pages added each month.

Featured Papers

A new benchmark for web language identification

Pedro Ortiz Suarez, Laurie Burchell, Catherine Arnett, et al.

CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data

Geolocating and embedding 50M German news articles for semantic analysis

Lukas Kriesch, Sebastian Losacker

A geolocated dataset of German news articles

A study on web crawlers facing inconsistent and poorly-signalled blocking

Mostafa Ansar, Anna Sperotto, Ralph Holz

Web Crawl Refusals: Insights From Common Crawl

Research on Free Expression Online

Jeffrey Knockel, Jakub Dalek, Noura Aljizawi, Mohamed Ahmed, Levi Meletti, and Justin Lau

Banned Books: Analysis of Censorship on Amazon.com

The Dangers of Hijacked Hyperlinks

Kevin Saric, Felix Savins, Gowri Sankar Ramachandran, Raja Jurdak, Surya Nepal

Hyperlink Hijacking: Exploiting Erroneous URL Links to Phantom Domains

Enhancing Computational Analysis

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y.K. Li, Y. Wu, Daya Guo

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Computation and Language

Asier Gutiérrez-Fandiño, David Pérez-Fernández, Jordi Armengol-Estapé, David Griol, Zoraida Callejas

esCorpius: A Massive Spanish Crawling Corpus

The Web as a Graph (Master's Thesis)

Marius Løvold Jørgensen, UiT Norges Arktiske Universitet

BacklinkDB: A Purpose-Built Backlink Database Management System

More on Google Scholar

Curated BibTeX Dataset

Latest Blog Post

The Columnar Index Is Now the URL Index

We have renamed the Columnar Index to the URL Index, to be clearer about its purpose and to pave the way for more datasets in a columnar format.

Common Crawl builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone.

Common Crawl Foundation
