Common Crawl - Open Repository of Web Crawl Data

Latest Blog Post

Host- and Domain-Level Web Graphs May, June, and July 2026

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of May, June, and July 2026, consisting of 240.4 million nodes and 3.7 billion edges at the host level, and 118.0 million nodes and 2.8 billion edges at the domain level.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Common Crawl maintains a free, open repository of web crawl data that can be used by anyone.

Common Crawl is a 501(c)(3) non-profit founded in 2007. We make wholesale extraction, transformation, and analysis of open web data accessible to researchers.

Over 300 billion pages spanning 15 years.

Free and open corpus since 2007.

Cited in over 10,000 research papers.

3–5 billion new pages added each month.

Featured Papers

Julio Garbers, Terry Gregory

The Diffusion of Artificial Intelligence Across Firms: Evidence from Europe

Pedro Ortiz Suarez, Laurie Burchell, Catherine Arnett, et al.

CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data

Lukas Kriesch, Sebastian Losacker

A geolocated dataset of German news articles

Mostafa Ansar, Anna Sperotto, Ralph Holz

Web Crawl Refusals: Insights From Common Crawl

Jeffrey Knockel, Jakub Dalek, Noura Aljizawi, Mohamed Ahmed, Levi Meletti, and Justin Lau

Banned Books: Analysis of Censorship on Amazon.com

Kevin Saric, Felix Savins, Gowri Sankar Ramachandran, Raja Jurdak, Surya Nepal

Hyperlink Hijacking: Exploiting Erroneous URL Links to Phantom Domains

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y.K. Li, Y. Wu, Daya Guo

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Asier Gutiérrez-Fandiño, David Pérez-Fernández, Jordi Armengol-Estapé, David Griol, Zoraida Callejas

esCorpius: A Massive Spanish Crawling Corpus
