Common Crawl maintains a free, open repository of web crawl data that can be used by anyone.

Common Crawl is a 501(c)(3) non-profit founded in 2007. We make wholesale extraction, transformation, and analysis of open web data accessible to researchers.

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y.K. Li, Y. Wu, Daya Guo

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Computation and Language

Asier Gutiérrez-Fandiño, David Pérez-Fernández, Jordi Armengol-Estapé, David Griol, Zoraida Callejas

esCorpius: A Massive Spanish Crawling Corpus

The Web as a Graph (Master's Thesis)

Marius Løvold Jørgensen, UiT Norges Arktiske Universitet

BacklinkDB: A Purpose-Built Backlink Database Management System

Internet Security: Phishing Websites

Asadullah Safi, Satwinder Singh

A Systematic Literature Review on Phishing Website Detection Techniques


Latest Blog Post:

Web Graphs

Host- and Domain-Level Web Graphs February, March, and April 2025

We are pleased to announce a new release of host-level and domain-level Web Graphs based on the crawls of February, March, and April 2025. The host-level graph consists of 309.2 million nodes and 2.9 billion edges, and the domain-level graph consists of 157.1 million nodes and 2.1 billion edges.

Thom Vaughan

Thom is Principal Technologist at the Common Crawl Foundation.
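
To put the node and edge counts quoted in the Web Graphs announcement in perspective, the short Python sketch below computes the mean number of edges per node at each level. The function name `mean_edges_per_node` and the variable names are illustrative assumptions, not part of any Common Crawl tooling, and the inputs are the rounded figures from the announcement, so the results are approximate.

```python
# Rough arithmetic on the rounded figures from the announcement.
# All names here are hypothetical and exist only for this illustration.

def mean_edges_per_node(nodes: float, edges: float) -> float:
    """Return the average number of edges per node in a graph."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return edges / nodes

# Host-level graph: 309.2 million nodes, 2.9 billion edges.
host_avg = mean_edges_per_node(309.2e6, 2.9e9)

# Domain-level graph: 157.1 million nodes, 2.1 billion edges.
domain_avg = mean_edges_per_node(157.1e6, 2.1e9)

print(f"host level:   {host_avg:.1f} edges per node")
print(f"domain level: {domain_avg:.1f} edges per node")
```

On these rounded inputs the domain-level graph comes out denser per node than the host-level graph, which is what one would expect when many hosts are collapsed into a single domain node.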


Over 250 billion pages spanning 15 years.

Free and open corpus since 2007.

Cited in over 10,000 research papers.

3–5 billion new pages added each month.

Featured Papers:

Research on Free Expression Online

Jeffrey Knockel, Jakub Dalek, Noura Aljizawi, Mohamed Ahmed, Levi Meletti, and Justin Lau

Banned Books: Analysis of Censorship on Amazon.com

Analyzing the Australian Web with Web Graphs: Harmonic Centrality at the Domain Level

Xian Gong, Paul X. McCarthy, Marian-Andrei Rizoiu, Paolo Boldi

Harmony in the Australian Domain Space

The Dangers of Hijacked Hyperlinks

Kevin Saric, Felix Savins, Gowri Sankar Ramachandran, Raja Jurdak, Surya Nepal

Hyperlink Hijacking: Exploiting Erroneous URL Links to Phantom Domains

Enhancing Computational Analysis