Blog

The latest news, interviews, technologies, and resources.

Host- and Domain-Level Web Graphs March, April, and May 2026

We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of March, April, and May 2026. The graphs consist of 262.4 million nodes and 8.1 billion edges at the host level, and 118.8 million nodes and 4.3 billion edges at the domain level.

Michael Paris

Michael is a Senior Research Engineer at the Common Crawl Foundation.

Crawl Release

August 2017 Crawl Archive Now Available

The crawl archive for August 2017 is now available! The archive contains 3.28 billion+ web pages and over 280 TiB of uncompressed content.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Web Graphs

Now Available: Host- and Domain-Level Web Graphs

We are pleased to announce the release of host-level and domain-level web graphs based on the published crawls of May, June, and July 2017. These graphs, along with ranked lists of hosts and domains, follow our first host-level web graph (February, March, April 2017).

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

July 2017 Crawl Archive Now Available

The crawl archive for July 2017 is now available! The archive contains 2.89 billion+ web pages and over 240 TiB of uncompressed content.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

June 2017 Crawl Archive Now Available

The crawl archive for June 2017 is now available! The archive contains 3.16 billion+ web pages and over 260 TiB of uncompressed content.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

May 2017 Crawl Archive Now Available

The crawl archive for May 2017 is now available! The archive contains 2.96 billion+ web pages and over 250 TiB of uncompressed content.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Web Graphs

Common Crawl's First In-House Web Graph

We are pleased to announce the release of a host-level web graph of recent monthly crawls (February, March, April 2017). The graph consists of 385 million nodes and 2.5 billion edges.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

April 2017 Crawl Archive Now Available

The crawl archive for April 2017 is now available! The archive contains 2.94 billion+ web pages and over 250 TiB of uncompressed content.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

March 2017 Crawl Archive Now Available

The crawl archive for March 2017 is now available! The archive contains 3.07 billion+ web pages and over 250 TiB of uncompressed content.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

February 2017 Crawl Archive Now Available

The crawl archive for February 2017 is now available! The archive contains 3.08 billion+ web pages and over 250 TiB of uncompressed content.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

January 2017 Crawl Archive Now Available

The crawl archive for January 2017 is now available! The archive contains more than 3.14 billion web pages and about 250 TiB of uncompressed content.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

December 2016 Crawl Archive Now Available

The crawl archive for December 2016 is now available! The archive contains more than 2.85 billion web pages.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

October 2016 Crawl Archive Now Available

The crawl archive for October 2016 is now available! The archive contains more than 3.25 billion web pages.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

September 2016 Crawl Archive Now Available

The crawl archive for September 2016 is now available! The archive contains more than 1.72 billion web pages.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

News Dataset Available

We are pleased to announce the release of a new dataset containing news articles from news sites all over the world.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Analysis

Data Sets Containing Robots.txt Files and Non-200 Responses

Together with the crawl archive for August 2016, we are releasing two data sets containing robots.txt files and server responses with HTTP status codes other than 200 (404s, redirects, etc.). The data may be useful to anyone interested in web science, with various applications in the field.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

August 2016 Crawl Archive Now Available

The crawl archive for August 2016 is now available! The archive contains more than 1.61 billion web pages.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

July 2016 Crawl Archive Now Available

The crawl archive for July 2016 is now available! The archive contains more than 1.73 billion web pages.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

June 2016 Crawl Archive Now Available

The crawl archive for June 2016 is now available! The archive contains more than 1.23 billion web pages.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

May 2016 Crawl Archive Now Available

The crawl archive for May 2016 is now available! More than 1.46 billion web pages are in the archive.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

April 2016 Crawl Archive Now Available

The crawl archive for April 2016 is now available! More than 1.33 billion web pages are in the archive.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

News

Welcome, Sebastian!

It is a pleasure to officially announce that Sebastian Nagel joined Common Crawl as Crawl Engineer in April. Sebastian brings to Common Crawl a unique blend of experience, skills, and knowledge (and enthusiasm!) to complement his role and the organization.

February 2016 Crawl Archive Now Available

November 2015 Crawl Archive Now Available

September 2015 Crawl Archive Now Available

August 2015 Crawl Archive Available

Web Image Size Prediction for Efficient Focused Image Crawling

July 2015 Crawl Archive Available

June 2015 Crawl Archive Available

May 2015 Crawl Archive Available

April 2015 Crawl Archive Available

March 2015 Crawl Archive Available

Announcing the Common Crawl Index!

Evaluating graph computation systems

February 2015 Crawl Archive Available

5 Good Reads in Big Open Data: March 26 2015

5 Good Reads in Big Open Data: March 20 2015

5 Good Reads in Big Open Data: March 13 2015

5 Good Reads in Big Open Data: March 6 2015

January 2015 Crawl Archive Available

5 Good Reads in Big Open Data: February 27 2015

Analyzing a Web graph with 129 billion edges using FlashGraph

5 Good Reads in Big Open Data: Feb 20 2015

WikiReverse- Visualizing Reverse Links with the Common Crawl Archive

5 Good Reads in Big Open Data: Feb 13 2015

5 Good Reads in Big Open Data: Feb 6 2015

The Promise of Open Government Data & Where We Go Next

December 2014 Crawl Archive Available

November 2014 Crawl Archive Available

Please Donate To Common Crawl!

October 2014 Crawl Archive Available

September 2014 Crawl Archive Available

August 2014 Crawl Data Available

Web Data Commons Extraction Framework for the Distributed Processing of CC Data

July 2014 Crawl Data Available

April 2014 Crawl Data Available

Navigating the WARC file format

March 2014 Crawl Data Now Available

Common Crawl's Move to Nutch

Lexalytics Text Analysis Work with Common Crawl Data

Winter 2013 Crawl Data Now Available

New Crawl Data Available!

Hyperlink Graph from Web Data Commons

Startup Profile: SwiftKey’s Head Data Scientist on the Value of Common Crawl’s Open Data

A Look Inside Our 210TB 2012 Web Corpus

Professor Jim Hendler Joins the Common Crawl Advisory Board!

URL Search Tool!

The Winners of The Norvig Web Data Science Award

Analysis of the NCSU Library URLs in the Common Crawl Index

Common Crawl URL Index

blekko donates search data to Common Crawl

The Norvig Web Data Science Award

Towards Social Discovery - New Content Models; New Data; New Toolsets

Winners of the Code Contest!

Common Crawl Code Contest Extended Through the Holiday Weekend

TalentBin Adds Prizes To The Code Contest

Amazon Web Services sponsoring $50 in credit to all contest entrants!

Still time to participate in the Common Crawl code contest

Strata Conference + Hadoop World

Mat Kelcey Joins The Common Crawl Advisory Board