Blog

The latest news, interviews, technologies, and resources.

Measuring Crawled Coverage of a Website in Common Crawl

How can we measure how many pages we’ve crawled from a particular website? The answer is a lot more complicated than you might think.

Greg Lindahl

Greg is Chief Technology Officer at the Common Crawl Foundation.

Crawl Release

April 2018 Crawl Archive Now Available

The crawl archive for April 2018 is now available! The archive contains 3.1 billion web pages and 230 TiB of uncompressed content, crawled between April 19th and 27th.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

March 2018 Crawl Archive Now Available

The crawl archive for March 2018 is now available! The archive contains 3.2 billion web pages and 250+ TiB of uncompressed content, crawled between March 17th and 25th.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

February 2018 Crawl Archive Now Available

The crawl archive for February 2018 is now available! The archive contains 3.4 billion web pages and 270+ TiB of uncompressed content, crawled between February 17th and 26th.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

News

Index to WARC Files and URLs in Columnar Format

We're happy to announce the release of an index to WARC files and URLs in a columnar format. The columnar format (we use Apache Parquet) makes it possible to query or process the index efficiently, saving time and computing resources. In particular, when only a few columns are accessed, recent big data tools run impressively fast.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Web Graphs

Host- and Domain-Level Web Graphs Nov/Dec/Jan 2017-2018

We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of November, December 2017 and January 2018. These graphs, along with ranked lists of hosts and domains, follow the prior web graph releases (Feb/Mar/Apr 2017, May/Jun/Jul 2017 and Aug/Sep/Oct 2017).

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

January 2018 Crawl Archive Now Available

The crawl archive for January 2018 is now available! The archive contains 3.4 billion web pages and 270 TiB of uncompressed content, crawled between January 16th and 24th.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

December 2017 Crawl Archive Now Available

The crawl archive for December 2017 is now available! The archive contains 2.9 billion web pages and over 240 TiB of uncompressed content.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

November 2017 Crawl Archive Now Available

The crawl archive for November 2017 is now available! The archive contains 3.2 billion web pages and 260 TiB of uncompressed content.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Web Graphs

Host- and Domain-Level Web Graphs Aug/Sept/Oct 2017

We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of August, September, and October 2017. These graphs, along with ranked lists of hosts and domains, follow the first (February, March, April 2017) and second (May, June, July 2017) web graph releases.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

October 2017 Crawl Archive Now Available

The crawl archive for October 2017 is now available! The archive contains 3.65 billion web pages and over 300 TiB of uncompressed content.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

September 2017 Crawl Archive Now Available

The crawl archive for September 2017 is now available! The archive contains 3.01 billion web pages and over 250 TiB of uncompressed content.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

August 2017 Crawl Archive Now Available

The crawl archive for August 2017 is now available! The archive contains 3.28 billion+ web pages and over 280 TiB of uncompressed content.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Web Graphs

Now Available: Host- and Domain-Level Web Graphs

We are pleased to announce the release of host-level and domain-level web graphs based on the published crawls of May, June, and July 2017. These graphs, along with ranked lists of hosts and domains, follow our first host-level web graph release (February, March, April 2017).

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

July 2017 Crawl Archive Now Available

The crawl archive for July 2017 is now available! The archive contains 2.89 billion+ web pages and over 240 TiB of uncompressed content.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

June 2017 Crawl Archive Now Available

The crawl archive for June 2017 is now available! The archive contains 3.16 billion+ web pages and over 260 TiB of uncompressed content.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

May 2017 Crawl Archive Now Available

The crawl archive for May 2017 is now available! The archive contains 2.96 billion+ web pages and over 250 TiB of uncompressed content.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Web Graphs

Common Crawl's First In-House Web Graph

We are pleased to announce the release of a host-level web graph of recent monthly crawls (February, March, April 2017). The graph consists of 385 million nodes and 2.5 billion edges.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

April 2017 Crawl Archive Now Available

The crawl archive for April 2017 is now available! The archive contains 2.94 billion+ web pages and over 250 TiB of uncompressed content.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

March 2017 Crawl Archive Now Available

The crawl archive for March 2017 is now available! The archive contains 3.07 billion+ web pages and over 250 TiB of uncompressed content.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

February 2017 Crawl Archive Now Available

The crawl archive for February 2017 is now available! The archive contains 3.08 billion+ web pages and over 250 TiB of uncompressed content.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

January 2017 Crawl Archive Now Available

The crawl archive for January 2017 is now available! The archive contains more than 3.14 billion web pages and about 250 TiB of uncompressed content.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

December 2016 Crawl Archive Now Available

The crawl archive for December 2016 is now available! The archive contains more than 2.85 billion web pages.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

October 2016 Crawl Archive Now Available

The crawl archive for October 2016 is now available! The archive contains more than 3.25 billion web pages.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

September 2016 Crawl Archive Now Available

The crawl archive for September 2016 is now available! The archive contains more than 1.72 billion web pages.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

News Dataset Available

We are pleased to announce the release of a new dataset containing news articles from news sites all over the world.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Analysis

Data Sets Containing Robots.txt Files and Non-200 Responses

Together with the crawl archive for August 2016, we are releasing two data sets containing robots.txt files and server responses with an HTTP status code other than 200 (404s, redirects, etc.). The data may be useful to anyone interested in web science, with various applications in the field.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

August 2016 Crawl Archive Now Available

The crawl archive for August 2016 is now available! The archive contains more than 1.61 billion web pages.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

July 2016 Crawl Archive Now Available

The crawl archive for July 2016 is now available! The archive contains more than 1.73 billion web pages.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

June 2016 Crawl Archive Now Available

The crawl archive for June 2016 is now available! The archive contains more than 1.23 billion web pages.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

May 2016 Crawl Archive Now Available

The crawl archive for May 2016 is now available! More than 1.46 billion web pages are in the archive.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

Crawl Release

April 2016 Crawl Archive Now Available

The crawl archive for April 2016 is now available! More than 1.33 billion web pages are in the archive.

Sebastian Nagel

Sebastian is a Distinguished Engineer at the Common Crawl Foundation.

News

Welcome, Sebastian!

It is a pleasure to officially announce that Sebastian Nagel joined Common Crawl as Crawl Engineer in April. Sebastian brings to Common Crawl a unique blend of experience, skills, knowledge (and enthusiasm!) to complement his role and the organization.

February 2016 Crawl Archive Now Available

November 2015 Crawl Archive Now Available

September 2015 Crawl Archive Now Available

August 2015 Crawl Archive Available

Web Image Size Prediction for Efficient Focused Image Crawling

July 2015 Crawl Archive Available

June 2015 Crawl Archive Available

May 2015 Crawl Archive Available

April 2015 Crawl Archive Available

March 2015 Crawl Archive Available

Announcing the Common Crawl Index!

Evaluating graph computation systems

February 2015 Crawl Archive Available

5 Good Reads in Big Open Data: March 26 2015

5 Good Reads in Big Open Data: March 20 2015

5 Good Reads in Big Open Data: March 13 2015

5 Good Reads in Big Open Data: March 6 2015

January 2015 Crawl Archive Available

5 Good Reads in Big Open Data: February 27 2015

Analyzing a Web graph with 129 billion edges using FlashGraph

5 Good Reads in Big Open Data: Feb 20 2015

WikiReverse- Visualizing Reverse Links with the Common Crawl Archive

5 Good Reads in Big Open Data: Feb 13 2015

5 Good Reads in Big Open Data: Feb 6 2015

The Promise of Open Government Data & Where We Go Next

December 2014 Crawl Archive Available

November 2014 Crawl Archive Available

Please Donate To Common Crawl!

October 2014 Crawl Archive Available

September 2014 Crawl Archive Available

August 2014 Crawl Data Available

Web Data Commons Extraction Framework for the Distributed Processing of CC Data

July 2014 Crawl Data Available

April 2014 Crawl Data Available

Navigating the WARC file format

March 2014 Crawl Data Now Available

Common Crawl's Move to Nutch

Lexalytics Text Analysis Work with Common Crawl Data

Winter 2013 Crawl Data Now Available

New Crawl Data Available!

Hyperlink Graph from Web Data Commons

Startup Profile: SwiftKey’s Head Data Scientist on the Value of Common Crawl’s Open Data

A Look Inside Our 210TB 2012 Web Corpus

Professor Jim Hendler Joins the Common Crawl Advisory Board!

URL Search Tool!

The Winners of The Norvig Web Data Science Award

Analysis of the NCSU Library URLs in the Common Crawl Index