August 2016 Crawl Archive Now Available

The crawl archive for August 2016 is now available! The archive, located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-36/, contains more than 1.61 billion web pages.

To extend the seed list, we've added 50 million hosts from the Common Search host-level PageRank data set. While many of these hosts may already be known, and some may not provide crawlable content, the number of crawled hosts has grown by 18 million (or 50%), and there are 8 million more unique domains (up 35%).

Together with the August 2016 crawl archive, we also release data sets containing robots.txt files and responses without content (404s, redirects, etc.). More information can be found in a separate blog post.

To assist with exploring and using the dataset, we provide gzipped files that list:

  • all segments
  • all WARC files
  • all WAT files
  • all WET files

By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you get the S3 or HTTP path, respectively.
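As a quick illustration, the short Python sketch below turns one of these listings into full fetch locations. The listing file name warc.paths.gz is an assumption here; substitute whichever of the gzipped lists you need.

    import gzip
    import urllib.request

    PREFIX_HTTP = "https://commoncrawl.s3.amazonaws.com/"
    PREFIX_S3 = "s3://commoncrawl/"

    # Assumed name of the gzipped list of WARC files for this crawl.
    LISTING = PREFIX_HTTP + "crawl-data/CC-MAIN-2016-36/warc.paths.gz"

    with urllib.request.urlopen(LISTING) as resp:
        paths = gzip.decompress(resp.read()).decode("utf-8").splitlines()

    # Each line is a path relative to the commoncrawl bucket; prepending a
    # prefix yields an S3 or HTTP location.
    for path in paths[:3]:
        print(PREFIX_S3 + path)
        print(PREFIX_HTTP + path)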

The Common Crawl URL Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2016-36/.

For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line client for common use cases of the URL index.
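For example, the following Python sketch queries the index server for captures of a single URL in this crawl. The endpoint name CC-MAIN-2016-36-index and the url/output parameters follow our reading of the Index Server API; consult its documentation for the authoritative details.

    import json
    import urllib.request
    from urllib.parse import urlencode

    # Ask the CDX-style endpoint for captures of one URL in this crawl's index.
    params = urlencode({"url": "commoncrawl.org", "output": "json"})
    query = "http://index.commoncrawl.org/CC-MAIN-2016-36-index?" + params

    with urllib.request.urlopen(query) as resp:
        for line in resp.read().decode("utf-8").splitlines():
            record = json.loads(line)
            # Each record names the WARC file plus the byte offset and length of
            # the capture, enough to fetch that single page with a range request.
            print(record.get("filename"), record.get("offset"), record.get("length"))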

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information and packages.

Data Sets Containing Robots.txt Files and Non-200 Responses

Together with the crawl archive for August 2016, we release two data sets containing:

  • robots.txt files (or whatever servers return in response to a GET request for /robots.txt)
  • server responses with HTTP status code other than 200 (404s, redirects, etc.)

The data may be useful to anyone interested in web science. For instance, redirects are substantial elements of web graphs, where they can be treated like ordinary links. The data may also be useful to people developing crawlers, as it allows robots.txt parsers to be tested against a huge real-world data set.

This data is provided separately from the crawl archive because it is of little use for natural language analysis: robots.txt files are meant to be read by crawlers, and the content served alongside 404s, redirects, etc. is usually auto-generated and contains only standardized phrases such as “page not found” or “document has moved”.

The new data sets are available as WARC files in subdirectories of the August 2016 crawl archives:

  • s3://commoncrawl/crawl-data/CC-MAIN-2016-36/segments/*/robotstxt/ for the robots.txt responses, and
  • s3://commoncrawl/crawl-data/CC-MAIN-2016-36/segments/*/crawldiagnostics/ for 404s, redirects, etc.

Replace the wildcard * with each segment name to get the full list of folders. Alternatively, we provide lists of all robots.txt WARC files and of all WARC files containing non-200 HTTP status code responses.
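To give an idea of how these files can be consumed, here is an illustrative Python sketch that streams the first robots.txt WARC file and prints the target URI of some response records. It assumes the listing is named robotstxt.paths.gz and relies on the third-party warcio library; treat it as a starting point rather than a finished tool.

    import gzip
    import urllib.request

    from warcio.archiveiterator import ArchiveIterator  # pip install warcio

    PREFIX = "https://commoncrawl.s3.amazonaws.com/"
    # Assumed name of the gzipped list of robots.txt WARC files.
    LISTING = PREFIX + "crawl-data/CC-MAIN-2016-36/robotstxt.paths.gz"

    with urllib.request.urlopen(LISTING) as resp:
        first_warc = gzip.decompress(resp.read()).decode("utf-8").splitlines()[0]

    # Stream the first WARC file and show which robots.txt URLs it contains.
    with urllib.request.urlopen(PREFIX + first_warc) as warc_stream:
        for i, record in enumerate(ArchiveIterator(warc_stream)):
            if record.rec_type == "response":
                print(record.rec_headers.get_header("WARC-Target-URI"))
            if i >= 50:  # stop after a handful of records for this demo
                break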

Please share your feedback on these new data sets and let us know whether we should continue to publish and update them with every monthly crawl.

July 2016 Crawl Archive Now Available

The crawl archive for July 2016 is now available! The archive, located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-30/, contains more than 1.73 billion web pages.

To assist with exploring and using the dataset, we provide gzipped files that list:

  • all segments
  • all WARC files
  • all WAT files
  • all WET files

By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you get the S3 or HTTP path, respectively.

The Common Crawl URL Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2016-30/.

For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line client for common use cases of the URL index.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information and packages.

June 2016 Crawl Archive Now Available

The crawl archive for June 2016 is now available! The archive, located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-26/, contains more than 1.23 billion web pages.

To assist with exploring and using the dataset, we provide gzipped files that list:

  • all segments
  • all WARC files
  • all WAT files
  • all WET files

By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you get the S3 or HTTP path, respectively.

The Common Crawl URL Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2016-26/.

For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line client for common use cases of the URL index.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information and packages.

May 2016 Crawl Archive Now Available

The crawl archive for May 2016 is now available! More than 1.46 billion web pages are in the archive, which is located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-22/.

To assist with exploring and using the dataset, we provide gzipped files that list:

  • all segments
  • all WARC files
  • all WAT files
  • all WET files

By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you get the S3 or HTTP path, respectively.

The Common Crawl URL Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2016-22/.

For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line client for common use cases of the URL index.

We are grateful to our friends at Moz for donating a seed list of 400 million URLs to enhance the Common Crawl. The seeds from Moz were used for the May crawl in addition to the seeds from the preceding crawl, and Moz URL data will be incorporated into future crawls as well.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information and packages.

April 2016 Crawl Archive Now Available

The crawl archive for April 2016 is now available! More than 1.33 billion webpages are in the archive, which is located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-18/.

To assist with exploring and using the dataset, we provide gzipped files that list:

  • all segments
  • all WARC files
  • all WAT files
  • all WET files

By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you get the S3 or HTTP path, respectively.

The Common Crawl URL Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2016-18/.

For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line client for common use cases of the URL index.

Note that the April crawl is based on the same URL seed list as the preceding crawl of February 2016. However, the manner in which the crawler follows redirects has changed: redirects are no longer followed immediately; instead, redirect targets found during the current crawl are recorded and fetched in the subsequent crawl. This avoids duplicates where exactly the same URL appears in multiple segments (e.g., one of the commoncrawl.org pages); the February crawl contains almost 10% such duplicates.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information and packages.

Welcome, Sebastian!

It is a pleasure to officially announce that Sebastian Nagel joined Common Crawl as Crawl Engineer in April. Sebastian brings to Common Crawl a unique blend of experience, skills, knowledge (and enthusiasm!) to complement his role and the organization.

Sebastian has a PhD in Computational Linguistics and several years of experience as a programmer working in search and data. In addition to hands-on experience maintaining and improving a Nutch-based crawler like that of Common Crawl, Sebastian is a core committer to and current chair of the open-source Apache Nutch project. Sebastian’s knowledge of machine learning techniques and natural language processing components of web crawling will help Common Crawl continually improve on and optimize the crawl process and its results.

With Sebastian on board, we have both the competence and momentum to take Common Crawl to the next level.

February 2016 Crawl Archive Now Available

As an interim crawl engineer for Common Crawl, I am pleased to announce that the crawl archive for February 2016 is now available! This crawl archive holds more than 1.73 billion URLs. The files are located in the commoncrawl bucket at /crawl-data/CC-MAIN-2016-07/.

To assist with exploring and using the dataset, we've provided gzipped files that list:

  • all segments
  • all WARC files
  • all WAT files
  • all WET files

By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you get the S3 or HTTP path, respectively.

The Common Crawl URL Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2016-07/.

For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line client for common use cases of the URL index.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in big open data. Please contact [email protected] for sponsorship information and packages.

November 2015 Crawl Archive Now Available

As an interim crawl engineer for Common Crawl, I am pleased to announce that the crawl archive for November 2015 is now available! This crawl archive is over 151 TB in size and holds more than 1.82 billion URLs. The files are located in the commoncrawl bucket at /crawl-data/CC-MAIN-2015-48/.

To assist with exploring and using the dataset, we've provided gzipped files that list:

  • all segments
  • all WARC files
  • all WAT files
  • all WET files

By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you get the S3 or HTTP path, respectively.

The Common Crawl URL Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2015-48/.

For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line client for common use cases of the URL index.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in big open data! Contact [email protected] for sponsorship information and packages.

September 2015 Crawl Archive Now Available

As an interim crawl engineer for Common Crawl, I am pleased to announce that the crawl archive for September 2015 is now available! This crawl archive is over 106 TB in size and holds more than 1.32 billion URLs. The files are located in the commoncrawl bucket at /crawl-data/CC-MAIN-2015-40/.

To assist with exploring and using the dataset, we've provided gzipped files that list:

  • all segments
  • all WARC files
  • all WAT files
  • all WET files

By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you get the S3 or HTTP path, respectively.

The Common Crawl URL Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2015-40/.

For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line client for common use cases of the URL index.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in big open data! Contact [email protected] for sponsorship information and packages.