Data Sets Containing Robots.txt Files and Non-200 Responses

Together with the crawl archive for August 2016, we release two data sets containing:

  • robots.txt files (or whatever servers return in response to a GET request for /robots.txt)
  • server responses with HTTP status code other than 200 (404s, redirects, etc.)

The data may be useful to anyone interested in web science, with various applications in the field. For instance, redirects are a substantial part of web graphs, where they can be treated as equivalent to ordinary links. The data may also be useful to people developing crawlers, as it enables testing robots.txt parsers against a huge real-world data set.
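To illustrate the parser-testing use case, here is a minimal sketch that feeds a single robots.txt payload (e.g., one extracted from the new WARC files) to Python's standard urllib.robotparser; the file name and user agent below are placeholders, and extracting payloads from the WARC files themselves would additionally require a WARC reader:

```python
# Minimal sketch: test Python's built-in robots.txt parser against one
# robots.txt payload. The file name and user agent are placeholders.
from urllib import robotparser

with open("example-robots.txt") as f:   # payload of one robots.txt response
    lines = f.read().splitlines()

rp = robotparser.RobotFileParser()
rp.parse(lines)                         # parse the rules

# Ask whether a given crawler may fetch a given URL under these rules.
print(rp.can_fetch("ExampleBot/1.0", "http://example.com/private/page.html"))
```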

This data is provided separately from the main crawl archive because it is of limited use for the analysis of natural language content: robots.txt files are meant to be read by crawlers, and the content served with 404s, redirects, etc. is usually auto-generated and contains only standardized phrases such as “page not found” or “document has moved”.

The new data sets are available as WARC files in subdirectories of the August 2016 crawl archives:

  • s3://commoncrawl/crawl-data/CC-MAIN-2016-36/segments/*/robotstxt/ for the robots.txt responses, and
  • s3://commoncrawl/crawl-data/CC-MAIN-2016-36/segments/*/crawldiagnostics/ for 404s, redirects, etc.

Replace the wildcard * with each segment name to get the full list of folders. Alternatively, we provide lists of all robots.txt WARC files and of all WARC files containing non-200 HTTP status code responses.
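The new WARC files can also be enumerated programmatically. Here is a minimal sketch using boto3 with anonymous (unsigned) access, which the commoncrawl bucket permits; swap "robotstxt" for "crawldiagnostics" to list the non-200 responses instead:

```python
# Sketch: list the robots.txt WARC files of the August 2016 crawl via
# anonymous S3 access to the commoncrawl bucket.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
paginator = s3.get_paginator("list_objects_v2")

for page in paginator.paginate(
    Bucket="commoncrawl",
    Prefix="crawl-data/CC-MAIN-2016-36/segments/",
):
    for obj in page.get("Contents", []):
        if "/robotstxt/" in obj["Key"]:
            print(obj["Key"])
```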

Please share your feedback on these new data sets and let us know whether we should continue to publish them with every monthly crawl.

July 2016 Crawl Archive Now Available

The crawl archive for July 2016 is now available! The archive, located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-30/, contains more than 1.73 billion web pages.

To assist with exploring and using the dataset, we provide gzipped files that list:

  • all segments (crawl-data/CC-MAIN-2016-30/segment.paths.gz)
  • all WARC files (crawl-data/CC-MAIN-2016-30/warc.paths.gz)
  • all WAT files (crawl-data/CC-MAIN-2016-30/wat.paths.gz)
  • all WET files (crawl-data/CC-MAIN-2016-30/wet.paths.gz)

By simply prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths, respectively.
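As a concrete illustration, here is a minimal Python sketch (assuming warc.paths.gz has already been downloaded to the working directory) that expands the path list into full S3 and HTTPS URLs:

```python
# Minimal sketch: expand a gzipped path list into full S3 and HTTPS URLs.
# Assumes warc.paths.gz was downloaded to the current directory.
import gzip

with gzip.open("warc.paths.gz", "rt") as f:
    for line in f:
        path = line.strip()
        print("s3://commoncrawl/" + path)
        print("https://commoncrawl.s3.amazonaws.com/" + path)
```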

The Common Crawl URL Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2016-30/.

For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line client for common use cases of the URL index.
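To give a flavor of the Index Server API, the following sketch queries the index for captures of a single URL; the domain is just an example, and each JSON line in the response points to a WARC file, byte offset, and length in the commoncrawl bucket:

```python
# Sketch: look up captures of one URL in the CC-MAIN-2016-30 index.
import json
import requests

resp = requests.get(
    "http://index.commoncrawl.org/CC-MAIN-2016-30-index",
    params={"url": "commoncrawl.org", "output": "json"},
)
resp.raise_for_status()

# One JSON record per line; print where each capture is stored.
for line in resp.text.splitlines():
    record = json.loads(line)
    print(record["timestamp"], record["url"], record["filename"])
```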

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information and packages.

June 2016 Crawl Archive Now Available

The crawl archive for June 2016 is now available! The archive, located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-26/, contains more than 1.23 billion web pages.

To assist with exploring and using the dataset, we provide gzipped files that list:

  • all segments (crawl-data/CC-MAIN-2016-26/segment.paths.gz)
  • all WARC files (crawl-data/CC-MAIN-2016-26/warc.paths.gz)
  • all WAT files (crawl-data/CC-MAIN-2016-26/wat.paths.gz)
  • all WET files (crawl-data/CC-MAIN-2016-26/wet.paths.gz)

By simply prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths, respectively.

The Common Crawl URL Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2016-26/.

For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line client for common use cases of the URL index.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information and packages.

May 2016 Crawl Archive Now Available

The crawl archive for May 2016 is now available! More than 1.46 billion web pages are in the archive, which is located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-22/.

To assist with exploring and using the dataset, we provide gzipped files that list:

  • all segments (crawl-data/CC-MAIN-2016-22/segment.paths.gz)
  • all WARC files (crawl-data/CC-MAIN-2016-22/warc.paths.gz)
  • all WAT files (crawl-data/CC-MAIN-2016-22/wat.paths.gz)
  • all WET files (crawl-data/CC-MAIN-2016-22/wet.paths.gz)

By simply prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths, respectively.

The Common Crawl URL Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2016-22/.

For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line client for common use cases of the URL index.

We are grateful to our friends at Moz for donating a seed list of 400 million URLs to enhance the Common Crawl. The seeds from Moz were used for the May crawl in addition to the seeds from the preceding crawl. Moz URL data will be incorporated into future crawls as well.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information and packages.

April 2016 Crawl Archive Now Available

The crawl archive for April 2016 is now available! More than 1.33 billion webpages are in the archive, which is located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-18/.

To assist with exploring and using the dataset, we provide gzipped files that list:

  • all segments (crawl-data/CC-MAIN-2016-18/segment.paths.gz)
  • all WARC files (crawl-data/CC-MAIN-2016-18/warc.paths.gz)
  • all WAT files (crawl-data/CC-MAIN-2016-18/wat.paths.gz)
  • all WET files (crawl-data/CC-MAIN-2016-18/wet.paths.gz)

By simply prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths, respectively.

The Common Crawl URL Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2016-18/.

For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line client for common use cases of the URL index.

Note that the April crawl is based on the same URL seed list as the preceding crawl of February 2016. However, the manner in which the crawler follows redirects has changed: redirects are no longer followed immediately; instead, redirect targets discovered during the current crawl are recorded and fetched in the subsequent crawl. This approach avoids duplicates in which exactly the same URL is contained in multiple segments (e.g., one of the commoncrawl.org pages). Almost 10% of the February crawl consists of such duplicates.
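To make the new policy concrete, here is an illustrative sketch (not the actual crawler implementation) of a fetcher that records redirect targets for the next crawl instead of following them within the current one:

```python
# Illustrative sketch of the redirect policy: 3xx responses are recorded
# for the next crawl rather than followed within the current one.
from urllib.parse import urljoin
import requests

def fetch(url, next_crawl_frontier):
    resp = requests.get(url, allow_redirects=False)  # never follow redirects now
    if 300 <= resp.status_code < 400 and "Location" in resp.headers:
        # Queue the target for the *next* crawl; archive only the redirect.
        next_crawl_frontier.add(urljoin(url, resp.headers["Location"]))
        return None
    return resp.content  # archive the page payload

frontier = set()
fetch("http://commoncrawl.org/", frontier)
print(frontier)
```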

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information and packages.

February 2016 Crawl Archive Now Available

As an interim crawl engineer for Common Crawl, I am pleased to announce that the crawl archive for February 2016 is now available! This crawl archive holds more than 1.73 billion URLs. The files are located in the commoncrawl bucket at /crawl-data/CC-MAIN-2016-07/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

  • all segments (crawl-data/CC-MAIN-2016-07/segment.paths.gz)
  • all WARC files (crawl-data/CC-MAIN-2016-07/warc.paths.gz)
  • all WAT files (crawl-data/CC-MAIN-2016-07/wat.paths.gz)
  • all WET files (crawl-data/CC-MAIN-2016-07/wet.paths.gz)

By simply prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths, respectively.

The Common Crawl URL Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2016-07/.

For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line client for common use cases of the URL index.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in big open data. Please contact [email protected] for sponsorship information and packages.

November 2015 Crawl Archive Now Available

As an interim crawl engineer for Common Crawl, I am pleased to announce that the crawl archive for November 2015 is now available! This crawl archive is over 151TB in size and holds more than 1.82 billion URLs. The files are located in the commoncrawl bucket at /crawl-data/CC-MAIN-2015-48/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

  • all segments (crawl-data/CC-MAIN-2015-48/segment.paths.gz)
  • all WARC files (crawl-data/CC-MAIN-2015-48/warc.paths.gz)
  • all WAT files (crawl-data/CC-MAIN-2015-48/wat.paths.gz)
  • all WET files (crawl-data/CC-MAIN-2015-48/wet.paths.gz)

By simply prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths, respectively.

The Common Crawl URL Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2015-48/.

For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line client for common use cases of the URL index.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in big open data! Contact [email protected] for sponsorship information and packages.

September 2015 Crawl Archive Now Available

As an interim crawl engineer for Common Crawl, I am pleased to announce that the crawl archive for September 2015 is now available! This crawl archive is over 106TB in size and holds more than 1.32 billion URLs. The files are located in the commoncrawl bucket at /crawl-data/CC-MAIN-2015-40/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

  • all segments (crawl-data/CC-MAIN-2015-40/segment.paths.gz)
  • all WARC files (crawl-data/CC-MAIN-2015-40/warc.paths.gz)
  • all WAT files (crawl-data/CC-MAIN-2015-40/wat.paths.gz)
  • all WET files (crawl-data/CC-MAIN-2015-40/wet.paths.gz)

By simply prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths, respectively.

The Common Crawl URL Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2015-40/.

For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line client for common use cases of the URL index.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in big open data! Contact [email protected] for sponsorship information and packages.

August 2015 Crawl Archive Available

The crawl archive for August 2015 is now available! This crawl archive is over 149TB in size and holds more than 1.84 billion webpages. The files are located in the commoncrawl bucket at /crawl-data/CC-MAIN-2015-35/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

  • all segments (crawl-data/CC-MAIN-2015-35/segment.paths.gz)
  • all WARC files (crawl-data/CC-MAIN-2015-35/warc.paths.gz)
  • all WAT files (crawl-data/CC-MAIN-2015-35/wat.paths.gz)
  • all WET files (crawl-data/CC-MAIN-2015-35/wet.paths.gz)

By simply prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths, respectively.

The release also includes the August 2015 Common Crawl Index, constructed by Ilya Kreymer, creator of https://webrecorder.io/. The Common Crawl Index offers a fascinating new way to explore the dataset! For full details, refer to Ilya’s guest blog post.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in big open data! Contact [email protected] for sponsorship information and packages.

July 2015 Crawl Archive Available

The crawl archive for July 2015 is now available! This crawl archive is over 145TB in size and holds more than 1.81 billion webpages. The files are located in the commoncrawl bucket at /crawl-data/CC-MAIN-2015-32/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

  • all segments (crawl-data/CC-MAIN-2015-32/segment.paths.gz)
  • all WARC files (crawl-data/CC-MAIN-2015-32/warc.paths.gz)
  • all WAT files (crawl-data/CC-MAIN-2015-32/wat.paths.gz)
  • all WET files (crawl-data/CC-MAIN-2015-32/wet.paths.gz)

By simply prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths, respectively.

The release also includes the July 2015 Common Crawl Index, constructed by Ilya Kreymer, creator of https://webrecorder.io/. The Common Crawl Index offers a fascinating new way to explore the dataset! For full details, refer to Ilya’s guest blog post.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in big open data! Contact [email protected] for sponsorship information and packages.