July 2016 Crawl Archive Now Available

The crawl archive for July 2016 is now available! The archive located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-30/ contains more than 1.73 billion web pages.

To assist with exploring and using the dataset, we provide gzipped files that list:

By simply adding either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.

The Common Crawl URL Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2016-30/.

For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line tool client for common use cases of the url index.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information and packages.

June 2016 Crawl Archive Now Available

The crawl archive for June 2016 is now available! The archive located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-26/ contains more than 1.23 billion web pages.

To assist with exploring and using the dataset, we provide gzipped files that list:

By simply adding either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.

The Common Crawl URL Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2016-26/.

For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line tool client for common use cases of the url index.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information and packages.

May 2016 Crawl Archive Now Available

The crawl archive for May 2016 is now available! More than 1.46 billion web pages are in the archive, which is located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-22/.

To assist with exploring and using the dataset, we provide gzipped files that list:

By simply adding either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.

The Common Crawl URL Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2016-22/.

For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line tool client for common use cases of the url index.

We are grateful to our friends at Moz for donating a seed list of 400 Million URL’s to enhance the Common Crawl. The seeds from Moz were used for the May crawl in addition to the seeds from the preceding crawl. Moz URL data will be incorporated into future crawls as well.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information and packages.

April 2016 Crawl Archive Now Available

The crawl archive for April 2016 is now available! More than 1.33 billion webpages are in the archive, which is located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-18/.

To assist with exploring and using the dataset, we provide gzipped files that list:

By simply adding either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.

The Common Crawl URL Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2016-18/.

For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line tool client for common use cases of the url index.

Note that the April crawl is based on the same URL seed list as the preceding crawl of February 2016. However, the manner in which the crawler follows redirects is changed: redirects are not followed immediately; instead redirect targets from the current crawl are recorded and followed by the subsequent crawl. This approach serves to avoid duplicates with exactly the same URL contained in multiple segments (e.g., one of the commoncrawl.org pages). The February crawl contains almost 10% such duplicates.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information and packages.

February 2016 Crawl Archive Now Available

As an interim crawl engineer for CommonCrawl, I am pleased to announce that the crawl archive for February 2016 is now available! This crawl archive holds more than 1.73 billion urls. The files are located in the commoncrawl bucket at /crawl-data/CC-MAIN-2016-07/

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

By simply adding either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.

The CommonCrawl Url Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2016-07/.

For more information on working with the url index, please refer to the previous blog post or the Index Server API. There is also a command-line tool client for common use cases of the url index.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in big open data. Please contact [email protected] for sponsorship information and packages.

November 2015 Crawl Archive Now Available

As an interim crawl engineer for CommonCrawl, I am pleased to announce that the crawl archive for November 2015 is now available! This crawl archive is over 151TB in size and holds more than 1.82 billion urls. The files are located in the commoncrawl bucket at /crawl-data/CC-MAIN-2015-48/

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

By simply adding either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.

The CommonCrawl Url Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2015-48/

For more information on working with the url index, please refer to the previous blog post or the Index Server API. There is also a command-line tool client for common use cases of the url index.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in big open data! Contact [email protected] for sponsorship information and packages.

September 2015 Crawl Archive Now Available

As an interim crawl engineer for CommonCrawl, I am pleased to announce that the crawl archive for September 2015 is now available! This crawl archive is over 106TB in size and holds more than 1.32 billion urls. The files are located in the commoncrawl bucket at /crawl-data/CC-MAIN-2015-40/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

By simply adding either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.

The CommonCrawl Url Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2015-40/

For more information on working with the url index, please refer to the previous blog post or the Index Server API. There is also a command-line tool client for common use cases of the url index.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in big open data! Contact [email protected] for sponsorship information and packages.

August 2015 Crawl Archive Available

The crawl archive for August 2015 is now available! This crawl archive is over 149TB in size and holds more than 1.84 billion webpages. The files are located in the commoncrawl bucket at /crawl-data/CC-MAIN-2015-35/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

By simply adding either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.

The release also includes the August 2015 Common Crawl Index, constructed by Ilya Kreymer, creator of https://webrecorder.io/. The Common Crawl Index offers a fascinating and new way to explore the dataset! For full details, refer to Ilya’s guest blog post.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in big open data! Contact [email protected] for sponsorship information and packages.

July 2015 Crawl Archive Available

The crawl archive for June 2015 is now available! This crawl archive is over 145TB in size and holds more than 1.81 billion webpages. The files are located in the commoncrawl bucket at /crawl-data/CC-MAIN-2015-32/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

By simply adding either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.

The release also includes the July 2015 Common Crawl Index, constructed by Ilya Kreymer, creator of https://webrecorder.io/. The Common Crawl Index offers a fascinating and new way to explore the dataset! For full details, refer to Ilya’s guest blog post.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in big open data! Contact [email protected] for sponsorship information and packages.

June 2015 Crawl Archive Available

The crawl archive for June 2015 is now available! This crawl archive is over 131TB in size and holds more than 1.67 billion webpages. The files are located in the commoncrawl bucket at /crawl-data/CC-MAIN-2015-27/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

By simply adding either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you end up with the S3 and HTTP paths respectively.

The release also includes the June 2015 Common Crawl Index, constructed by Ilya Kreymer, creator of https://webrecorder.io/. The Common Crawl Index offers a fascinating and new way to explore the dataset! For full details, refer to Ilya’s guest blog post.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in big open data! Contact [email protected] for sponsorship information and packages.