April 2017 Crawl Archive Now Available

The crawl archive for April 2017 is now available! The archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2017-17/. It contains 2.94 billion+ web pages and over 250 TiB of uncompressed content.

To improve coverage and freshness, we ranked all hosts found in the February and March 2017 crawls by Harmonic Centrality, and

  • added 390 million URLs within a maximum of 2 links (“hops”) away from the home pages of the top 16 million hosts;
  • used sitemaps (if provided by any of these 16 million hosts) to take a random sample and add a further 160 million URLs.

About 56% of the 2.94 billion URLs overlap with the preceding March 2017 crawl; 550 million URLs are not contained in any earlier crawl archive.

To assist with exploring and using the dataset, we provide gzipped files which list all segments and all WARC, WAT, and WET files. By prepending either s3://commoncrawl/ or https://data.commoncrawl.org/ to each line, you obtain the full S3 and HTTP paths, respectively.
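
As a quick illustration, here is a minimal Python sketch (assuming warc.paths.gz has already been downloaded locally) that expands each listed path into both forms:

    import gzip

    # Expand each entry in a downloaded path listing (here: warc.paths.gz,
    # fetched from crawl-data/CC-MAIN-2017-17/) into its S3 and HTTP forms.
    with gzip.open("warc.paths.gz", "rt") as listing:
        for line in listing:
            path = line.strip()
            print("s3://commoncrawl/" + path)
            print("https://data.commoncrawl.org/" + path)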

File type          File List                                   #Files   Total Size Compressed (TiB)
Segments           CC-MAIN-2017-17/segment.paths.gz               100        –
WARC files         CC-MAIN-2017-17/warc.paths.gz                64700    55.03
WAT files          CC-MAIN-2017-17/wat.paths.gz                 64700    19.77
WET files          CC-MAIN-2017-17/wet.paths.gz                 64700     8.95
Robots.txt files   CC-MAIN-2017-17/robotstxt.paths.gz           64700     0.11
Non-200 responses  CC-MAIN-2017-17/non200responses.paths.gz     64700     0.84

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2017-17/. For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line client for common use cases of the URL index.
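
As a quick, unofficial illustration of the index API: the server can be queried over plain HTTP, and with output=json it streams one JSON record per line. The sketch below assumes the CC-MAIN-2017-17-index endpoint and uses commoncrawl.org/* as a sample URL pattern:

    import json
    import urllib.parse
    import urllib.request

    # Ask the CC-MAIN-2017-17 index for captures matching a URL pattern;
    # with output=json the server streams one JSON record per line.
    query = urllib.parse.urlencode({"url": "commoncrawl.org/*", "output": "json"})
    endpoint = "https://index.commoncrawl.org/CC-MAIN-2017-17-index?" + query
    with urllib.request.urlopen(endpoint) as response:
        for line in response:
            record = json.loads(line)
            # filename/offset/length locate the raw capture inside a WARC file.
            print(record["url"], record["filename"],
                  record["offset"], record["length"])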

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

March 2017 Crawl Archive Now Available

The crawl archive for March 2017 is now available! The archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2017-13/. It contains 3.07 billion+ web pages and over 250 TiB of uncompressed content.

To improve coverage and freshness, we ranked all hosts found in the February 2017 crawl by Harmonic Centrality, and

  • added 600 million URLs within a maximum of 2 links (“hops”) away from the home pages of the top 8 million hosts;
  • used sitemaps (if provided by any of these 8 million hosts) to take a random sample and add a further 100 million URLs.

To assist with exploring and using the dataset, we provide gzipped files which list all segments and all WARC, WAT, and WET files. By prepending either s3://commoncrawl/ or https://data.commoncrawl.org/ to each line, you obtain the full S3 and HTTP paths, respectively.

File type          File List                                   #Files   Total Size Compressed (TiB)
Segments           CC-MAIN-2017-13/segment.paths.gz               100        –
WARC files         CC-MAIN-2017-13/warc.paths.gz                66500    60.74
WAT files          CC-MAIN-2017-13/wat.paths.gz                 66500    20.86
WET files          CC-MAIN-2017-13/wet.paths.gz                 66500     9.30
Robots.txt files   CC-MAIN-2017-13/robotstxt.paths.gz           66500     0.10
Non-200 responses  CC-MAIN-2017-13/non200responses.paths.gz     66500     0.82

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2017-13/. For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line client for common use cases of the URL index.
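
A common follow-on step (shown here as a sketch, not as official tooling) is to take the filename, offset, and length fields from an index record and fetch just that capture from the archive with an HTTP Range request; each record is a self-contained gzip member, so it decompresses independently:

    import gzip
    import urllib.request

    def fetch_warc_record(filename, offset, length):
        """Fetch a single capture from the archive via an HTTP Range request.

        filename, offset, and length come from a URL index record (the index
        returns offset and length as strings, so cast them to int first).
        """
        first, last = offset, offset + length - 1
        request = urllib.request.Request(
            "https://data.commoncrawl.org/" + filename,
            headers={"Range": "bytes=%d-%d" % (first, last)},
        )
        with urllib.request.urlopen(request) as response:
            # Each record is an independently gzipped member of the WARC file,
            # so the returned byte range decompresses on its own.
            return gzip.decompress(response.read())

Passing the fields from an index query (with offset and length cast to int) returns the complete WARC record, headers and payload included.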

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

November 2015 Crawl Archive Now Available

As an interim crawl engineer for Common Crawl, I am pleased to announce that the crawl archive for November 2015 is now available! This crawl archive is over 151 TB in size and holds more than 1.82 billion URLs. The files are located in the commoncrawl bucket at crawl-data/CC-MAIN-2015-48/.

To assist with exploring and using the dataset, we’ve provided gzipped files that list:

  • all segments (CC-MAIN-2015-48/segment.paths.gz)
  • all WARC files (CC-MAIN-2015-48/warc.paths.gz)
  • all WAT files (CC-MAIN-2015-48/wat.paths.gz)
  • all WET files (CC-MAIN-2015-48/wet.paths.gz)

By prepending either s3://commoncrawl/ or https://data.commoncrawl.org/ to each line, you obtain the full S3 and HTTP paths, respectively.

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2015-48/.

For more information on working with the URL index, please refer to the previous blog post or the Index Server API. There is also a command-line client for common use cases of the URL index.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in big open data. Please contact [email protected] for sponsorship information and packages.

Please Donate To Common Crawl!

Big data has the potential to change the world. The talent exists and the tools are already there. What’s lacking is access to data. Imagine the questions we could answer and the problems we could solve if talented, creative technologists could freely access more big data.

At Common Crawl, we are passionate about getting big open data into the hands of talented and creative people. Increasing access to data enables everything from business innovation to groundbreaking research.

Common Crawl is proud of what we have accomplished in 2014 thanks to our dedicated team and the support of donors like you.

This year:

  • 19 academic publications using Common Crawl data were published
  • Several Open Educational Resources designed to teach big data tools and methods were created
  • 1.3 petabytes of data, containing 18.5 billion web pages, were added to the Common Crawl corpus
  • Numerous presentations and tutorials were given at international conferences, local meetup groups, and academic workshops in six countries

100% of our funding comes from donors like you. Thank you! We accomplish a great deal with a small, dedicated staff on a limited budget, so your investment in us has a big impact.

More resources for Common Crawl means more access to data for everyone. In 2015, we have big plans to scale up crawling to more rapidly grow the Common Crawl corpus, and to expand our educational programs and catalogue of tutorials in order to invest in the next generation of data-driven technologists.

Whether it’s $10 or $10,000, Common Crawl needs your donation today.

Donate here!

Thank you very much,
Lisa Green and the Common Crawl team