The crawl archive for July 2016 is now available! The archive, located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-30/, contains more than 1.73 billion web pages.
To assist with exploring and using the dataset, we provide gzipped files that list:
- all segments (CC-MAIN-2016-30/segment.paths.gz)
- all WARC files (CC-MAIN-2016-30/warc.paths.gz)
- all WAT files (CC-MAIN-2016-30/wat.paths.gz)
- all WET files (CC-MAIN-2016-30/wet.paths.gz)
By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you get the S3 and HTTP paths, respectively.
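As a rough illustration of that prefixing step, here is a minimal Python sketch. It assumes you have already downloaded one of the listing files and have its gzipped bytes in memory. The helper names (read_path_listing, expand_paths) and the sample path are hypothetical and only for illustration; they are not part of any Common Crawl tooling.

```python
import gzip
import io

# The two prefixes named in the announcement.
S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://commoncrawl.s3.amazonaws.com/"


def read_path_listing(gz_bytes):
    """Decompress the bytes of a *.paths.gz listing and return its lines."""
    with gzip.open(io.BytesIO(gz_bytes), "rt", encoding="utf-8") as fh:
        return fh.read().splitlines()


def expand_paths(lines, prefix):
    """Prepend `prefix` to every non-empty line of a listing."""
    return [prefix + line.strip() for line in lines if line.strip()]


# Hypothetical stand-in for a downloaded listing; real entries will differ.
sample_listing = gzip.compress(
    b"crawl-data/CC-MAIN-2016-30/EXAMPLE-1.warc.gz\n"
    b"crawl-data/CC-MAIN-2016-30/EXAMPLE-2.warc.gz\n"
)

relative_paths = read_path_listing(sample_listing)
s3_paths = expand_paths(relative_paths, S3_PREFIX)
http_paths = expand_paths(relative_paths, HTTP_PREFIX)

for path in http_paths:
    print(path)
```

In practice you would replace sample_listing with the contents of, say, warc.paths.gz, then hand the resulting URLs to whatever download or processing tool you prefer.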
The Common Crawl URL Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2016-30/.
Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information and packages.