The crawl archive for June 2016 is now available! The archive, located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-26/, contains more than 1.23 billion web pages.
To assist with exploring and using the dataset, we provide gzipped files that list:
- all segments (CC-MAIN-2016-26/segment.paths.gz)
- all WARC files (CC-MAIN-2016-26/warc.paths.gz)
- all WAT files (CC-MAIN-2016-26/wat.paths.gz)
- all WET files (CC-MAIN-2016-26/wet.paths.gz)
Prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line of these files gives you the S3 and HTTP paths, respectively.
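As a rough illustration of that prefixing step, here is a minimal Python sketch. It assumes you have already downloaded one of the path listings and hold its gzipped bytes in memory. The function name `expand_paths` and the sample path are hypothetical and used only for demonstration; the two prefixes are the ones given above.

```python
import gzip
import io

# Prefixes quoted in the announcement.
S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://commoncrawl.s3.amazonaws.com/"


def expand_paths(gz_bytes, prefix):
    """Decompress a gzipped path listing and prepend `prefix` to each non-empty line.

    `gz_bytes` is the raw content of a file such as warc.paths.gz.
    """
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as fh:
        return [prefix + line.strip() for line in fh if line.strip()]


if __name__ == "__main__":
    # Hypothetical one-line listing standing in for a real downloaded file.
    sample = gzip.compress(b"crawl-data/CC-MAIN-2016-26/EXAMPLE/file.warc.gz\n")
    for url in expand_paths(sample, HTTP_PREFIX):
        print(url)
```

Passing `S3_PREFIX` instead of `HTTP_PREFIX` produces the S3 form of the same paths. For the full listings you would replace the sample bytes with the contents of the downloaded file.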
The Common Crawl URL Index for this crawl is available at: http://index.commoncrawl.org/CC-MAIN-2016-26/.
Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact firstname.lastname@example.org for sponsorship information and packages.