The crawl archive for July/August 2021 is now available! The data was crawled July 23 – August 6 and contains 3.15 billion web pages or 360 TiB of uncompressed content. It includes page captures of 1 billion new URLs, not visited in any of our prior crawls.
Archiving of robots.txt files was improved. A robots.txt file is not archived if
- the robots.txt of the target host does not allow it (in the case of an HTTP redirect), or
- URL filters exclude the entire site, e.g. if it is known in advance that a site does not allow crawling, or
- the MIME type is not applicable for robots.txt files (e.g. HTML, PDF).
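The three skip conditions above can be sketched schematically. This is an illustrative sketch only, not Common Crawl's actual crawler code; the function name, parameters, and the sample MIME-type set are assumptions for the example:

```python
# Schematic sketch (NOT the actual Common Crawl crawler logic) of the three
# conditions under which a robots.txt capture is skipped, as listed above.

# MIME types that are not plausible robots.txt payloads (e.g. HTML, PDF);
# the exact set used by the crawler is an assumption here.
NON_ROBOTS_MIME_TYPES = {"text/html", "application/pdf"}

def should_archive_robots_txt(redirect_target_allows: bool,
                              site_excluded_by_url_filter: bool,
                              mime_type: str) -> bool:
    """Return True if a fetched robots.txt response should be archived."""
    if not redirect_target_allows:
        # After an HTTP redirect, the target host's robots.txt forbids it.
        return False
    if site_excluded_by_url_filter:
        # The entire site is excluded, e.g. known ahead of time to disallow crawling.
        return False
    if mime_type in NON_ROBOTS_MIME_TYPES:
        # The response is not an applicable robots.txt payload.
        return False
    return True
```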
Archive Location and Download
The July/August crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2021-31/.
To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
By prepending either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ to each line, you obtain the S3 and HTTPS paths, respectively.
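The prefixing step can be done with a few lines of code. A minimal sketch: it assumes a local, gzipped path listing (e.g. one of the *.paths.gz files mentioned above) and simply prepends the two prefixes given in the text:

```python
# Sketch: expand the relative paths in a downloaded *.paths.gz listing into
# full S3 and HTTPS URLs by prepending the prefixes described above.
import gzip

S3_PREFIX = "s3://commoncrawl/"
HTTPS_PREFIX = "https://commoncrawl.s3.amazonaws.com/"

def expand_paths(paths_gz_file):
    """Yield (s3_url, https_url) pairs for each line of a gzipped path listing."""
    with gzip.open(paths_gz_file, mode="rt") as fh:
        for line in fh:
            path = line.strip()
            if path:
                yield S3_PREFIX + path, HTTPS_PREFIX + path
```

The HTTPS URLs can be fetched with any HTTP client; the s3:// URLs require an S3-capable tool such as the AWS CLI.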
| File List | Path | #Files | Total Size, compressed (TiB) |
| --- | --- | --- | --- |
| Non-200 responses files | CC-MAIN-2021-31/non200responses.paths.gz | 72000 | 1.98 |
| URL index files | CC-MAIN-2021-31/cc-index.paths.gz | 302 | 0.23 |
Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.