The crawl archive for March/April 2020 is now available! It contains 2.85 billion web pages or 280 TiB of uncompressed content, crawled between March 28th and April 10th. It includes page captures of 1 billion URLs unknown in any of our prior crawl archives.
Archive Location and Download
The March/April crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2020-16/.
To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files.
Prefixing each line with either s3://commoncrawl/ or https://commoncrawl.s3.amazonaws.com/ yields the S3 or HTTP path, respectively.
| ||File List||#Files||Total Size|
|Non-200 responses files||CC-MAIN-2020-16/non200responses.paths.gz||56000||1.39|
|URL index files||CC-MAIN-2020-16/cc-index.paths.gz||302||0.21|
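As a minimal sketch of the prefixing step described above, the following Python snippet decompresses a path listing and prepends a prefix to each line. The helper name `expand_paths` is hypothetical, and the sample entry is a made-up placeholder rather than a real file in the archive. In practice the bytes would come from one of the downloaded `*.paths.gz` files.

```python
import gzip
import io

# Prefixes quoted in the announcement
S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://commoncrawl.s3.amazonaws.com/"


def expand_paths(gz_bytes, prefix=HTTP_PREFIX):
    """Decompress a gzipped path listing and prefix every non-empty line.

    gz_bytes: raw bytes of a *.paths.gz file (one relative path per line).
    prefix:   S3_PREFIX or HTTP_PREFIX.
    """
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as fh:
        return [prefix + line.strip() for line in fh if line.strip()]


if __name__ == "__main__":
    # In-memory stand-in for a downloaded listing; this entry is illustrative only
    sample = gzip.compress(b"crawl-data/CC-MAIN-2020-16/example/file-00000.gz\n")
    for url in expand_paths(sample):
        print(url)
```

Passing `S3_PREFIX` as the second argument produces S3 paths instead of HTTP URLs.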
Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.