October 2020 crawl archive now available

The crawl archive for October 2020 is now available! The data was crawled between October 19th and November 1st and contains 2.71 billion web pages or 280 TiB of uncompressed content. It includes page captures of 1.5 billion new URLs, not visited in any of our prior crawls.

Archive Location and Download

The October crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2020-45/.

To assist with exploring and using the dataset, we provide gzipped files that list all segments and all WARC, WAT, and WET files.
Prepending either s3://commoncrawl/ or https://data.commoncrawl.org/ to each line yields the S3 or HTTPS path, respectively.
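The prefixing described above can be sketched in a few lines of Python. The example line is the warc.paths.gz listing file itself; any line of a paths file works the same way:

```python
# A minimal sketch of the prefixing described above: each line of a
# *.paths.gz listing is a path relative to the bucket root.
S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://data.commoncrawl.org/"

def to_urls(path_line: str) -> tuple[str, str]:
    """Return (S3 path, HTTPS path) for one line of a paths file."""
    path = path_line.strip()
    return S3_PREFIX + path, HTTP_PREFIX + path

s3, https = to_urls("crawl-data/CC-MAIN-2020-45/warc.paths.gz\n")
print(s3)     # s3://commoncrawl/crawl-data/CC-MAIN-2020-45/warc.paths.gz
print(https)  # https://data.commoncrawl.org/crawl-data/CC-MAIN-2020-45/warc.paths.gz
```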

File type                 File list                                    #Files   Total size, compressed (TiB)
Segments                  CC-MAIN-2020-45/segment.paths.gz                100
WARC files                CC-MAIN-2020-45/warc.paths.gz                 72000   63.79
WAT files                 CC-MAIN-2020-45/wat.paths.gz                  72000   18.39
WET files                 CC-MAIN-2020-45/wet.paths.gz                  72000    8.23
Robots.txt files          CC-MAIN-2020-45/robotstxt.paths.gz            72000    0.2
Non-200 responses files   CC-MAIN-2020-45/non200responses.paths.gz      72000    1.75
URL index files           CC-MAIN-2020-45/cc-index.paths.gz               302    0.21

The Common Crawl URL Index for this crawl is available at https://index.commoncrawl.org/CC-MAIN-2020-45/. The columnar index has also been updated to include this crawl.
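As a sketch, a lookup against the URL index endpoint above can be built like this, assuming the standard CDX-server-style query parameters (url= pattern, output=json) that the index server exposes; the crawl label is the one announced in this post:

```python
# Sketch: build a query URL for the Common Crawl URL index, assuming the
# standard CDX server API (url= pattern, output=json).
import urllib.parse

def index_query_url(url_pattern: str, crawl: str = "CC-MAIN-2020-45") -> str:
    """Return a query URL for captures matching url_pattern in the given crawl."""
    query = urllib.parse.urlencode({"url": url_pattern, "output": "json"})
    return f"https://index.commoncrawl.org/{crawl}-index?{query}"

print(index_query_url("commoncrawl.org/*"))
```

Fetching that URL returns one JSON record per matching capture, including the WARC filename and byte offsets needed to retrieve the record.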

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

September 2020 crawl archive now available

The crawl archive for September 2020 is now available! The data was crawled between September 18th and October 2nd and contains 3.45 billion web pages or 345 TiB of uncompressed content. It includes page captures of 1.5 billion new URLs, not visited in any of our prior crawls.

Archive Location and Download

The September crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2020-40/.

To assist with exploring and using the dataset, we provide gzipped files that list all segments and all WARC, WAT, and WET files.
Prepending either s3://commoncrawl/ or https://data.commoncrawl.org/ to each line yields the S3 or HTTPS path, respectively.

File type                 File list                                    #Files   Total size, compressed (TiB)
Segments                  CC-MAIN-2020-40/segment.paths.gz                100
WARC files                CC-MAIN-2020-40/warc.paths.gz                 79600   81.8
WAT files                 CC-MAIN-2020-40/wat.paths.gz                  79600   23.14
WET files                 CC-MAIN-2020-40/wet.paths.gz                  79600   10.28
Robots.txt files          CC-MAIN-2020-40/robotstxt.paths.gz            79600    0.22
Non-200 responses files   CC-MAIN-2020-40/non200responses.paths.gz      79600    2.36
URL index files           CC-MAIN-2020-40/cc-index.paths.gz               302    0.27

The Common Crawl URL Index for this crawl is available at https://index.commoncrawl.org/CC-MAIN-2020-40/. The columnar index has also been updated to include this crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

August 2020 crawl archive now available

The crawl archive for August 2020 is now available! It contains 2.45 billion web pages or 235 TiB of uncompressed content, crawled between August 2nd and 15th. It includes page captures of 940 million URLs unknown in any of our prior crawl archives.

Archive Location and Download

The August crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2020-34/.

To assist with exploring and using the dataset, we provide gzipped files that list all segments and all WARC, WAT, and WET files.
Prepending either s3://commoncrawl/ or https://data.commoncrawl.org/ to each line yields the S3 or HTTPS path, respectively.

File type                 File list                                    #Files   Total size, compressed (TiB)
Segments                  CC-MAIN-2020-34/segment.paths.gz                100
WARC files                CC-MAIN-2020-34/warc.paths.gz                 60000   48.9
WAT files                 CC-MAIN-2020-34/wat.paths.gz                  60000   16.9
WET files                 CC-MAIN-2020-34/wet.paths.gz                  60000    7.56
Robots.txt files          CC-MAIN-2020-34/robotstxt.paths.gz            60000    0.19
Non-200 responses files   CC-MAIN-2020-34/non200responses.paths.gz      60000    1.94
URL index files           CC-MAIN-2020-34/cc-index.paths.gz               302    0.19

The Common Crawl URL Index for this crawl is available at https://index.commoncrawl.org/CC-MAIN-2020-34/. The columnar index has also been updated to include this crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

July 2020 crawl archive now available

The crawl archive for July 2020 is now available! It contains 3.14 billion web pages or 300 TiB of uncompressed content, crawled between July 2nd and 16th. It includes page captures of 1.1 billion URLs unknown in any of our prior crawl archives.

Bug Fixes and Improvements

The URL index fields "redirect" and "mime" were not filled if the corresponding HTTP headers Location and Content-Type were written in lower-case letters or any other non-matching casing. This bug was detected during the crawl and was fixed for 90 out of 100 segments. It also affects the fields "fetch_redirect" and "content_mime_type" in the columnar index. To a minor extent it may affect the detection of character set and content language, as the value of the Content-Type header is used as an additional hint for the detection. Additional information about this bug fix is given in the corresponding issue report.

Archive Location and Download

The July crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2020-29/.

To assist with exploring and using the dataset, we provide gzipped files that list all segments and all WARC, WAT, and WET files.
Prepending either s3://commoncrawl/ or https://data.commoncrawl.org/ to each line yields the S3 or HTTPS path, respectively.

File type                 File list                                    #Files   Total size, compressed (TiB)
Segments                  CC-MAIN-2020-29/segment.paths.gz                100
WARC files                CC-MAIN-2020-29/warc.paths.gz                 60000   62.64
WAT files                 CC-MAIN-2020-29/wat.paths.gz                  60000   22.23
WET files                 CC-MAIN-2020-29/wet.paths.gz                  60000    9.87
Robots.txt files          CC-MAIN-2020-29/robotstxt.paths.gz            60000    0.21
Non-200 responses files   CC-MAIN-2020-29/non200responses.paths.gz      60000    2.52
URL index files           CC-MAIN-2020-29/cc-index.paths.gz               302    0.24

The Common Crawl URL Index for this crawl is available at https://index.commoncrawl.org/CC-MAIN-2020-29/. The columnar index has also been updated to include this crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

May/June 2020 crawl archive now available

The crawl archive for May/June 2020 is now available! It contains 2.75 billion web pages or 255 TiB of uncompressed content, crawled between May 24th and June 7th. It includes page captures of 1.2 billion URLs unknown in any of our prior crawl archives.

Starting with this crawl, the WET files indicate the natural language(s) a text is written in. Language detection is done using Compact Language Detector 2 (CLD2); since August 2018 the detected languages were available only in WARC and WAT files and the URL indexes. They are now also provided in WET files in the WARC header "WARC-Identified-Content-Language". Up to three languages are detected per document and given as a comma-separated list of ISO-639-3 codes. Here is an example WET record fragment:

...
WARC-Identified-Content-Language: isl,eng
Content-Type: text/plain
Content-Length: 10494

Bananabrauð með Nutella – Ljúfmeti og lekkerheit
...
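The codes in the header shown above can be read with a short sketch like this (header names follow the fragment above):

```python
# Sketch: extract the ISO-639-3 codes from the new WET header
# "WARC-Identified-Content-Language" in a block of record headers.
def identified_languages(header_block: str) -> list[str]:
    """Return the language codes listed in the header, or [] if absent."""
    for line in header_block.splitlines():
        name, _, value = line.partition(":")
        if name.strip().lower() == "warc-identified-content-language":
            return [code.strip() for code in value.split(",")]
    return []

headers = (
    "WARC-Identified-Content-Language: isl,eng\n"
    "Content-Type: text/plain\n"
    "Content-Length: 10494"
)
print(identified_languages(headers))  # ['isl', 'eng']
```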

Additional information about this improvement is given in the corresponding issue report.

Archive Location and Download

The May/June crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2020-24/.

To assist with exploring and using the dataset, we provide gzipped files that list all segments and all WARC, WAT, and WET files.
Prepending either s3://commoncrawl/ or https://data.commoncrawl.org/ to each line yields the S3 or HTTPS path, respectively.

File type                 File list                                    #Files   Total size, compressed (TiB)
Segments                  CC-MAIN-2020-24/segment.paths.gz                100
WARC files                CC-MAIN-2020-24/warc.paths.gz                 60000   53.16
WAT files                 CC-MAIN-2020-24/wat.paths.gz                  60000   19.02
WET files                 CC-MAIN-2020-24/wet.paths.gz                  60000    8.42
Robots.txt files          CC-MAIN-2020-24/robotstxt.paths.gz            60000    0.22
Non-200 responses files   CC-MAIN-2020-24/non200responses.paths.gz      60000    2.77
URL index files           CC-MAIN-2020-24/cc-index.paths.gz               302    0.22

The Common Crawl URL Index for this crawl is available at https://index.commoncrawl.org/CC-MAIN-2020-24/. The columnar index has also been updated to include this crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

March/April 2020 crawl archive now available

The crawl archive for March/April 2020 is now available! It contains 2.85 billion web pages or 280 TiB of uncompressed content, crawled between March 28th and April 10th. It includes page captures of 1 billion URLs unknown in any of our prior crawl archives.

Archive Location and Download

The March/April crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2020-16/.

To assist with exploring and using the dataset, we provide gzipped files that list all segments and all WARC, WAT, and WET files.
Prepending either s3://commoncrawl/ or https://data.commoncrawl.org/ to each line yields the S3 or HTTPS path, respectively.

File type                 File list                                    #Files   Total size, compressed (TiB)
Segments                  CC-MAIN-2020-16/segment.paths.gz                100
WARC files                CC-MAIN-2020-16/warc.paths.gz                 56000   62.67
WAT files                 CC-MAIN-2020-16/wat.paths.gz                  56000   20.37
WET files                 CC-MAIN-2020-16/wet.paths.gz                  56000    8.97
Robots.txt files          CC-MAIN-2020-16/robotstxt.paths.gz            56000    0.19
Non-200 responses files   CC-MAIN-2020-16/non200responses.paths.gz      56000    1.39
URL index files           CC-MAIN-2020-16/cc-index.paths.gz               302    0.21

The Common Crawl URL Index for this crawl is available at https://index.commoncrawl.org/CC-MAIN-2020-16/. The columnar index has also been updated to include this crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

February 2020 crawl archive now available

The crawl archive for February 2020 is now available! It contains 2.6 billion web pages or 240 TiB of uncompressed content, crawled between February 16th and 29th. It includes page captures of 1 billion URLs unknown in any of our prior crawl archives.

Improvements and Fixes

The HTTP headers in WARC response records have been fixed: the HTTP response status line now includes a white space following the status code even if the reason-phrase is empty. E.g., if a server sends an empty reason-phrase (instead of “OK”), the status line will include a trailing space character: “HTTP/1.1 200 ”. Following RFC 7230, the white space between the status code and the reason-phrase is mandatory. Please refer to the bug report NUTCH-2763 for further details.
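A small sketch of why the trailing space matters: a parser that splits the status line on single spaces, following the RFC 7230 grammar ("HTTP-version SP status-code SP reason-phrase"), recovers an empty reason-phrase cleanly when the mandatory space is present:

```python
# Sketch: parse an HTTP/1.1 status line per the RFC 7230 grammar.
# With the fix described above, an empty reason-phrase still leaves
# the mandatory space after the status code in place.
def parse_status_line(line: str) -> tuple[str, int, str]:
    """Split a status line into (version, status code, reason-phrase)."""
    version, _, rest = line.partition(" ")
    code, _, reason = rest.partition(" ")
    return version, int(code), reason

print(parse_status_line("HTTP/1.1 200 "))    # ('HTTP/1.1', 200, '')
print(parse_status_line("HTTP/1.1 200 OK"))  # ('HTTP/1.1', 200, 'OK')
```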

Archive Location and Download

The February crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2020-10/.

To assist with exploring and using the dataset, we provide gzipped files that list all segments and all WARC, WAT, and WET files.
Prepending either s3://commoncrawl/ or https://data.commoncrawl.org/ to each line yields the S3 or HTTPS path, respectively.

File type                 File list                                    #Files   Total size, compressed (TiB)
Segments                  CC-MAIN-2020-10/segment.paths.gz                100
WARC files                CC-MAIN-2020-10/warc.paths.gz                 56000   49.28
WAT files                 CC-MAIN-2020-10/wat.paths.gz                  56000   17.98
WET files                 CC-MAIN-2020-10/wet.paths.gz                  56000    7.97
Robots.txt files          CC-MAIN-2020-10/robotstxt.paths.gz            56000    0.22
Non-200 responses files   CC-MAIN-2020-10/non200responses.paths.gz      56000    2.21
URL index files           CC-MAIN-2020-10/cc-index.paths.gz               302    0.2

The Common Crawl URL Index for this crawl is available at https://index.commoncrawl.org/CC-MAIN-2020-10/. The columnar index has also been updated to include this crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

January 2020 crawl archive now available

The crawl archive for January 2020 is now available! It contains 3.1 billion web pages or 300 TiB of uncompressed content, crawled between January 17th and 29th. It includes page captures of 960 million URLs not contained in any crawl archive before.

Improvements and Fixes

  • Date-time values in the column "fetch_time" of the columnar index are now stored using the "int64" data type. For details and compatibility issues, please see cc-index-table#7.
  • WARC request records now show the HTTP protocol version sent with the HTTP request, which can differ from the version received in the HTTP response message; cf. NUTCH-2760.
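As a sketch of what the "fetch_time" change means for consumers, an int64 value can be converted back to a date-time, assuming the value is a timestamp in epoch milliseconds (the common int64 timestamp encoding in Parquet; see the linked cc-index-table issue for the authoritative details):

```python
# Sketch: convert an int64 "fetch_time" value back to a UTC date-time,
# assuming epoch milliseconds (an assumption; check the linked issue).
from datetime import datetime, timezone

fetch_time_ms = 1579478400000  # hypothetical value, not taken from the index

dt = datetime.fromtimestamp(fetch_time_ms / 1000, tz=timezone.utc)
print(dt.isoformat())  # 2020-01-20T00:00:00+00:00
```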

Archive Location and Download

The January crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2020-05/.

To assist with exploring and using the dataset, we provide gzipped files that list all segments and all WARC, WAT, and WET files.
Prepending either s3://commoncrawl/ or https://data.commoncrawl.org/ to each line yields the S3 or HTTPS path, respectively.

File type                 File list                                    #Files   Total size, compressed (TiB)
Segments                  CC-MAIN-2020-05/segment.paths.gz                100
WARC files                CC-MAIN-2020-05/warc.paths.gz                 56000   59.94
WAT files                 CC-MAIN-2020-05/wat.paths.gz                  56000   22.3
WET files                 CC-MAIN-2020-05/wet.paths.gz                  56000   10
Robots.txt files          CC-MAIN-2020-05/robotstxt.paths.gz            56000    0.25
Non-200 responses files   CC-MAIN-2020-05/non200responses.paths.gz      56000    2.28
URL index files           CC-MAIN-2020-05/cc-index.paths.gz               302    0.23

The Common Crawl URL Index for this crawl is available at https://index.commoncrawl.org/CC-MAIN-2020-05/. The columnar index has also been updated to include this crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

December 2019 crawl archive now available

The crawl archive for December 2019 is now available! It contains 2.45 billion web pages or 234 TiB of uncompressed content, crawled between December 5th and 16th. It includes page captures of 850 million URLs not contained in any crawl archive before.

Archive Location and Download

The December crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2019-51/.

To assist with exploring and using the dataset, we provide gzipped files that list all segments and all WARC, WAT, and WET files.
Prepending either s3://commoncrawl/ or https://data.commoncrawl.org/ to each line yields the S3 or HTTPS path, respectively.

File type                 File list                                    #Files   Total size, compressed (TiB)
Segments                  CC-MAIN-2019-51/segment.paths.gz                100
WARC files                CC-MAIN-2019-51/warc.paths.gz                 56000   47.47
WAT files                 CC-MAIN-2019-51/wat.paths.gz                  56000   17.6
WET files                 CC-MAIN-2019-51/wet.paths.gz                  56000    8.06
Robots.txt files          CC-MAIN-2019-51/robotstxt.paths.gz            56000    0.26
Non-200 responses files   CC-MAIN-2019-51/non200responses.paths.gz      56000    3.5
URL index files           CC-MAIN-2019-51/cc-index.paths.gz               302    0.19

The Common Crawl URL Index for this crawl is available at https://index.commoncrawl.org/CC-MAIN-2019-51/. The columnar index has also been updated to include this crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.

October 2019 crawl archive now available

The crawl archive for October 2019 is now available! It contains 3.0 billion web pages or 280 TiB of uncompressed content, crawled between October 13th and 24th. It includes page captures of 1.1 billion URLs not contained in any crawl archive before.

Archive Location and Download

The October crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2019-43/.

To assist with exploring and using the dataset, we provide gzipped files that list all segments and all WARC, WAT, and WET files.
Prepending either s3://commoncrawl/ or https://data.commoncrawl.org/ to each line yields the S3 or HTTPS path, respectively.

File type                 File list                                    #Files   Total size, compressed (TiB)
Segments                  CC-MAIN-2019-43/segment.paths.gz                100
WARC files                CC-MAIN-2019-43/warc.paths.gz                 56000   59.56
WAT files                 CC-MAIN-2019-43/wat.paths.gz                  56000   21.7
WET files                 CC-MAIN-2019-43/wet.paths.gz                  56000    9.94
Robots.txt files          CC-MAIN-2019-43/robotstxt.paths.gz            56000    0.15
Non-200 responses files   CC-MAIN-2019-43/non200responses.paths.gz      56000    1.69
URL index files           CC-MAIN-2019-43/cc-index.paths.gz               302    0.22

The Common Crawl URL Index for this crawl is available at https://index.commoncrawl.org/CC-MAIN-2019-43/. The columnar index has also been updated to include this crawl.

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact [email protected] for sponsorship information.