January 8, 2014

Winter 2013 Crawl Data Now Available

Common Crawl Foundation

The second crawl of 2013 is now available! In late November, we published the data from the first crawl of 2013 (see the previous blog post for more detail on that dataset). The new dataset was collected at the end of 2013, contains approximately 2.3 billion webpages, and is 148TB in size. The new data is located in the commoncrawl bucket at /crawl-data/CC-MAIN-2013-48/.

Data Type                 File List                #Files   Total Size Compressed (TiB)
Segments                  segment.paths.gz            519
WARC                      warc.paths.gz             51900   31.93
WAT                       wat.paths.gz              45195    9.64
WET                       wet.paths.gz              45195    3.36
URL index files           cc-index.paths.gz           302    0.14
Columnar URL index files  cc-index-table.paths.gz     300    0.15
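Each *.paths.gz file above lists one relative path per line; prepending the bucket endpoint turns each line into a downloadable URL. A minimal sketch in Python, assuming the public HTTPS endpoint https://data.commoncrawl.org/ (the segment id and filename in the sample line are illustrative, not real paths):

```python
import gzip
import io

# Hypothetical one-line excerpt of warc.paths.gz (segment id and
# filename are made up for illustration).
paths_gz = gzip.compress(
    b"crawl-data/CC-MAIN-2013-48/segments/0001/warc/example-00000.warc.gz\n"
)

# Paths are relative to the commoncrawl bucket; prepend the public
# HTTPS endpoint (or s3://commoncrawl/) to fetch each file.
BASE = "https://data.commoncrawl.org/"

with gzip.open(io.BytesIO(paths_gz), "rt") as f:
    urls = [BASE + line.strip() for line in f if line.strip()]

print(urls[0])
```

The same pattern works for the WAT, WET, and index path listings.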

In 2013, we made changes to our crawling and post-processing systems. As detailed in the previous blog post, we switched file formats to the international standard WARC and WAT files. We also began using Apache Nutch to crawl – stay tuned for an upcoming blog post on our use of Nutch. The new crawling method relies heavily on the generous data donations from blekko, and we are extremely grateful for their ongoing support!

In 2014 we plan to crawl much more frequently and publish fresh datasets at least once a month.


Erratum: 

Charset Detection Bug in WET Records

Originally reported by: 
Javier de la Rosa

The charset detection required to properly transform non-UTF-8 HTML pages in WARC files into WET records didn't work before November 2016 due to a bug in IIPC Web Archive Commons (see the related issue in the CC fork of Apache Nutch). There should be significantly fewer errors in all subsequent crawls. Originally discussed in Google Groups.
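The kind of transformation that was affected can be illustrated with a small sketch (this is not the Web Archive Commons code itself): decode an HTML payload using the charset declared in its Content-Type header, falling back to UTF-8 when the declaration is missing or wrong. When this step fails, non-ASCII characters in the extracted WET text come out mangled.

```python
import re

def decode_html(payload: bytes, content_type: str) -> str:
    """Decode an HTML payload using the charset from the Content-Type
    header, falling back to UTF-8 with replacement characters."""
    match = re.search(r"charset=([\w-]+)", content_type, re.IGNORECASE)
    encoding = match.group(1) if match else "utf-8"
    try:
        return payload.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown or incorrect charset declaration: fall back to UTF-8.
        return payload.decode("utf-8", errors="replace")

# A Latin-1 page: without charset handling, the byte 0xE9 ("é")
# would not survive extraction into a UTF-8 WET record.
text = decode_html("résumé".encode("latin-1"),
                   "text/html; charset=ISO-8859-1")
print(text)  # résumé
```

Real-world pages also declare charsets in `<meta>` tags or omit them entirely, which is why a full detection library is needed in practice.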

Erratum: 

Missing Language Classification

Originally reported by: 

Starting with crawl CC-MAIN-2018-39, we added a language classification field (‘content-languages’) to the columnar indexes, WAT files, and WARC metadata of all subsequent crawls. The field is produced by the CLD2 classifier and lists up to three languages per document, using ISO-639-3 (three-character) language codes.
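A quick sketch of reading such a field from an index record, assuming a simplified JSON blob with comma-separated ISO-639-3 codes (the record contents here are illustrative, not taken from a real index):

```python
import json

# Simplified, illustrative index record; in the real URL index the JSON
# blob follows a SURT-formatted key and a timestamp on each line.
record = '{"url": "http://example.com/", "languages": "eng,fra"}'

# The language field holds up to three comma-separated ISO-639-3 codes,
# ordered by the classifier's confidence.
langs = json.loads(record).get("languages", "").split(",")
print(langs)  # ['eng', 'fra']
```

In the columnar (Parquet) index the same information can be filtered on directly, e.g. to select only documents classified as a given language.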