March 1, 2019

February 2019 crawl archive now available

The crawl archive for February 2019 is now available! It contains 2.9 billion web pages or 225 TiB of uncompressed content, crawled between February 15th and 24th.

Sebastian Nagel

Sebastian is a Distinguished Engineer with Common Crawl.

Data Type	File List	#Files	Total Size Compressed (TiB)
Segments	segment.paths.gz	100
WARC	warc.paths.gz	64000	59.86
WAT	wat.paths.gz	64000	18.23
WET	wet.paths.gz	64000	7.62
Robots.txt files	robotstxt.paths.gz	64000	0.17
Non-200 responses	non200responses.paths.gz	64000	1.79
URL index files	cc-index.paths.gz	302	0.21
Columnar URL index files	cc-index-table.paths.gz	900	0.26
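
For planning downloads, the table implies a rough average size per file. The following is a minimal sketch of that arithmetic, using only the file counts and compressed sizes listed above; the names `ARCHIVE_STATS` and `avg_file_size_mib` are illustrative and not part of any Common Crawl tooling.

```python
# File counts and total compressed sizes (TiB), copied from the table above.
ARCHIVE_STATS = {
    "WARC": (64000, 59.86),
    "WAT": (64000, 18.23),
    "WET": (64000, 7.62),
}


def avg_file_size_mib(n_files, total_tib):
    """Average compressed size of one file in MiB (1 TiB = 1024 * 1024 MiB)."""
    return total_tib * 1024 * 1024 / n_files


for name, (n_files, total_tib) in ARCHIVE_STATS.items():
    size = avg_file_size_mib(n_files, total_tib)
    print(f"{name}: about {size:.0f} MiB per file on average")
```

By this estimate an average WARC file is a little under 1 GiB compressed; individual files will differ from the average.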

The February crawl contains page captures of 750 million URLs not contained in any previous crawl archive. New URLs are sampled from the following sources, based on the host and domain ranks (harmonic centrality) published as part of the Nov/Dec/Jan 2018/2019 webgraph data set:

sitemaps, RSS and Atom feeds
a breadth-first side crawl within a maximum of 5 links (“hops”) away from the homepages of the top 50 million hosts and domains
a random sample of outlinks taken from WAT files of the January crawl

The number of sampled URLs per domain depends on the domain's harmonic centrality rank in the webgraph data set: higher-ranking domains are allowed to “contribute” more URLs.

The way our crawler handles politeness limits per host and/or pay-level domain has been improved:

First, limits are now configurable and are based on the harmonic centrality rank of a domain.

Second, we now also put a limit on the number of hosts/subdomains per domain. This limit is also based on the domain rank and ranges from 500,000 subdomains for top-ranking domains (think of blogspot.com) to fewer than 100 for low-ranking domains.

While the number of hosts covered in the February crawl dropped to 50 million from 60 million in January, we see a positive impact on the total number of pages crawled for large domains. Technically, every host requires a DNS lookup and a robots.txt fetch, even if only a single page is fetched from that host, so the crawler performs better when resources are focused on a few hundred thousand subdomains rather than spread over millions of hosts. We also hope that a limit on the number of hosts per domain makes the crawler more robust against link spam. The set of sampled subdomains for large domains will vary from month to month to ensure good overall coverage when multiple monthly crawls are combined.
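
To make the idea of a rank-based host limit concrete, here is a purely illustrative sketch. It is not the crawler's actual configuration: the log-scale interpolation, the floor of 100 hosts (the text above only says "less than 100"), and the names `host_limit` and `sample_hosts` are all assumptions made for the example. Only the 500,000 upper bound comes from this post.

```python
import math
import random

TOP_LIMIT = 500_000  # from the post: host limit for top-ranking domains
LOW_LIMIT = 100      # assumption: stand-in for "less than 100" at the low end


def host_limit(rank, n_domains):
    """Hypothetical limit: interpolate on a log scale between rank 1 and the last rank."""
    if not 1 <= rank <= n_domains:
        raise ValueError("rank must be between 1 and n_domains")
    if n_domains == 1:
        return TOP_LIMIT
    frac = math.log(rank) / math.log(n_domains)  # 0.0 at rank 1, 1.0 at the last rank
    log_limit = (1 - frac) * math.log(TOP_LIMIT) + frac * math.log(LOW_LIMIT)
    return round(math.exp(log_limit))


def sample_hosts(hosts, limit, crawl_id):
    """Pick at most `limit` hosts; seeding with the crawl ID varies the sample per crawl."""
    hosts = sorted(hosts)
    if len(hosts) <= limit:
        return hosts
    return random.Random(crawl_id).sample(hosts, limit)


print(host_limit(1, 1_000_000), host_limit(1_000_000, 1_000_000))
```

Seeding the sampler with a per-crawl identifier is one simple way to get a different subset of subdomains each month, so that combining several monthly crawls covers more hosts than any single crawl.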

Archive Location and Download

The February crawl archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2019-09/.

To assist with exploring and using the dataset, we provide gzipped files which list all segments, WARC, WAT and WET files. By prepending either s3://commoncrawl/ or https://data.commoncrawl.org/ to each line, you get the S3 and HTTP paths, respectively.
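
The prefixing step can be sketched in a few lines of Python. This is a minimal example, not official tooling: the function names are illustrative, and the sample line is a shortened placeholder rather than a real file path from the listing.

```python
import gzip
import io

S3_PREFIX = "s3://commoncrawl/"
HTTP_PREFIX = "https://data.commoncrawl.org/"


def read_paths_gz(raw_bytes):
    """Decompress the bytes of a *.paths.gz listing and return its lines."""
    with gzip.open(io.BytesIO(raw_bytes), "rt", encoding="utf-8") as fh:
        return fh.read().splitlines()


def expand_paths(lines, prefix):
    """Prepend `prefix` to every non-empty line of a paths listing."""
    return [prefix + line.strip() for line in lines if line.strip()]


# Placeholder listing standing in for a downloaded file such as warc.paths.gz.
sample = gzip.compress(b"crawl-data/CC-MAIN-2019-09/PLACEHOLDER.warc.gz\n")
lines = read_paths_gz(sample)
print(expand_paths(lines, S3_PREFIX))
print(expand_paths(lines, HTTP_PREFIX))
```

In practice, `raw_bytes` would be the content of one of the file lists from the table above after downloading it from the archive location.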

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2019-09/. The columnar index has also been updated to contain this crawl.
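
As a rough sketch of how the URL index might be queried programmatically, the example below builds a lookup URL and parses the response. It rests on assumptions not stated in this post: that the index server exposes a CDX-style endpoint at `CC-MAIN-2019-09-index` accepting `url`, `output` and `limit` parameters, and that it returns one JSON object per line. The function names are illustrative.

```python
import json
import urllib.parse
import urllib.request

# Assumption: CDX-style query endpoint for this crawl on the index server.
INDEX_ENDPOINT = "https://index.commoncrawl.org/CC-MAIN-2019-09-index"


def build_query_url(url_pattern, limit=5):
    """Build a lookup URL for captures matching `url_pattern`."""
    params = {"url": url_pattern, "output": "json", "limit": limit}
    return INDEX_ENDPOINT + "?" + urllib.parse.urlencode(params)


def fetch_captures(url_pattern, limit=5):
    """Fetch matching captures (needs network access; response format is assumed)."""
    with urllib.request.urlopen(build_query_url(url_pattern, limit)) as resp:
        body = resp.read().decode("utf-8")
    return [json.loads(line) for line in body.splitlines() if line]


print(build_query_url("commoncrawl.org/*"))
```

Only `build_query_url` runs here; `fetch_captures` is left uncalled because it depends on the server being reachable and on the assumed response format.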

Please donate to Common Crawl if you appreciate our free datasets! We’re also seeking corporate sponsors to partner with Common Crawl for our non-profit work in open data. Please contact info@commoncrawl.org for sponsorship information.
