We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of August, September, and October 2017. These graphs, along with ranked lists of hosts and domains, follow the first (February, March, April 2017) and second (May, June, July 2017) web graph releases. Additional information about data formats, the processing pipeline, our objectives, and credits can be found in a prior announcement.
What's new?
Here is a summary of notable aspects of this web graph release:
- Tools and scripts to produce the web graph and rank the graph vertices are released as part of the project "cc-webgraph" on GitHub.
- Compared to prior web graphs, the large size of this host-level graph (5.1 billion hosts) causes two changes:
  - The text dump of the graph is split into multiple files.
  - There is no PageRank calculation at this time. At present, we provide ranking by harmonic centrality, and hope to add PageRank values in the upcoming weeks.
  - Update Feb 7, 2018: the host-level ranks file now also contains the PageRank values. Thanks to Sebastiano Vigna, one of the authors of the WebGraph framework, for the kind support!
- For the domain-level graph, we provide ranking by both harmonic centrality and PageRank.
- The host-level graph contains a significant portion of hosts related to link spam clusters (possibly 50% or more of the hosts). This data set, therefore, is a useful tool for the study of link spam; from it, we have identified 300,000 spam domains. 2.25 billion hosts in the host-level webgraph belong to these domains. However, in the October crawl archive, these domains comprise less than 2% of the crawled HTML pages (56 million pages out of 3.6 billion) and less than 0.3% of the crawled domains (70,000 out of 26 million). We will start to penalize pages from these domains going forward.
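To make the ranking criterion mentioned above more concrete, here is a minimal sketch of harmonic centrality on a toy graph. It assumes the usual definition, in which a node's score is the sum of the reciprocals of the shortest-path distances from every node that can reach it. The function name `harmonic_centrality` and the toy edge list are illustrative only; the released ranks were computed with the WebGraph framework, not with this code, and a plain breadth-first search per node would not scale to billions of hosts.

```python
from collections import deque


def harmonic_centrality(edges):
    """Return {node: sum of 1/d(y, node) over all nodes y that can reach node}."""
    # Store predecessors so that a BFS started at a node walks links backwards.
    nodes = set()
    incoming = {}
    for src, dst in edges:
        nodes.update((src, dst))
        incoming.setdefault(dst, []).append(src)

    scores = {}
    for target in nodes:
        dist = {target: 0}
        queue = deque([target])
        total = 0.0
        while queue:
            node = queue.popleft()
            for pred in incoming.get(node, ()):
                if pred not in dist:
                    dist[pred] = dist[node] + 1
                    total += 1.0 / dist[pred]
                    queue.append(pred)
        scores[target] = total
    return scores


# Toy graph (hypothetical): a -> b -> c and d -> c
toy_edges = [("a", "b"), ("b", "c"), ("d", "c")]
scores = harmonic_centrality(toy_edges)
```

In this toy graph, `c` is reached from `b` and `d` at distance 1 and from `a` at distance 2, so it scores highest, while `a` and `d` receive no links and score zero.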
Host-level graph
The graph consists of 5.1 billion nodes and 18.8 billion edges. The graph includes dangling nodes, i.e., hosts that have not been crawled but are pointed to by a link on a crawled page. The host names are reversed and a leading www. is stripped: www.subdomain.example.com becomes com.example.subdomain.
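The host name notation described above can be reproduced with a few lines of Python. This is only a sketch of the transformation as stated in the text; the function names `reverse_host` and `unreverse_host` are made up for illustration, and the actual cc-webgraph tools may handle further edge cases.

```python
def reverse_host(host):
    """Strip one leading 'www.' and reverse the order of the dot-separated labels."""
    if host.startswith("www."):
        host = host[len("www."):]
    return ".".join(reversed(host.split(".")))


def unreverse_host(reversed_host):
    """Turn a reversed host name back into the usual order (the 'www.' is not restored)."""
    return ".".join(reversed(reversed_host.split(".")))


example = reverse_host("www.subdomain.example.com")
```

One likely benefit of the reversed form is that hosts belonging to the same domain sort next to each other, which makes prefix lookups in the sorted text files straightforward.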
You can download the graph and the ranks of all 5.1 billion hosts from AWS S3 under the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2017-aug-sep-oct/hostgraph/. Alternatively, you can use https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2017-aug-sep-oct/hostgraph/ as a prefix to access the files from anywhere.
The following files and formats are provided:
To download the graph in text format, you need to download all files listed in the two path listings.
Domain-level graph
The domain graph was built by aggregating the host graph on the level of pay-level domains (PLDs). The extraction of PLDs is based on the public suffix list from publicsuffix.org. Only "ICANN" domains are accepted; "private" domains are not accepted (cf. section "divisions" in the documentation on publicsuffix.org). For example, foo.blogspot.com and commoncrawl.s3.amazonaws.com are not accepted as pay-level domains; they are aggregated into the domains blogspot.com and amazonaws.com, respectively.
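The aggregation rule above can be sketched as follows. The suffix set `ICANN_SUFFIXES` is a tiny hand-picked excerpt used only for illustration, the function name `pay_level_domain` and the host `bucket.s3.amazonaws.com` are hypothetical, and wildcard and exception rules of the real public suffix list are deliberately left out. It is not the implementation used to build the released graph.

```python
# Illustrative excerpt only: the real list on publicsuffix.org is far longer.
# Private-division entries such as blogspot.com are left out on purpose,
# mirroring the "ICANN only" rule described in the text.
ICANN_SUFFIXES = {"com", "org", "co.uk"}


def pay_level_domain(host):
    """Return the public suffix plus one label, or None if no suffix matches."""
    labels = host.lower().split(".")
    # Starting at i = 0 tries the longest candidate suffix first.
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in ICANN_SUFFIXES:
            if i == 0:
                return None  # the host is itself a public suffix
            return ".".join(labels[i - 1:])
    return None


blog_domain = pay_level_domain("foo.blogspot.com")
bucket_domain = pay_level_domain("bucket.s3.amazonaws.com")
```

Because the private suffixes are ignored, both hosts collapse to the domain directly below `com`, which matches the aggregation behaviour described above.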
The domain-level graph has 93 million nodes and 1,258 million edges. 56 million nodes (60%) are dangling nodes. The largest strongly connected component covers 31 million nodes (33%).
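As a rough illustration of how such a dangling-node share can be derived from an edge list, the sketch below treats every node without outgoing edges as dangling. That reading, the function name `dangling_share`, and the toy edges are assumptions made for this example; the figures above were not produced with this code.

```python
def dangling_share(edges):
    """Return (dangling count, node count, dangling fraction) for an edge list."""
    sources = {src for src, _ in edges}
    nodes = sources | {dst for _, dst in edges}
    # Nodes that never appear as a link source have no outgoing edges.
    dangling = nodes - sources
    return len(dangling), len(nodes), len(dangling) / len(nodes)


# Toy domain graph in reversed notation (hypothetical data).
toy_edges = [
    ("com.example", "org.example"),
    ("com.example", "net.example"),
    ("org.example", "com.example"),
]
count, total, share = dangling_share(toy_edges)
```

Here only `net.example` has no outgoing edge, so one of the three nodes is dangling.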
All files related to the domain graph are available on AWS S3 under s3://commoncrawl/projects/hyperlinkgraph/cc-main-2017-aug-sep-oct/domaingraph/ or, via HTTPS, under https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2017-aug-sep-oct/domaingraph/.
Download files of the Common Crawl Aug/Sept/Oct 2017 domain-level webgraph
Credits
Thanks to the authors of the WebGraph framework, whose software made the computation of graph properties and ranks possible.
We hope the data will be useful for research on ranking, graph analysis, link spam detection, and related topics. Let us know about your results via Common Crawl's Google Group!