February 8, 2018

Host- and Domain-Level Web Graphs Nov/Dec/Jan 2017-2018

Note: this post has been marked as obsolete.
Sebastian Nagel
Sebastian is a Distinguished Engineer with Common Crawl.

We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of November, December 2017 and January 2018. These graphs, along with ranked lists of hosts and domains, follow the prior web graph releases (Feb/Mar/Apr 2017, May/Jun/Jul 2017 and Aug/Sep/Oct 2017). Additional information about data formats, the processing pipeline, our objectives, and credits can be found in the preceding announcements.

Please note that the first version (released 2018-02-08, withdrawn 2018-02-21) contained only links from the January 2018 crawl; see the notice on the Common Crawl user group. A fix was provided on 2018-02-28: the graphs and rankings now contain all links, hosts and domains from all three crawls. We also continue to provide the erroneously released graphs and rankings of the January 2018 crawl.

What's new?

Here is a summary of notable aspects and changes of this web graph release:

  • a bug has been fixed that prevented protocol-relative links pointing to a different host (//www.example.com/index.html) from being added as edges of the host- and domain-level web graphs
  • the domain graph now contains the number of hosts per domain as an additional column in the vertices and rankings files
  • the naming scheme has changed – the release name is now part of the file name
  • WebGraph offset files are no longer released; they can be created by running:
    java it.unimi.dsi.webgraph.BVGraph -O -L cc-main-2017-18-nov-dec-jan-host
    java it.unimi.dsi.webgraph.BVGraph -O -L cc-main-2017-18-nov-dec-jan-domain
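To illustrate the first item in the list above: a link of the form //www.example.com/index.html (often called protocol-relative or scheme-relative) inherits only the scheme from the page it appears on, so it can point to a different host. The following minimal sketch uses Python's standard urllib.parse to show the difference; the page URL and the helper name link_host are made up for illustration, and this is not the code used to build the graphs.

```python
from urllib.parse import urljoin, urlsplit


def link_host(base_url: str, href: str) -> str:
    """Resolve href against the page URL and return the target host name."""
    return urlsplit(urljoin(base_url, href)).hostname or ""


# Hypothetical page on which both links are found.
page = "https://blog.example.org/post/1.html"

# Path-relative link: stays on the host of the page.
print(link_host(page, "/about.html"))

# Scheme-relative link: keeps the scheme but switches to another host,
# so it contributes an edge between two different hosts.
print(link_host(page, "//www.example.com/index.html"))
```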

Host-level graph

The graph consists of 2.75 billion nodes and 8.6 billion edges. It includes dangling nodes, i.e. hosts that have not been crawled but are pointed to from a link on a crawled page. There are 2.67 billion dangling nodes (97%), and the largest strongly connected component contains only 65 million nodes (2.3%). Host names are reversed and a leading www. is stripped: www.subdomain.example.com becomes com.example.subdomain.
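The host name normalization described above can be sketched in a few lines of Python. The function name reverse_host is hypothetical, and the sketch covers only the two steps named in the text (stripping a leading www. and reversing the labels); any further normalization in the actual pipeline is not shown.

```python
def reverse_host(host: str) -> str:
    """Strip a leading 'www.' and reverse the dot-separated labels."""
    if host.startswith("www."):
        host = host[len("www."):]
    return ".".join(reversed(host.split(".")))


print(reverse_host("www.subdomain.example.com"))
```

Storing names in reversed form means that hosts of the same domain sort next to each other.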

You can download the graph and the ranks of all 2.75 billion hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2017-18-nov-dec-jan/host/. Alternatively, you can use https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2017-18-nov-dec-jan/host/ as a prefix to access the files from anywhere.

The following files and formats are provided:

Download files of the Common Crawl Nov/Dec/Jan 2017-18 host-level Webgraph

Size | File | Description
15.9 GB | cc-main-2017-18-nov-dec-jan-host-vertices.paths.gz | nodes ⟨id, rev host⟩, paths of 28 vertices files
40.0 GB | cc-main-2017-18-nov-dec-jan-host-edges.paths.gz | edges ⟨from_id, to_id⟩, paths of 28 edges files
16.4 GB | cc-main-2017-18-nov-dec-jan-host.graph | graph in BVGraph format
2 kB | cc-main-2017-18-nov-dec-jan-host.properties |
24.2 GB | cc-main-2017-18-nov-dec-jan-host-t.graph | transpose of the graph (outlinks inverted to inlinks)
2 kB | cc-main-2017-18-nov-dec-jan-host-t.properties |
1 kB | cc-main-2017-18-nov-dec-jan-host.stats | WebGraph statistics
38.1 GB | cc-main-2017-18-nov-dec-jan-host-ranks.txt.gz | harmonic centrality and pagerank
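As a rough illustration of how the vertices ⟨id, rev host⟩ and edges ⟨from_id, to_id⟩ files fit together, the sketch below reads both from gzip-compressed bytes and prints each edge with host names instead of ids. It assumes one record per line with tab-separated fields, which is not stated in this post (the data formats are described in the preceding announcements); the function names and the tiny sample data are made up.

```python
import gzip
import io


def read_vertices(raw: bytes) -> dict:
    """Map node id -> reversed host name (assumes 'id<TAB>rev_host' lines)."""
    names = {}
    with gzip.open(io.BytesIO(raw), "rt", encoding="utf-8") as lines:
        for line in lines:
            node_id, rev_host = line.rstrip("\n").split("\t")[:2]
            names[int(node_id)] = rev_host
    return names


def read_edges(raw: bytes) -> list:
    """Return (from_id, to_id) pairs (assumes whitespace-separated ids)."""
    edges = []
    with gzip.open(io.BytesIO(raw), "rt", encoding="utf-8") as lines:
        for line in lines:
            from_id, to_id = line.split()
            edges.append((int(from_id), int(to_id)))
    return edges


# Tiny in-memory stand-ins for real part files (hypothetical data).
vertices_gz = gzip.compress(
    b"0\tcom.example\n1\tcom.example.subdomain\n2\torg.example\n"
)
edges_gz = gzip.compress(b"0\t1\n0\t2\n2\t0\n")

names = read_vertices(vertices_gz)
for src, dst in read_edges(edges_gz):
    print(f"{names[src]} -> {names[dst]}")
```

For the real host-level files, which hold billions of records, an in-memory dictionary like this would not be practical; the sketch is only meant to show the relationship between the two file types.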

Domain-level graph

The domain graph was built by aggregating the host graph at the level of pay-level domains (PLDs). The extraction of PLDs is based on the public suffix list from publicsuffix.org. Only suffixes from the "ICANN" division are accepted; those from the "private" division are not (cf. section "divisions" in the documentation on publicsuffix.org). For example, foo.blogspot.com and a host name below s3.amazonaws.com are not accepted as pay-level domains; they are aggregated as the domains blogspot.com and amazonaws.com, respectively, and stored in reversed form (e.g. com.blogspot).
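The aggregation rule can be sketched as follows: find the longest matching suffix from the ICANN division and keep exactly one more label in front of it. The suffix set below is a tiny hand-picked subset chosen for illustration, and the function names are hypothetical; the real public suffix list is much larger and also contains wildcard and exception rules that this sketch ignores.

```python
# Toy subset of ICANN-division suffixes (assumption for illustration only).
# "blogspot.com" is deliberately absent because it belongs to the private
# division, which is not accepted.
ICANN_SUFFIXES = {"com", "org", "uk", "co.uk"}


def pay_level_domain(host: str) -> str:
    """Return the suffix plus one label, using the longest matching suffix."""
    labels = host.lower().split(".")
    # Starting at i = 0 tests the longest candidate suffix first.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in ICANN_SUFFIXES:
            if i == 0:
                raise ValueError(f"{host!r} is itself a public suffix")
            return ".".join(labels[i - 1:])
    raise ValueError(f"no known suffix for {host!r}")


def reversed_pld(host: str) -> str:
    """Pay-level domain in the reversed form used in the graph files."""
    return ".".join(reversed(pay_level_domain(host).split(".")))


print(pay_level_domain("foo.blogspot.com"), reversed_pld("foo.blogspot.com"))
print(pay_level_domain("www.example.co.uk"), reversed_pld("www.example.co.uk"))
```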

The domain-level graph has 94 million nodes and 1.44 billion edges. 56 million nodes (59%) are dangling nodes, and the largest strongly connected component covers 33 million nodes (35%).

All files related to the domain graph are available on AWS S3 under s3://commoncrawl/projects/hyperlinkgraph/cc-main-2017-18-nov-dec-jan/domain/ and via HTTPS under https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2017-18-nov-dec-jan/domain/.

Download files of the Common Crawl Nov/Dec/Jan 2017-18 domain-level Webgraph

Graphs of January 2018 Crawl

We erroneously released web graphs and rankings of a single monthly crawl (January 2018) instead of a quarterly release covering three crawls. To ensure reproducibility, we have preserved the erroneous release. The host-level graph consists of 775 million nodes and 2.7 billion edges. It includes dangling nodes, i.e. hosts that have not been crawled but are pointed to from a link on a crawled page. There are 719 million dangling nodes (93%).
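A dangling-node share like the one quoted above can be derived from an edge list. The sketch below assumes that a dangling node can be identified as a node without any outgoing edge, which is the usual graph-theoretic reading; the post itself only describes dangling nodes as hosts that were linked to but not crawled. The function name and the toy graph are made up.

```python
def dangling_nodes(num_nodes: int, edges: list) -> list:
    """Return the ids of nodes that never appear as the source of an edge."""
    has_outlink = [False] * num_nodes
    for src, _dst in edges:
        has_outlink[src] = True
    return [node for node in range(num_nodes) if not has_outlink[node]]


# Toy graph: nodes 0 and 1 were "crawled" and link out,
# nodes 2 and 3 are only ever link targets.
edges = [(0, 1), (0, 2), (1, 0), (1, 3)]
dangling = dangling_nodes(4, edges)
share = len(dangling) / 4
print(dangling, f"{share:.0%}")
```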

Download files of the Common Crawl Jan 2018 host-level Webgraph

Size | File | Description
4.84 GB | cc-main-2018-jan-host-vertices.txt.gz | nodes ⟨id, rev host⟩
10.21 GB | cc-main-2018-jan-host-edges.txt.gz | edges ⟨from_id, to_id⟩
4.90 GB | cc-main-2018-jan-host.graph | graph in BVGraph format
2 kB | cc-main-2018-jan-host.properties |
5.94 GB | cc-main-2018-jan-host-t.graph | transpose of the graph (outlinks mapped to inlinks)
2 kB | cc-main-2018-jan-host-t.properties |
1 kB | cc-main-2018-jan-host.stats | WebGraph statistics
10.79 GB | cc-main-2018-jan-host-ranks.txt.gz | harmonic centrality and pagerank

The domain-level graph has 70 million nodes and 835 million edges. 42 million nodes (60%) are dangling nodes, and the largest strongly connected component covers 22 million nodes (31%).

Download files of the Common Crawl Jan 2018 domain-level webgraph

Size | File | Description
0.49 GB | cc-main-2018-jan-domain-vertices.txt.gz | nodes ⟨id, rev domain, num hosts⟩
3.30 GB | cc-main-2018-jan-domain-edges.txt.gz | edges ⟨from_id, to_id⟩
1.80 GB | cc-main-2018-jan-domain.graph | graph in BVGraph format
2 kB | cc-main-2018-jan-domain.properties |
1.89 GB | cc-main-2018-jan-domain-t.graph | transpose of the graph
2 kB | cc-main-2018-jan-domain-t.properties |
1 kB | cc-main-2018-jan-domain.stats | WebGraph statistics
1.46 GB | cc-main-2018-jan-domain-ranks.txt.gz | harmonic centrality and pagerank
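The ranks files listed above contain harmonic centrality and pagerank values. For readers unfamiliar with harmonic centrality, here is a minimal sketch on a toy graph: the score of a node is the sum of 1/d over all other nodes that can reach it, where d is the shortest-path distance along links. The sketch runs one breadth-first search per node over the transposed graph (inlinks), which is only feasible for small graphs; it is not how the released ranks were computed, and the function name and example edges are made up.

```python
from collections import deque


def harmonic_centrality(num_nodes: int, edges: list) -> list:
    """Sum of 1/distance from every node that can reach the target."""
    # Transposed adjacency list: for each node, the nodes linking to it.
    inlinks = [[] for _ in range(num_nodes)]
    for src, dst in edges:
        inlinks[dst].append(src)

    scores = []
    for target in range(num_nodes):
        dist = {target: 0}
        queue = deque([target])
        total = 0.0
        # Breadth-first search backwards along the links.
        while queue:
            node = queue.popleft()
            for pred in inlinks[node]:
                if pred not in dist:
                    dist[pred] = dist[node] + 1
                    total += 1.0 / dist[pred]
                    queue.append(pred)
        scores.append(total)
    return scores


# Toy graph: 0 -> 1 -> 2 and 3 -> 2.
edges = [(0, 1), (1, 2), (3, 2)]
scores = harmonic_centrality(4, edges)
print(scores)
```

Nodes that nothing links to (such as 0 and 3 in this example) receive a score of zero, while nodes reachable from many others over short paths score highest.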

Credits

Thanks to the authors of the WebGraph framework, whose software made the computation of graph properties and ranks possible. We hope the data will be useful for you to do any kind of research on ranking, graph analysis, link spam detection, etc. Let us know about your results via Common Crawl's Google Group!
