We are pleased to announce the release of a host-level web graph of recent monthly crawls (February, March, and April 2017). The graph consists of 385 million nodes and 2.5 billion edges. The development of this graph produced:
- a ranked list of hosts to expand the crawl frontier;
- pages ranked by Harmonic Centrality with less influence from spam, among other attributes (for comparison we include PageRank);
- the template/process for Common Crawl to produce graphs and page rankings at regular intervals.
We produced this graph, and intend to produce similar graphs going forward, because the Common Crawl community has expressed a strong interest in using Common Crawl data for graph processing, particularly with respect to:
- web graph and page rankings produced by Common Search in 2016;
- the Hyperlink Graph data set produced in 2013 by Web Data Commons (WDC);
- the "WWW Ranking" from WDC, along with a second set of hyperlink graphs based on crawl data from April 2014.
* Please note: the graph includes dangling nodes, i.e. hosts that have not been crawled but are pointed to by a link on a crawled page. Seventeen percent of the hosts (65 million) were crawled in at least one of the three monthly crawls; the remaining 320 million hosts are known only from links. (Host names are not fully verified: obviously invalid host names are skipped, but the rest are not resolved in DNS.)
Extraction of links and construction of the graph
The host names are reversed and a leading www. is stripped: www.subdomain.example.com becomes com.example.subdomain. Node IDs are assigned sequentially to the node list sorted by reversed host name. This keeps links between hosts of the same domain, or within the same country-code top-level domain, close together and allows for efficient delta compression of the edges. The extraction is done in three steps:
- links are extracted, reduced to host-level links, and stored as 〈reversed host from, reversed host to〉 pairs;
- host names are mapped to numeric IDs and edges are stored as 〈from id, to id〉 pairs;
- ranks are computed.
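The first two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the actual pipeline (which runs on PySpark at much larger scale); the sample hosts and link pairs are invented for the example.

```python
def reverse_host(host):
    """Reverse a host name and strip a leading 'www.':
    'www.subdomain.example.com' -> 'com.example.subdomain'
    """
    if host.startswith("www."):
        host = host[len("www."):]
    return ".".join(reversed(host.split(".")))

# Step 1 (sketch): host-level link pairs <reversed host from, reversed host to>.
# These example hosts are hypothetical.
links = [
    (reverse_host("www.subdomain.example.com"), reverse_host("example.org")),
    (reverse_host("example.org"), reverse_host("news.example.com")),
]

# Step 2: assign sequential IDs over the sorted reversed-host list, so hosts
# of the same domain (and ccTLD) get adjacent IDs -- which keeps edge ID
# deltas small and makes the edge list compress well.
hosts = sorted({h for pair in links for h in pair})
host_id = {h: i for i, h in enumerate(hosts)}
edges = [(host_id[a], host_id[b]) for a, b in links]
```

Sorting by reversed host name is the key trick: `com.example.news` and `com.example.subdomain` end up adjacent, whereas the unreversed names would sort far apart.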
Hosts ranked by Harmonic Centrality and PageRank
We provide a list of nodes (host names) ranked by Harmonic Centrality and by PageRank.
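To make the Harmonic Centrality metric concrete, here is a minimal pure-Python sketch (assuming an edge list of integer node IDs; this is an illustration, not the implementation used to build the rankings, which relies on the WebGraph framework):

```python
from collections import deque, defaultdict

def harmonic_centrality(edges, nodes):
    """Harmonic Centrality: H(v) = sum over u != v of 1 / d(u, v),
    where d(u, v) is the shortest directed path length from u to v
    and unreachable pairs contribute 0.
    """
    out = defaultdict(list)
    for a, b in edges:
        out[a].append(b)
    score = {v: 0.0 for v in nodes}
    for u in nodes:
        # BFS from u; every node reached at distance d gains 1/d.
        dist = {u: 0}
        queue = deque([u])
        while queue:
            x = queue.popleft()
            for y in out[x]:
                if y not in dist:
                    dist[y] = dist[x] + 1
                    score[y] += 1.0 / dist[y]
                    queue.append(y)
    return score

# Toy graph: 0 -> 1 -> 2 and 0 -> 2
scores = harmonic_centrality([(0, 1), (1, 2), (0, 2)], [0, 1, 2])
```

Because each distinct linking host contributes at most 1 (and distant hosts much less), a tightly interlinked spam farm inflates Harmonic Centrality far less than it can inflate link-counting metrics, which matches the "less influence from spam" observation above.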
Data and download instructions
The host-level graph as well as the rankings are placed on AWS S3 at the path:
Alternatively, you can use:
as a prefix to access the files from anywhere.
Download files of the Common Crawl Feb/Mar/Apr 2017 host-level webgraph
We hope the data will be useful for your research on ranking, graph analysis, link spam detection, and more. Let us know about your results via Common Crawl's Google Group!
Credits
- Web Data Commons, for their web graph data set and everything related.
- Common Search; we first used their web graph to expand the crawler frontier, and Common Search's cosr-back project was an important source of inspiration for how to process our data using PySpark.
- the authors of the WebGraph framework, whose software simplifies the computation of rankings.