Web Graphs

Common Crawl regularly releases host- and domain-level graphs for visualising the crawl data.

Hostnames in the graph are in reverse domain name notation, and all types of links are listed, including purely “technical” links pointing to images, JavaScript libraries, web fonts, etc.
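As an illustration of reverse domain name notation, the following minimal sketch reverses the dot-separated labels of a hostname. The function name `reverse_host` is hypothetical and not part of any Common Crawl tooling.

```python
def reverse_host(host: str) -> str:
    """Reverse the dot-separated labels of a hostname.

    Applying the function twice returns the original name, so it
    converts in both directions.
    """
    return ".".join(reversed(host.split(".")))


if __name__ == "__main__":
    print(reverse_host("www.example.com"))   # reversed notation
    print(reverse_host("com.example.www"))   # back to the usual notation
```

In this notation, hosts that belong to the same domain share a common prefix, so they end up next to each other when the names are sorted.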

However, only hostnames with a valid IANA TLD are used. As a result, URLs with an IP address as the host component are not taken into account when building the host-level graph.
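The filtering rule above could be sketched roughly as follows. This is not the actual graph-building code: the name `host_for_graph` and the tiny `SAMPLE_TLDS` set are assumptions made for illustration, and a real check would use the full TLD list published by IANA.

```python
import ipaddress
from urllib.parse import urlsplit

# Illustrative subset only (assumption); the real IANA list is much longer.
SAMPLE_TLDS = {"com", "org", "net", "de", "uk"}


def host_for_graph(url, valid_tlds=SAMPLE_TLDS):
    """Return the hostname of `url` if it qualifies for the host graph, else None."""
    host = urlsplit(url).hostname  # lowercased; None if the URL has no host
    if not host:
        return None
    try:
        # IPv4 and IPv6 literals parse successfully and are skipped.
        ipaddress.ip_address(host)
        return None
    except ValueError:
        pass  # not an IP address, continue with the TLD check
    tld = host.rstrip(".").rsplit(".", 1)[-1]
    return host if tld in valid_tlds else None


if __name__ == "__main__":
    for u in ("https://www.example.com/page",
              "http://192.168.0.1/index.html",
              "http://example.invalid/"):
        print(u, "->", host_for_graph(u))
```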

The domain graph is built by aggregating the host graph at the pay-level domain (PLD) level based on the public suffix list maintained on publicsuffix.org.
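The aggregation step can be pictured with a small sketch that maps each host to its pay-level domain and collapses the host-level links accordingly. Everything here is a simplifying assumption: `SAMPLE_SUFFIXES` is a tiny stand-in for the public suffix list, wildcard and exception rules of the real list are ignored, and dropping links within the same domain is a choice made for this example rather than a documented property of the released graphs.

```python
# Tiny stand-in for the public suffix list (assumption, for illustration only).
SAMPLE_SUFFIXES = {"com", "org", "co.uk"}


def pay_level_domain(rev_host, suffixes=SAMPLE_SUFFIXES):
    """Return the PLD of a host given in reverse notation, or None.

    The PLD is the longest matching public suffix plus one more label,
    e.g. "uk.co.example.www" -> "uk.co.example".
    """
    labels = rev_host.split(".")  # TLD comes first in reverse notation
    # Try the longest candidate suffix first; leave at least one extra label.
    for n in range(len(labels) - 1, 0, -1):
        suffix = ".".join(reversed(labels[:n]))
        if suffix in suffixes:
            return ".".join(labels[:n + 1])
    return None


def aggregate(host_edges):
    """Collapse host-level edges into a set of domain-level edges."""
    domain_edges = set()
    for src, dst in host_edges:
        s, d = pay_level_domain(src), pay_level_domain(dst)
        # Skip hosts without a known suffix and links inside one domain (assumption).
        if s and d and s != d:
            domain_edges.add((s, d))
    return domain_edges


if __name__ == "__main__":
    edges = [
        ("com.example.www", "org.example.cdn"),
        ("com.example.blog", "org.example.www"),
        ("com.example.www", "com.example.blog"),
        ("uk.co.example.shop", "com.example.www"),
    ]
    for edge in sorted(aggregate(edges)):
        print(edge)
```

Using a set means that several host-level links between the same pair of domains collapse into a single domain-level edge.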

The list of graph releases is also available via graphinfo.json.

For more information please see cc-webgraph on GitHub.

Credits

Thanks to the authors of the WebGraph Framework, whose software made the computation of graph properties and ranks possible.

We hope you find the data useful for any kind of research on ranking, graph analysis, link spam detection, etc. Let us know about your results via Common Crawl’s Google Group!