Search results

Common Crawl - Blog - Host- and Domain-Level Web Graphs Aug/Sep/Oct 2019

Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases. Sebastian Nagel. Sebastian is a Distinguished Engineer at the Common Crawl Foundation.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs May/June/July 2019

Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases. Sebastian Nagel. Sebastian is a Distinguished Engineer at the Common Crawl Foundation.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs October, November/December 2020 and January 2021

Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases. Sebastian Nagel. Sebastian is a Distinguished Engineer at the Common Crawl Foundation.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs Nov/Dec/Jan 2017-2018

Here is a summary of notable aspects and changes of this web graph release: a bug has been fixed which caused relative links pointing to a different host (//www.example.com/index.html) not to be added as edges of the host/domain-level webgraphs. the domain…
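The link form mentioned in this snippet (`//www.example.com/index.html`) has no scheme but does name a host, so it must be resolved against the linking page before its host can become an edge. The following is a minimal sketch of that resolution using Python's standard `urllib.parse`; it is not the cc-webgraph code, and the helper name `host_edge` and the example page URL are hypothetical.

```python
from urllib.parse import urljoin, urlsplit


def host_edge(page_url, href):
    """Return (source_host, target_host) for a cross-host link, else None.

    Hypothetical helper for illustration only; not taken from cc-webgraph.
    """
    source = urlsplit(page_url).hostname
    # urljoin resolves "//host/path" by borrowing only the scheme of the page
    target = urlsplit(urljoin(page_url, href)).hostname
    if source is None or target is None or source == target:
        return None
    return (source, target)


page = "https://blog.example.org/post/1.html"
print(host_edge(page, "//www.example.com/index.html"))
print(host_edge(page, "/about.html"))
```

The first call yields an edge between two hosts, while the second link stays on the same host and produces no host-level edge.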

Common Crawl - Blog - Host- and Domain-Level Web Graphs Feb/Mar/Apr 2018

Additional information about data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases. Sebastian Nagel. Sebastian is a Distinguished Engineer at the Common Crawl Foundation.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs Aug/Sep/Oct 2018

Additional information about data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases. Common Crawl Foundation.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs Nov/Dec/Jan 2018 - 2019

Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases. Sebastian Nagel. Sebastian is a Distinguished Engineer at the Common Crawl Foundation.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs Nov/Dec/Jan 2019 – 2020

Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases. Sebastian Nagel. Sebastian is a Distinguished Engineer at the Common Crawl Foundation.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs Jul/Aug/Sep 2020

Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases. Sebastian Nagel. Sebastian is a Distinguished Engineer at the Common Crawl Foundation.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs Feb/Mar/Apr 2019

Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases. Sebastian Nagel. Sebastian is a Distinguished Engineer at the Common Crawl Foundation.…

Common Crawl - Blog - Interactive Webgraph Statistics Notebook Released

Interactive Webgraph Statistics Notebook Released. We are pleased to announce the release of an interactive Jupyter notebook that is used to provide visualization of webgraph statistics, and a way to interact with the webgraph. Alex Xue.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs May/June/July 2018

Additional information about data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases. Common Crawl Foundation.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs Feb/Mar/May 2020

Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases. Sebastian Nagel. Sebastian is a Distinguished Engineer at the Common Crawl Foundation.…

Common Crawl - Erratum - Nodes in Domain-Level Webgraphs Not Sorted and May Include Duplicates

Nodes in Domain-Level Webgraphs Not Sorted and May Include Duplicates. Originally reported by covuworie. The nodes in domain-level Web Graphs may not be properly sorted lexicographically by node label (reversed domain name).…
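To illustrate what the erratum describes, here is a minimal sketch that builds reversed domain names and checks a list of node labels for sort order and duplicates. The function names are hypothetical, and the comparison uses Python's default string ordering (by code point), which is an assumption and may differ from the ordering used when the graphs were built.

```python
def reverse_domain(name):
    """Turn 'example.com' into the node label form 'com.example'."""
    return ".".join(reversed(name.split(".")))


def check_nodes(labels):
    """Return (is_sorted, duplicates) for a list of node labels.

    Assumes plain code-point ordering of Python strings.
    """
    is_sorted = all(a <= b for a, b in zip(labels, labels[1:]))
    seen = set()
    duplicates = []
    for label in labels:
        if label in seen and label not in duplicates:
            duplicates.append(label)
        seen.add(label)
    return is_sorted, duplicates


labels = [reverse_domain(d) for d in ["example.com", "example.org", "commoncrawl.org"]]
print(labels)
print(check_nodes(labels))
print(check_nodes(sorted(set(labels))))
```

A consumer that relies on binary search over node labels could run a check like this first and fall back to sorting and deduplicating if it fails.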

Common Crawl - Blog - Host- and Domain-Level Web Graphs May, June/July and August 2022

Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs September/October, November/December 2023 and February/March 2024

Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs November/December 2023, February/March 2024, and April 2024

Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs February/March, April and May 2021

Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases. Sebastian Nagel. Sebastian is a Distinguished Engineer at the Common Crawl Foundation.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs October, November/December 2021 and January 2022

Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases. Sebastian Nagel. Sebastian is a Distinguished Engineer at the Common Crawl Foundation.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs May/Sep/Nov 2023

Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs February/March, April, and May 2024

Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs September/October, November/December 2022 and January/February 2023

For more information about the data formats and the processing pipeline, please see the announcements of previous webgraph releases. Sebastian Nagel. Sebastian is a Distinguished Engineer at the Common Crawl Foundation.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs June, July/August and September 2021

Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases. Sebastian Nagel. Sebastian is a Distinguished Engineer at the Common Crawl Foundation.…

Common Crawl - Blog - Common Crawl's First In-House Web Graph

To compute the rankings, the webgraph is loaded into the WebGraph framework. Hosts ranked by Harmonic Centrality and PageRank. We provide a list of ranked nodes (host names) by Harmonic Centrality (calculated by HyperBall) and PageRank (by…

Common Crawl - Blog - Host- and Domain-Level Web Graphs October, November, and December 2024

You can also visit the projects cc-webgraph and cc-pyspark, which include all scripts and tools required to construct the graphs. Instructions to explore the graphs in the webgraph format are given in our collection of Web Graph Notebooks.…

Common Crawl - Web Graphs

For more information please see cc-webgraph on GitHub, and our Web Graph Statistics page. Credits. Thanks to the authors of the WebGraph Framework, whose software made the computation of graph properties and ranks possible.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs May, June, and July 2024

You may also visit the projects cc-webgraph and cc-pyspark, which include all scripts and tools required to construct the graphs. Instructions to explore the graphs in the webgraph format are given in our collection of Web Graph Notebooks.…

Common Crawl - Blog - 3.25 Billion Pages Crawled in July 2018

Feb/Mar/Apr 2018 webgraph data set. a breadth-first side crawl within a maximum of 6 links (“hops”) away from the home pages of the top 25 million hosts or top 25 million domains of the webgraph dataset. a random sample taken from WAT files of the June crawl…
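The crawl announcements in these results describe a breadth-first side crawl limited to a maximum number of links ("hops") from seed home pages. As a rough sketch of that idea only, and not the crawler's actual implementation, a hop-limited breadth-first traversal over an in-memory link map could look like this; the function name and the toy link data are hypothetical.

```python
from collections import deque


def bfs_within_hops(links, seeds, max_hops):
    """Return {page: hop_distance} for pages reachable within max_hops of any seed.

    `links` maps a page to the pages it links to (illustrative toy data).
    """
    distance = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        if distance[page] == max_hops:
            continue  # do not follow links beyond the hop limit
        for target in links.get(page, ()):
            if target not in distance:
                distance[target] = distance[page] + 1
                queue.append(target)
    return distance


toy_links = {"home": ["a"], "a": ["b"], "b": ["c"]}
print(bfs_within_hops(toy_links, ["home"], 2))
```

Because the queue is processed in first-in, first-out order, each page is recorded with its smallest hop count from any seed.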

Common Crawl - Blog - January 2019 crawl archive now available

Aug/Sep/Oct 2018 webgraph data set. from the following sources: sitemaps, RSS and Atom feeds. a breadth-first side crawl within a maximum of 6 links (“hops”) away from the homepages of the top 50 million hosts and domains. a random sample of outlinks taken…

Common Crawl - Blog - December 2018 crawl archive now available

Aug/Sep/Oct 2018 webgraph data set. a breadth-first side crawl within a maximum of 6 links (“hops”) away from the home pages of the top 50 million domains of the webgraph dataset. a random sample of outlinks taken from WAT files of the November crawl. 30 million…

Common Crawl - Blog - May 2018 Crawl Archive Now Available

Feb/Mar/Apr 2018 webgraph data set. a breadth-first side crawl within a maximum of 4 links (“hops”) away from the home pages of the top 40 million hosts or top 40 million domains of the webgraph dataset. a random sample taken from WAT files of the April crawl…

Common Crawl - Blog - Host- and Domain-Level Web Graphs Mar/May/Oct 2023

You may also visit the projects cc-webgraph and cc-pyspark, which include all scripts and tools required to construct the graphs. Instructions to explore the graphs in the webgraph format are given in our collection of web graph notebooks.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs July, August, and September 2024

You may also visit the projects cc-webgraph and cc-pyspark, which include all scripts and tools required to construct the graphs. Instructions to explore the graphs in the webgraph format are given in our collection of Web Graph notebooks.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs October, November, December 2025

You may also visit the projects cc-webgraph and cc-pyspark, which include all scripts and tools required to construct the graphs. Instructions to explore the graphs in the webgraph format are given in our collection of Web Graph Notebooks.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs July, August, and September 2025

You may also visit the projects cc-webgraph and cc-pyspark, which include all scripts and tools required to construct the graphs. Instructions to explore the graphs in the webgraph format are given in our collection of Web Graph Notebooks.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs September, October, and November 2025

You may also visit the projects cc-webgraph and cc-pyspark, which include all scripts and tools required to construct the graphs. Instructions to explore the graphs in the webgraph format are given in our collection of Web Graph Notebooks.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs April, May, and June 2025

You may also visit the projects cc-webgraph and cc-pyspark, which include all scripts and tools required to construct the graphs. Instructions to explore the graphs in the webgraph format are given in our collection of Web Graph Notebooks.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs November/December 2024 and January 2025

You may also visit the projects cc-webgraph and cc-pyspark, which include all scripts and tools required to construct the graphs. Instructions to explore the graphs in the WebGraph format are given in our collection of Web Graph notebooks.…

Common Crawl - Blog - November 2018 crawl archive now available

Aug/Sep/Oct 2018 webgraph data set. a breadth-first side crawl within a maximum of 10 links (“hops”) away from the home pages of the top 40 million domains of the webgraph dataset. a random sample of outlinks taken from WAT files of the October crawl. 50 million…

Common Crawl - Blog - Host- and Domain-Level Web Graphs August, September, and October 2025

You may also visit the projects cc-webgraph and cc-pyspark, which include all scripts and tools required to construct the graphs. Instructions to explore the graphs in the webgraph format are given in our collection of Web Graph Notebooks.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs June, July, and August 2024

You may also visit the projects cc-webgraph and cc-pyspark, which include all scripts and tools required to construct the graphs. Instructions to explore the graphs in the webgraph format are given in our collection of Web Graph Notebooks.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs August, September, and October 2024

You may also visit the projects cc-webgraph and cc-pyspark, which include all scripts and tools required to construct the graphs. Instructions to explore the graphs in the webgraph format are given in our collection of Web Graph Notebooks.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs November/December 2025 and January 2026

You may also visit the projects cc-webgraph and cc-pyspark, which include all scripts and tools required to construct the graphs. Instructions to explore the graphs in the webgraph format are given in our collection of Web Graph Notebooks.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs December 2024 and January/February 2025

You may also visit the projects cc-webgraph and cc-pyspark, which include all scripts and tools required to construct the graphs. Instructions to explore the graphs in the webgraph format are given in our collection of Web Graph notebooks.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs January, February, and March 2025

You may also visit the projects cc-webgraph and cc-pyspark, which include all scripts and tools required to construct the graphs. Instructions to explore the graphs in the WebGraph format are given in our collection of Web Graph notebooks.…

Common Crawl - Blog - October 2018 crawl archive now available

May/June/July 2018 webgraph data set. a breadth-first side crawl within a maximum of 10 links (“hops”) away from the home pages of the top 40 million domains of the webgraph dataset. a random sample of outlinks taken from WAT files of the September crawl. 15…

Common Crawl - Blog - Host- and Domain-Level Web Graphs Aug/Sept/Oct 2017

"cc-webgraph" on GitHub. As compared to prior web graphs, two changes are caused by the large size of this host-level graph (5.1 billion hosts): The text dump of the graph is split into multiple files; there is no page rank calculation at this time.…

Common Crawl - Blog - January 2018 Crawl Archive Now Available

Aug/Sept/Oct 2017 webgraph data set. a breadth-first side crawl within a maximum of 4 links (“hops”) away from the home pages of the top 50 million hosts or top 25 million domains of the webgraph dataset. a random sample taken from WAT files of the December…

Common Crawl - Blog - June 2018 Crawl Archive Now Available

Feb/Mar/Apr 2018 webgraph data set. a breadth-first side crawl within a maximum of 4 links (“hops”) away from the home pages of the top 25 million hosts or top 25 million domains of the webgraph dataset. a random sample taken from WAT files of the May crawl…

Common Crawl - Blog - March 2018 Crawl Archive Now Available

Nov/Dec/Jan 2017/2018 webgraph data set. a breadth-first side crawl within a maximum of 3 links (“hops”) away from the home pages of the top 60 million hosts or top 30 million domains of the webgraph dataset. a random sample taken from WAT files of the February…

Common Crawl - Blog - April 2018 Crawl Archive Now Available

Nov/Dec/Jan 2017/2018 webgraph data set.…

Common Crawl - Blog - Web Graph Statistics Gets a Proper Upgrade

The cc-webgraph-statistics site has been quietly doing its job for a while now: publishing harmonic centrality and PageRank data derived from Common Crawl's web graph, and letting researchers and the curious alike poke around the rankings.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs September, October, November 2024

You may also visit the projects cc-webgraph and cc-pyspark, which include all scripts and tools required to construct the graphs. Instructions to explore the graphs in the webgraph format are given in our collection of Web Graph notebooks.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs April, May, and June 2024

You may also visit the projects cc-webgraph and cc-pyspark, which include all scripts and tools required to construct the graphs. Instructions to explore the graphs in the webgraph format are given in our collection of Web Graph Notebooks.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs December 2025 and January/February 2026

You may also visit the projects cc-webgraph and cc-pyspark, which include all scripts and tools required to construct the graphs. Instructions to explore the graphs in the webgraph format are given in our collection of Web Graph Notebooks.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs January, February, and March 2026

You may also visit the projects cc-webgraph and cc-pyspark, which include all scripts and tools required to construct the graphs. Instructions to explore the graphs in the webgraph format are given in our collection of Web Graph Notebooks.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs May, June, and July 2025

You may also visit the projects cc-webgraph and cc-pyspark, which include all scripts and tools required to construct the graphs. Instructions to explore the graphs in the webgraph format are given in our collection of Web Graph Notebooks.…

Common Crawl - Blog - Host- and Domain-Level Web Graphs June, July, and August 2025

You may also visit the projects cc-webgraph and cc-pyspark, which include all scripts and tools required to construct the graphs. Instructions to explore the graphs in the webgraph format are given in our collection of Web Graph Notebooks.…

Common Crawl - Blog - February 2018 Crawl Archive Now Available

January 2018 webgraph data set. a breadth-first side crawl within a maximum of 4 links (“hops”) away from the home pages of the top 50 million hosts or top 25 million domains of the webgraph dataset. a random sample taken from WAT files of the January crawl…

Common Crawl - Blog - February 2019 crawl archive now available

Nov/Dec/Jan 2018/2019 webgraph data set. from the following sources: sitemaps, RSS and Atom feeds. a breadth-first side crawl within a maximum of 5 links (“hops”) away from the homepages of the top 50 million hosts and domains. a random sample of outlinks…