Search results
January 8, 2013. Common Crawl URL Index. We are thrilled to announce that Common Crawl now has a URL index! Scott Robertson, founder of triv.io, graciously donated his time and skills to creating this valuable tool. Scott Robertson.…
URL Index. The Common Crawl Foundation provides two indexes for querying the Common Crawl Corpus: the CDXJ Index and the URL Index. This page introduces the URL Index and gives some examples of how to use it.…
June 3, 2026. The Columnar Index Is Now the URL Index. We have renamed the Columnar Index to the URL Index, to be clearer about its purpose and to pave the way for more datasets in a columnar format. Common Crawl Foundation.…
March 5, 2013. URL Search Tool! A couple months ago we announced the creation of the Common Crawl URL Index and followed it up with a guest post by Jason Ronallo describing how he had used the URL Index.…
URL Index Subsets with Fewer than 900 Partitions per Crawl. Originally reported by Sebastian Nagel.…
CDXJ Index. The Common Crawl Foundation provides two indexes for querying the Common Crawl corpus: the CDXJ index and the URL Index (previously called the "Columnar Index"). This page introduces the CDXJ index and gives some examples of how to use it.…
April 8, 2015. Announcing the Common Crawl Index! This is a guest post by Ilya Kreymer, a dedicated volunteer who has gifted large amounts of time, effort and talent to Common Crawl.…
January 13, 2026. GneissWeb Annotations Examples. A new Common Crawl index annotation has been added to Hugging Face and our S3 bucket. Thijs Dalhuijsen. Thijs Dalhuijsen is Engineering Manager at Common Crawl.…
Common Crawl Foundation. Common Crawl builds and maintains an open repository of web crawl data that can be accessed and analyzed by anyone.…
November 13, 2018. Host- and Domain-Level Web Graphs Aug/Sep/Oct 2018. We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of August, September and October 2018.…
Thom is Principal Engineer at the Common Crawl Foundation. We are pleased to announce a new release of host-level and domain-level Web Graphs based on the crawls of May, June, and July 2024. The crawls used to generate the graphs were.…
October 16, 2020. Host- and Domain-Level Web Graphs Jul/Aug/Sep 2020. We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of July, August and September 2020.…
You can download the graph and the ranks of all 279.4 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2025-26-nov-dec-jan/host/ (this requires an account on AWS).…
Sebastian is a Distinguished Engineer at the Common Crawl Foundation. We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of November, December 2018 and January 2019.…
Sebastian is a Distinguished Engineer at the Common Crawl Foundation. We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of November, December 2019 and January 2020.…
Thom is Principal Engineer at the Common Crawl Foundation. We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of February, April, and May 2024. The crawls used to generate the graphs were.…
Thom is Principal Engineer at the Common Crawl Foundation. We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of May, September, and November of 2023. The crawls used to generate the graphs were.…
We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of September, November, February 2023-24. Thom Vaughan. Thom is Principal Engineer at the Common Crawl Foundation.…
We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of November, February, April 2024. Thom Vaughan. Thom is Principal Engineer at the Common Crawl Foundation.…
November 12, 2019. Host- and Domain-Level Web Graphs Aug/Sep/Oct 2019. We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of August, September and October 2019.…
Sebastian is a Distinguished Engineer at the Common Crawl Foundation. We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of May, June and July 2019.…
Crawls process approximately three billion pages per cycle. CCBot uses Harmonic Centrality, a graph-theoretic measure of a node's proximity to the structural core of the web, to prioritise URLs for crawling. This metric is computed from our.…
The crawls used to generate the graphs were CC-MAIN-2024-33, CC-MAIN-2024-30, and CC-MAIN-2024-26. Thom Vaughan. Thom is Principal Engineer at the Common Crawl Foundation.…
Sebastian is a Distinguished Engineer at the Common Crawl Foundation. We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of February, March/April and May/June 2020.…
The crawls used to generate the graphs were CC-MAIN-2024-42, CC-MAIN-2024-46, and CC-MAIN-2024-51. Thom Vaughan. Thom is Principal Engineer at the Common Crawl Foundation.…
June 19, 2016. May 2016 Crawl Archive Now Available. The crawl archive for May 2016 is now available! More than 1.46 billion web pages are in the archive. Sebastian Nagel.…
The crawls used to generate the graphs were CC-MAIN-2024-46, CC-MAIN-2024-42, and CC-MAIN-2024-38. Thom Vaughan. Thom is Principal Engineer at the Common Crawl Foundation.…
The crawls used to generate the graphs were CC-MAIN-2024-18, CC-MAIN-2024-22, and CC-MAIN-2024-26. Thom Vaughan. Thom is Principal Engineer at the Common Crawl Foundation.…
Sebastian is a Distinguished Engineer at the Common Crawl Foundation. We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of February, March and April 2018.…
November 7, 2016. October 2016 Crawl Archive Now Available. The crawl archive for October 2016 is now available! The archive contains more than 3.25 billion web pages. Sebastian Nagel.…
December 16, 2016. December 2016 Crawl Archive Now Available. The crawl archive for December 2016 is now available! The archive contains more than 2.85 billion web pages. Sebastian Nagel.…
The crawls used to generate the graphs were CC-MAIN-2024-33, CC-MAIN-2024-38, and CC-MAIN-2024-42. Thom Vaughan. Thom is Principal Engineer at the Common Crawl Foundation.…
The crawls used to generate the graphs were CC-MAIN-2024-30, CC-MAIN-2024-33, and CC-MAIN-2024-38. Thom Vaughan. Thom is Principal Engineer at the Common Crawl Foundation.…
You can download the graph and the ranks of all 277.7 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2024-25-nov-dec-jan/host/ (this requires an account on AWS).…
Laurie is a Principal Research Engineer at the Common Crawl Foundation. We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of July, August, and September 2025.…
You can download the graph and the ranks of all 235.7 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2025-sep-oct-nov/host/ (this requires an account on AWS).…
Michael is a Senior Research Engineer at the Common Crawl Foundation. We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of March, April, and May 2026. The crawls used to generate the graphs were.…
You can download the graph and the ranks of all 468.4 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2025-aug-sep-oct/host/ (this requires an account on AWS).…
Luca Foppiano is a Senior Engineer at the Common Crawl Foundation. We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of February, March, and April 2026. The crawls used to generate the graphs were.…
You can download the graph and the ranks of all 250.8 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2025-oct-nov-dec/host/ (this requires an account on AWS).…
Thom is Principal Engineer at the Common Crawl Foundation. We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of March, May, and October 2023. The crawls used to generate the graphs were.…
Thom is Principal Engineer at the Common Crawl Foundation. We are pleased to announce a new release of host-level and domain-level Web Graphs based on the crawls of February, March, and April 2025. The crawls used to generate the graphs were.…
March 26, 2024. March/April 2024 Newsletter. We're excited to share an update on some of our recent projects and initiatives in this newsletter! Common Crawl Foundation.…
Luca Foppiano is a Senior Engineer at the Common Crawl Foundation. We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of January, February, and March 2026. The crawls used to generate the graphs were.…
Laurie is a Principal Research Engineer at the Common Crawl Foundation. We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of June, July, and August 2025. The crawls used to generate the graphs were.…
You can download the graph and the ranks of all 267.4 million hosts from AWS S3 on the path s3://commoncrawl/projects/hyperlinkgraph/cc-main-2024-25-dec-jan-feb/host/ (this requires an account on AWS).…
Thom is Principal Engineer at the Common Crawl Foundation. We are pleased to announce a new release of host-level and domain-level Web Graphs based on the crawls of January, February, and March 2025. The crawls used to generate the graphs were.…
December 10, 2020. November/December 2020 crawl archive now available. The crawl archive for November/December 2020 is now available!…
It includes page captures of 1.3 billion new URLs, not visited in any of our prior crawls. Sebastian Nagel. Sebastian is a Distinguished Engineer at the Common Crawl Foundation. The crawl archive for October 2021 is now available!…
October 7, 2020. September 2020 crawl archive now available. The crawl archive for September 2020 is now available!…
Sebastian is a Distinguished Engineer at the Common Crawl Foundation. We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of October, November/December 2020 and January 2021.…
It includes page captures of 1.28 billion new URLs, not visited in any of our prior crawls. Sebastian Nagel. Sebastian is a Distinguished Engineer at the Common Crawl Foundation. The crawl archive for May 2021 is now available!…
November 7, 2020. October 2020 crawl archive now available. The crawl archive for October 2020 is now available!…
It includes page captures of 1.35 billion new URLs, not visited in any of our prior crawls. Sebastian Nagel. Sebastian is a Distinguished Engineer at the Common Crawl Foundation. The crawl archive for January 2022 is now available!…
Sebastian is a Distinguished Engineer at the Common Crawl Foundation. The crawl archive for June 2018 is now available! The archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2018-26/.…
It includes page captures of 1.3 billion new URLs, not visited in any of our prior crawls. Sebastian Nagel. Sebastian is a Distinguished Engineer at the Common Crawl Foundation. The crawl archive for September 2021 is now available!…
Thom is Principal Engineer at the Common Crawl Foundation. We are pleased to announce a new release of host-level and domain-level Web Graphs based on the crawls of April, May, and June 2025. The crawls used to generate the graphs were.…
May 9, 2017. April 2017 Crawl Archive Now Available. The crawl archive for April 2017 is now available! The archive contains 2.94 billion+ web pages and over 250 TiB of uncompressed content. Sebastian Nagel.…
It includes page captures of 1.15 billion new URLs, not visited in any of our prior crawls. Sebastian Nagel. Sebastian is a Distinguished Engineer at the Common Crawl Foundation. The crawl archive for January 2021 is now available!…
It includes page captures of 1.3 billion new URLs, not visited in any of our prior crawls. Sebastian Nagel. Sebastian is a Distinguished Engineer at the Common Crawl Foundation. The crawl archive for June 2021 is now available!…