Search results

Common Crawl - Blog - Introducing cc-downloader

Today we are happy to announce. cc-downloader. , an experimental command-line tool for downloading Common Crawl data via. https. cc-downloader. is intended to be a user-friendly and polite downloader.

Common Crawl - Blog - Host- and Domain-Level Web Graphs June, July, and August 2024

The crawls used to generate the graphs were CC-MAIN-2024-33, CC-MAIN-2024-30, and CC-MAIN-2024-26. Thom Vaughan. Thom is Principal Technologist at the Common Crawl Foundation.

Common Crawl - Blog - Host- and Domain-Level Web Graphs August, September, and October 2024

The crawls used to generate the graphs were CC-MAIN-2024-33, CC-MAIN-2024-38, and CC-MAIN-2024-42. Thom Vaughan. Thom is Principal Technologist at the Common Crawl Foundation.

Common Crawl - Blog - Host- and Domain-Level Web Graphs July, August, and September 2024

The crawls used to generate the graphs were CC-MAIN-2024-30, CC-MAIN-2024-33, and CC-MAIN-2024-38. Thom Vaughan. Thom is Principal Technologist at the Common Crawl Foundation.

Common Crawl - Blog - Host- and Domain-Level Web Graphs October, November, and December 2024

The crawls used to generate the graphs were CC-MAIN-2024-42, CC-MAIN-2024-46, and CC-MAIN-2024-51. Thom Vaughan. Thom is Principal Technologist at the Common Crawl Foundation.

Common Crawl - Blog - Host- and Domain-Level Web Graphs April, May, and June 2024

The crawls used to generate the graphs were CC-MAIN-2024-18, CC-MAIN-2024-22, and CC-MAIN-2024-26. Thom Vaughan. Thom is Principal Technologist at the Common Crawl Foundation.

Common Crawl - Blog - Host- and Domain-Level Web Graphs September, October, November 2024

The crawls used to generate the graphs were CC-MAIN-2024-46, CC-MAIN-2024-42, and CC-MAIN-2024-38. Thom Vaughan. Thom is Principal Technologist at the Common Crawl Foundation.

Common Crawl - Blog - March/April 2024 Newsletter

New Staff Members. New Board Member. Discord Server. Updated Legal Information. Crawl & Graph Errata. Improved Cadence. Acknowledgements. Web Graphs. Our.

Common Crawl - Terms of Use

CC cannot guarantee the truthfulness, authenticity, quality, lawfulness or accuracy of the Crawled Content.

Common Crawl - Erratum - Incorrect fetch_time metadata

CC-MAIN-2016-36. to. CC-MAIN-2016-50. , and. CC-MAIN-2018-34. to. CC-MAIN-2019-47. the fetch_time metadata for. robots.txt. might be incorrect. The correct times can be found in. collinfo.json.

Common Crawl - Blog - October 2016 Crawl Archive Now Available

The archive is located in the. commoncrawl. bucket at. crawl-data/CC-MAIN-2016-44/. It contains more than 3.25 billion web pages. Similar to the.

Common Crawl - Blog - December 2016 Crawl Archive Now Available

The archive is located in the. commoncrawl. bucket at. crawl-data/CC-MAIN-2016-50/. It contains more than 2.85 billion web pages. Similar to the preceding. September. and.

Common Crawl - Blog - July 2016 Crawl Archive Now Available

The archive located in the. commoncrawl. bucket at. crawl-data/CC-MAIN-2016-30/. contains more than 1.73 billion web pages. To assist with exploring and using the dataset, we provide gzipped files that list: all segments.

Common Crawl - Erratum - WARC revisit metadata records

CC-MAIN-2018-34. to. CC-MAIN-2024-46. (since. Aug 2018. ) lack the metadata record which is attached to all response records. Fixed with. CC-MAIN-2024-51. , see. commoncrawl/nutch#33. Note: before.

Common Crawl - Blog - February 2016 Crawl Archive Now Available

crawl-data/CC-MAIN-2016-07/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments. (CC-MAIN-2016-07/segment.paths.gz). all WARC files. (CC-MAIN-2016-07/warc.paths.gz). all WAT files.

Common Crawl - Blog - August 2014 Crawl Data Available

crawl-data/CC-MAIN-2014-35/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments. (CC-MAIN-2014-35/segment.paths.gz). all WARC files. (CC-MAIN-2014-35/warc.paths.gz). all WAT files.

Common Crawl - Blog - September 2016 Crawl Archive Now Available

The archive located in the. commoncrawl. bucket at. crawl-data/CC-MAIN-2016-40/. contains more than 1.72 billion web pages.

Common Crawl - Blog - December 2014 Crawl Archive Available

crawl-data/CC-MAIN-2014-52/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments. (CC-MAIN-2014-52/segment.paths.gz). all WARC files. (CC-MAIN-2014-52/warc.paths.gz). all WAT files.

Common Crawl - Blog - June 2016 Crawl Archive Now Available

The archive located in the. commoncrawl. bucket at. crawl-data/CC-MAIN-2016-26/. contains more than 1.23 billion web pages. To assist with exploring and using the dataset, we provide gzipped files that list: all segments.

Common Crawl - Blog - January 2015 Crawl Archive Available

crawl-data/CC-MAIN-2015-06/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments. (CC-MAIN-2015-06/segment.paths.gz). all WARC files. (CC-MAIN-2015-06/warc.paths.gz). all WAT files.

Common Crawl - Blog - April 2016 Crawl Archive Now Available

More than 1.33 billion webpages are in the archive, which islocated in the. commoncrawl. bucket at. crawl-data/CC-MAIN-2016-18/. To assist with exploring and using the dataset, we provide gzipped files that list: all segments.

Common Crawl - Blog - September 2014 Crawl Archive Available

crawl-data/CC-MAIN-2014-41/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments. (CC-MAIN-2014-41/segment.paths.gz). all WARC files. (CC-MAIN-2014-41/warc.paths.gz). all WAT files.

Common Crawl - Blog - October 2014 Crawl Archive Available

crawl-data/CC-MAIN-2014-42/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments. (CC-MAIN-2014-42/segment.paths.gz). all WARC files. (CC-MAIN-2014-42/warc.paths.gz). all WAT files.

Common Crawl - Blog - May 2016 Crawl Archive Now Available

More than 1.46 billion web pages are in the archive, which is located in the. commoncrawl. bucket at. crawl-data/CC-MAIN-2016-22/. To assist with exploring and using the dataset, we provide gzipped files that list: all segments.

Common Crawl - Blog - April 2014 Crawl Data Available

crawl-data/CC-MAIN-2014-15/. To assist with exploring and using the dataset, we've provided gzipped files that list: all segments. (CC-MAIN-2014-15/segment.paths.gz). all WARC files. (CC-MAIN-2014-15/warc.paths.gz). all WAT files.

Common Crawl - Blog - Host- and Domain-Level Web Graphs May, June, and July 2024

CC-MAIN-2024-22. , CC-MAIN-2024-26. , and. CC-MAIN-2024-30. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior. Web Graph Releases.

Common Crawl - Blog - November 2014 Crawl Archive Available

crawl-data/CC-MAIN-2014-49/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments. (CC-MAIN-2014-49/segment.paths.gz). all WARC files. (CC-MAIN-2014-49/warc.paths.gz). all WAT files.

Common Crawl - Blog - July 2014 Crawl Data Available

crawl-data/CC-MAIN-2014-23/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments. (CC-MAIN-2014-23/segment.paths.gz). all WARC files. (CC-MAIN-2014-23/warc.paths.gz). all WAT files.

Common Crawl - Blog - September 2015 Crawl Archive Now Available

crawl-data/CC-MAIN-2015-40/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments. (CC-MAIN-2015-40/segment.paths.gz). all WARC files. (CC-MAIN-2015-40/warc.paths.gz). all WAT files.

Common Crawl - Blog - November 2015 Crawl Archive Now Available

crawl-data/CC-MAIN-2015-48/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments. (CC-MAIN-2015-48/segment.paths.gz). all WARC files. (CC-MAIN-2015-48/warc.paths.gz). all WAT files.

Common Crawl - Blog - Host- and Domain-Level Web Graphs May/Sep/Nov 2023

CC-MAIN-2023-23. , CC-MAIN-2023-40. , and. CC-MAIN-2023-50. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior. webgraph releases.

Common Crawl - Blog - Host- and Domain-Level Web Graphs February/March, April, and May 2024

CC-MAIN-2024-10. , CC-MAIN-2024-18. , and. CC-MAIN-2024-22. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior. webgraph releases.

Common Crawl - Blog - Host- and Domain-Level Web Graphs January, February, and March 2025

CC-MAIN-2025-13. , CC-MAIN-2025-08. , and. CC-MAIN-2025-05. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior. Web Graph releases.

Common Crawl - Blog - Host- and Domain-Level Web Graphs September/October, November/December 2023 and February/March 2024

CC-MAIN-2023-40. , CC-MAIN-2023-50. , and. CC-MAIN-2024-10. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior. webgraph releases.

Common Crawl - Blog - Host- and Domain-Level Web Graphs November/December 2023, February/March 2024, and April 2024

CC-MAIN-2023-50. , CC-MAIN-2024-10. , and. CC-MAIN-2024-18. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior. webgraph releases.

Common Crawl - Blog - June 2015 Crawl Archive Available

crawl-data/CC-MAIN-2015-27/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments. (CC-MAIN-2015-27/segment.paths.gz). all WARC files. (CC-MAIN-2015-27/warc.paths.gz). all WAT files.

Common Crawl - Blog - March 2015 Crawl Archive Available

crawl-data/CC-MAIN-2015-14/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments. (CC-MAIN-2015-14/segment.paths.gz). all WARC files. (CC-MAIN-2015-14/warc.paths.gz). all WAT files.

Common Crawl - Blog - May 2015 Crawl Archive Available

crawl-data/CC-MAIN-2015-22/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments. (CC-MAIN-2015-22/segment.paths.gz). all WARC files. (CC-MAIN-2015-22/warc.paths.gz). all WAT files.

Common Crawl - Blog - Host- and Domain-Level Web Graphs November/December 2024 and January 2025

CC-MAIN-2025-05. , CC-MAIN-2024-51. , and. CC-MAIN-2024-46. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior. Web Graph releases.

Common Crawl - Blog - August 2016 Crawl Archive Now Available

The archive located in the. commoncrawl. bucket at. crawl-data/CC-MAIN-2016-36/. contains more than 1.61 billion web pages. To extend the seed list, we've added 50 million hosts from the. Common Search host-level pagerank data set.

Common Crawl - Blog - Host- and Domain-Level Web Graphs December 2024 and January/February 2025

CC-MAIN-2025-08. , CC-MAIN-2025-05. , and. CC-MAIN-2024-51. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior. Web Graph releases.

Common Crawl - Blog - Host- and Domain-Level Web Graphs Mar/May/Oct 2023

CC-MAIN-2023-14. , CC-MAIN-2023-23. , and. CC-MAIN-2023-40. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior. web graph releases.

Common Crawl - Blog - Host- and Domain-Level Web Graphs February, March, and April 2025

CC-MAIN-2025-08. , CC-MAIN-2025-13. , and. CC-MAIN-2025-18. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior. Web Graph releases.

Common Crawl - Blog - August 2015 Crawl Archive Available

crawl-data/CC-MAIN-2015-35/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments. (CC-MAIN-2015-35/segment.paths.gz). all WARC files. (CC-MAIN-2015-35/warc.paths.gz). all WAT files.

Common Crawl - Blog - July 2015 Crawl Archive Available

crawl-data/CC-MAIN-2015-32/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments. (CC-MAIN-2015-32/segment.paths.gz). all WARC files. (CC-MAIN-2015-32/warc.paths.gz). all WAT files.

Common Crawl - Blog - April 2015 Crawl Archive Available

crawl-data/CC-MAIN-2015-18/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments. (CC-MAIN-2015-18/segment.paths.gz). all WARC files. (CC-MAIN-2015-18/warc.paths.gz). all WAT files.

Common Crawl - Blog - February 2015 Crawl Archive Available

crawl-data/CC-MAIN-2015-11/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments. (CC-MAIN-2015-11/segment.paths.gz). all WARC files. (CC-MAIN-2015-11/warc.paths.gz). all WAT files.

Common Crawl - Erratum - WAT data: repeated WARC and HTTP headers are not preserved

CC-MAIN-2024-51. , see. ia-web-commons#18. All. WAT. files from. CC-MAIN-2013-20. until. CC-MAIN-2024-46. are affected. Affected Crawls. The Data. Overview. Web Graphs. Latest Crawl. Crawl Stats. Graph Stats. Errata. Resources. Get Started. AI Agent.

Common Crawl - Blog - Host- and Domain-Level Web Graphs Jul/Aug/Sep 2020

You may also visit the projects. cc-webgraph. and. cc-pyspark. which host all scripts and tools required to construct the graphs. Host-level graph.

Common Crawl - Blog - Host- and Domain-Level Web Graphs Feb/Mar/May 2020

You may also visit the projects. cc-webgraph. and. cc-pyspark. which host all scripts and tools required to construct the graphs. What's new? The host-level graph now includes hosts visited by the crawler but not linking to any other host.

Common Crawl - Blog - Host- and Domain-Level Web Graphs Nov/Dec/Jan 2019 – 2020

You may also visit the projects. cc-webgraph. and. cc-pyspark. which host all scripts and tools required to construct the graphs. Host-level graph.

Common Crawl - Blog - Host- and Domain-Level Web Graphs May, June/July and August 2022

You may also visit the projects. cc-webgraph. and. cc-pyspark. which include all scripts and tools required to construct the graphs. Instructions to explore the graphs in the webgraph format are given in our collection of. webgraph notebooks.

Common Crawl - Blog - Web Data Commons Extraction Framework for the Distributed Processing of CC Data

Web Data Commons Extraction Framework for the Distributed Processing of CC Data.

Common Crawl - Blog - Host- and Domain-Level Web Graphs Aug/Sep/Oct 2018

You may also visit the projects. cc-webgraph. and. cc-pyspark. which host all scripts and tools required to construct the graphs. Host-level graph.

Common Crawl - Blog - Host- and Domain-Level Web Graphs Nov/Dec/Jan 2018 - 2019

You may also visit the projects. cc-webgraph. and. cc-pyspark. which host all scripts and tools required to construct the graphs. Host-level graph.

Common Crawl - Blog - Host- and Domain-Level Web Graphs Aug/Sep/Oct 2019

You may also visit the projects. cc-webgraph. and. cc-pyspark. which host all scripts and tools required to construct the graphs. What's new?

Common Crawl - Blog - Host- and Domain-Level Web Graphs May/June/July 2019

You may also visit the projects. cc-webgraph. and. cc-pyspark. on GitHub which host all scripts and tools required to construct the graphs. What's new? Links from Content-Location and Link HTTP headers are now also used to span up the web graphs.

Common Crawl - Blog - January/February 2025 Newsletter

Annotation for Language Identification. cc-downloader Command Line Tool. Citations Updates. Common Crawl at SXSW 2025. Software Heritage Symposium at UNESCO. NeurIPS 2024 Social with Wikimedia. Annotation for Language Identification.

Common Crawl - Erratum - Erroneous title field in WAT records

CC-MAIN-2024-42. by. commoncrawl/ia-web-commons#37. This erratum affects all crawls from. CC-MAIN-2013-20. until. CC-MAIN-2024-38. Affected Crawls. The Data. Overview. Web Graphs. Latest Crawl. Crawl Stats. Graph Stats. Errata. Resources. Get Started.

Common Crawl - Blog - A Further Look Into the Prevalence of Various ML Opt–Out Protocols

We will refer to them as seed-crawl/CC-MAIN-2023-40 and seed-crawl/CC-MAIN-2023-50 respectively. Using two iterations is important because it gives us more insight on how the prevalence of each opt–out protocol may have changed over time.