Search results

Common Crawl - Blog - Introducing cc-downloader

Today we are happy to announce cc-downloader, an experimental command-line tool for downloading Common Crawl data via https. cc-downloader is intended to be a user-friendly and polite downloader.

Common Crawl - Blog - Host- and Domain-Level Web Graphs April, May, and June 2024

The crawls used to generate the graphs were CC-MAIN-2024-18, CC-MAIN-2024-22, and CC-MAIN-2024-26. Thom Vaughan. Thom is a Principal Engineer at the Common Crawl Foundation.

Common Crawl - Blog - Host- and Domain-Level Web Graphs September, October, November 2024

The crawls used to generate the graphs were CC-MAIN-2024-46, CC-MAIN-2024-42, and CC-MAIN-2024-38. Thom Vaughan. Thom is a Principal Engineer at the Common Crawl Foundation.

Common Crawl - Blog - Host- and Domain-Level Web Graphs June, July, and August 2024

The crawls used to generate the graphs were CC-MAIN-2024-33, CC-MAIN-2024-30, and CC-MAIN-2024-26. Thom Vaughan. Thom is a Principal Engineer at the Common Crawl Foundation.

Common Crawl - Blog - Host- and Domain-Level Web Graphs August, September, and October 2024

The crawls used to generate the graphs were CC-MAIN-2024-33, CC-MAIN-2024-38, and CC-MAIN-2024-42. Thom Vaughan. Thom is a Principal Engineer at the Common Crawl Foundation.

Common Crawl - Blog - Host- and Domain-Level Web Graphs October, November, and December 2024

The crawls used to generate the graphs were CC-MAIN-2024-42, CC-MAIN-2024-46, and CC-MAIN-2024-51. Thom Vaughan. Thom is a Principal Engineer at the Common Crawl Foundation.

Common Crawl - Blog - Host- and Domain-Level Web Graphs July, August, and September 2024

The crawls used to generate the graphs were CC-MAIN-2024-30, CC-MAIN-2024-33, and CC-MAIN-2024-38. Thom Vaughan. Thom is a Principal Engineer at the Common Crawl Foundation.

Common Crawl - Blog - March/April 2024 Newsletter

New Staff Members. New Board Member. Discord Server. Updated Legal Information. Crawl & Graph Errata. Improved Cadence. Acknowledgements. Web Graphs. Our.

Common Crawl - Terms of Use

CC cannot guarantee the truthfulness, authenticity, quality, lawfulness or accuracy of the Crawled Content.

Common Crawl - Erratum - Truncated WAT Files

Four WAT files of the March 2017 crawl (CC-MAIN-2017-13) are truncated, potentially causing an error when processing them.

Common Crawl - Erratum - Incorrect fetch_time metadata

CC-MAIN-2016-36 to CC-MAIN-2016-50, and CC-MAIN-2018-34 to CC-MAIN-2019-47: the fetch_time metadata for robots.txt might be incorrect. The correct times can be found in collinfo.json.

Common Crawl - Erratum - WARC revisit metadata records

CC-MAIN-2018-34 to CC-MAIN-2024-46 (since Aug 2018) lack the metadata record which is attached to all response records. Fixed with CC-MAIN-2024-51, see commoncrawl/nutch#33. Note: before.
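
The errata in these results describe affected crawls as a range of crawl IDs. As a rough sketch, and assuming that every ID follows the `CC-MAIN-YYYY-WW` pattern seen in these results and sorts chronologically by year and then week number, a range check could look like this. The function names `crawl_key` and `in_range` are hypothetical helpers, not part of any Common Crawl tool, and the check does not verify that a given crawl was actually published.

```python
import re

# Assumption: crawl IDs look like CC-MAIN-<4-digit year>-<2-digit week>.
_CRAWL_ID = re.compile(r"^CC-MAIN-(\d{4})-(\d{2})$")


def crawl_key(crawl_id: str) -> tuple[int, int]:
    """Turn a crawl ID into a sortable (year, week) tuple."""
    match = _CRAWL_ID.match(crawl_id)
    if match is None:
        raise ValueError(f"unrecognized crawl ID: {crawl_id!r}")
    return int(match.group(1)), int(match.group(2))


def in_range(crawl_id: str, first: str, last: str) -> bool:
    """Return True if crawl_id lies between first and last, inclusive."""
    return crawl_key(first) <= crawl_key(crawl_id) <= crawl_key(last)
```

For example, `in_range("CC-MAIN-2020-05", "CC-MAIN-2018-34", "CC-MAIN-2024-46")` evaluates to true, while the same call with `"CC-MAIN-2024-51"` evaluates to false.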

Common Crawl - Erratum - Missing WARC File

One WARC file and one WET file are missing from the June 2017 crawl (CC-MAIN-2017-26). The corresponding WAT file is present, as well as the URL index entries contained in the missing WARC file. For more details, see the release announcement in the Common Crawl Google Group.

Common Crawl - Blog - August 2014 Crawl Data Available

crawl-data/CC-MAIN-2014-35/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2014-35/segment.paths.gz), all WARC files (CC-MAIN-2014-35/warc.paths.gz), all WAT files.
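
The crawl announcements in these results mention gzipped listing files such as `warc.paths.gz`. A minimal sketch of reading one, assuming the listing is a gzip-compressed text file with one relative path per line and that the data is served under the HTTPS prefix `https://data.commoncrawl.org/` (the prefix is an assumption, not stated in the snippets). `read_paths` and `to_urls` are hypothetical helper names.

```python
import gzip
import io

# Assumption: public HTTPS prefix for Common Crawl data; adjust if it differs.
BASE_URL = "https://data.commoncrawl.org/"


def read_paths(gz_bytes: bytes) -> list[str]:
    """Decompress a *.paths.gz listing and return its non-empty lines."""
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as fh:
        return [line.strip() for line in fh if line.strip()]


def to_urls(paths: list[str], base_url: str = BASE_URL) -> list[str]:
    """Prefix each relative path with the base URL."""
    return [base_url + path for path in paths]
```

The functions take the already-downloaded bytes of the listing, so fetching the file (and doing so politely) is left to whichever downloader is used.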

Common Crawl - Blog - December 2014 Crawl Archive Available

crawl-data/CC-MAIN-2014-52/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2014-52/segment.paths.gz), all WARC files (CC-MAIN-2014-52/warc.paths.gz), all WAT files.

Common Crawl - Blog - May 2016 Crawl Archive Now Available

More than 1.46 billion web pages are in the archive, which is located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-22/. To assist with exploring and using the dataset, we provide gzipped files that list: all segments.

Common Crawl - Blog - February 2016 Crawl Archive Now Available

crawl-data/CC-MAIN-2016-07/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2016-07/segment.paths.gz), all WARC files (CC-MAIN-2016-07/warc.paths.gz), all WAT files.

Common Crawl - Blog - September 2016 Crawl Archive Now Available

The archive located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-40/ contains more than 1.72 billion web pages.

Common Crawl - Blog - June 2016 Crawl Archive Now Available

The archive located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-26/ contains more than 1.23 billion web pages. To assist with exploring and using the dataset, we provide gzipped files that list: all segments.

Common Crawl - Blog - October 2016 Crawl Archive Now Available

The archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-44/. It contains more than 3.25 billion web pages. Similar to the.

Common Crawl - Blog - December 2016 Crawl Archive Now Available

The archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-50/. It contains more than 2.85 billion web pages. Similar to the preceding September and.

Common Crawl - Blog - January 2015 Crawl Archive Available

crawl-data/CC-MAIN-2015-06/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2015-06/segment.paths.gz), all WARC files (CC-MAIN-2015-06/warc.paths.gz), all WAT files.

Common Crawl - Blog - April 2016 Crawl Archive Now Available

More than 1.33 billion webpages are in the archive, which is located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-18/. To assist with exploring and using the dataset, we provide gzipped files that list: all segments.

Common Crawl - Blog - July 2016 Crawl Archive Now Available

The archive located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-30/ contains more than 1.73 billion web pages. To assist with exploring and using the dataset, we provide gzipped files that list: all segments.

Common Crawl - Blog - Host- and Domain-Level Web Graphs August, September, and October 2025

CC-MAIN-2025-33, CC-MAIN-2025-38, and CC-MAIN-2025-43. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior Web Graph Releases.

Common Crawl - Blog - June 2015 Crawl Archive Available

crawl-data/CC-MAIN-2015-27/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2015-27/segment.paths.gz), all WARC files (CC-MAIN-2015-27/warc.paths.gz), all WAT files.

Common Crawl - Blog - March 2015 Crawl Archive Available

crawl-data/CC-MAIN-2015-14/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2015-14/segment.paths.gz), all WARC files (CC-MAIN-2015-14/warc.paths.gz), all WAT files.

Common Crawl - Blog - May 2015 Crawl Archive Available

crawl-data/CC-MAIN-2015-22/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2015-22/segment.paths.gz), all WARC files (CC-MAIN-2015-22/warc.paths.gz), all WAT files.

Common Crawl - Blog - Host- and Domain-Level Web Graphs May, June, and July 2025

CC-MAIN-2025-21, CC-MAIN-2025-26, and CC-MAIN-2025-30. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior Web Graph Releases.

Common Crawl - Blog - Host- and Domain-Level Web Graphs June, July, and August 2025

CC-MAIN-2025-26, CC-MAIN-2025-30, and CC-MAIN-2025-33. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior Web Graph Releases.

Common Crawl - Blog - November 2014 Crawl Archive Available

crawl-data/CC-MAIN-2014-49/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2014-49/segment.paths.gz), all WARC files (CC-MAIN-2014-49/warc.paths.gz), all WAT files.

Common Crawl - Blog - April 2014 Crawl Data Available

crawl-data/CC-MAIN-2014-15/. To assist with exploring and using the dataset, we've provided gzipped files that list: all segments (CC-MAIN-2014-15/segment.paths.gz), all WARC files (CC-MAIN-2014-15/warc.paths.gz), all WAT files.

Common Crawl - Blog - November 2015 Crawl Archive Now Available

crawl-data/CC-MAIN-2015-48/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2015-48/segment.paths.gz), all WARC files (CC-MAIN-2015-48/warc.paths.gz), all WAT files.

Common Crawl - Blog - September 2015 Crawl Archive Now Available

crawl-data/CC-MAIN-2015-40/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2015-40/segment.paths.gz), all WARC files (CC-MAIN-2015-40/warc.paths.gz), all WAT files.

Common Crawl - Blog - September 2014 Crawl Archive Available

crawl-data/CC-MAIN-2014-41/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2014-41/segment.paths.gz), all WARC files (CC-MAIN-2014-41/warc.paths.gz), all WAT files.

Common Crawl - Blog - October 2014 Crawl Archive Available

crawl-data/CC-MAIN-2014-42/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2014-42/segment.paths.gz), all WARC files (CC-MAIN-2014-42/warc.paths.gz), all WAT files.

Common Crawl - Blog - Host- and Domain-Level Web Graphs February/March, April, and May 2024

CC-MAIN-2024-10, CC-MAIN-2024-18, and CC-MAIN-2024-22. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases.

Common Crawl - Blog - Host- and Domain-Level Web Graphs May/Sep/Nov 2023

CC-MAIN-2023-23, CC-MAIN-2023-40, and CC-MAIN-2023-50. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases.

Common Crawl - Blog - Host- and Domain-Level Web Graphs November/December 2023, February/March 2024, and April 2024

CC-MAIN-2023-50, CC-MAIN-2024-10, and CC-MAIN-2024-18. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases.

Common Crawl - Blog - Host- and Domain-Level Web Graphs September/October, November/December 2023 and February/March 2024

CC-MAIN-2023-40, CC-MAIN-2023-50, and CC-MAIN-2024-10. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases.

Common Crawl - Blog - August 2016 Crawl Archive Now Available

The archive located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-36/ contains more than 1.61 billion web pages. To extend the seed list, we've added 50 million hosts from the Common Search host-level pagerank data set.

Common Crawl - Blog - Host- and Domain-Level Web Graphs May, June, and July 2024

CC-MAIN-2024-22, CC-MAIN-2024-26, and CC-MAIN-2024-30. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior Web Graph Releases.

Common Crawl - Blog - Host- and Domain-Level Web Graphs December 2024 and January/February 2025

CC-MAIN-2025-08, CC-MAIN-2025-05, and CC-MAIN-2024-51. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior Web Graph releases.

Common Crawl - Blog - Host- and Domain-Level Web Graphs January, February, and March 2025

CC-MAIN-2025-13, CC-MAIN-2025-08, and CC-MAIN-2025-05. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior Web Graph releases.

Common Crawl - Blog - Host- and Domain-Level Web Graphs Mar/May/Oct 2023

CC-MAIN-2023-14, CC-MAIN-2023-23, and CC-MAIN-2023-40. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior web graph releases.

Common Crawl - Blog - Host- and Domain-Level Web Graphs March, April, and May 2025

CC-MAIN-2025-13, CC-MAIN-2025-18, and CC-MAIN-2025-21. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior Web Graph releases.

Common Crawl - Blog - Host- and Domain-Level Web Graphs October, November, December 2025

CC-MAIN-2025-43, CC-MAIN-2025-47, and CC-MAIN-2025-51. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior Web Graph Releases.

Common Crawl - Blog - Host- and Domain-Level Web Graphs February, March, and April 2025

CC-MAIN-2025-08, CC-MAIN-2025-13, and CC-MAIN-2025-18. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior Web Graph releases.

Common Crawl - Blog - July 2014 Crawl Data Available

crawl-data/CC-MAIN-2014-23/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2014-23/segment.paths.gz), all WARC files (CC-MAIN-2014-23/warc.paths.gz), all WAT files.

Common Crawl - Blog - April 2015 Crawl Archive Available

crawl-data/CC-MAIN-2015-18/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2015-18/segment.paths.gz), all WARC files (CC-MAIN-2015-18/warc.paths.gz), all WAT files.

Common Crawl - Blog - July 2015 Crawl Archive Available

crawl-data/CC-MAIN-2015-32/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2015-32/segment.paths.gz), all WARC files (CC-MAIN-2015-32/warc.paths.gz), all WAT files.

Common Crawl - Blog - August 2015 Crawl Archive Available

crawl-data/CC-MAIN-2015-35/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2015-35/segment.paths.gz), all WARC files (CC-MAIN-2015-35/warc.paths.gz), all WAT files.

Common Crawl - Blog - Host- and Domain-Level Web Graphs May, June/July and August 2022

You may also visit the projects cc-webgraph and cc-pyspark, which include all scripts and tools required to construct the graphs. Instructions to explore the graphs in the webgraph format are given in our collection of webgraph notebooks.

Common Crawl - Blog - Host- and Domain-Level Web Graphs November/December 2024 and January 2025

CC-MAIN-2025-05, CC-MAIN-2024-51, and CC-MAIN-2024-46. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior Web Graph releases.

Common Crawl - Blog - Host- and Domain-Level Web Graphs July, August, and September 2025

CC-MAIN-2025-30, CC-MAIN-2025-33, and CC-MAIN-2025-38. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior Web Graph Releases.

Common Crawl - Blog - Host- and Domain-Level Web Graphs April, May, and June 2025

CC-MAIN-2025-26, CC-MAIN-2025-21, and CC-MAIN-2025-18. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior Web Graph Releases.

Common Crawl - Blog - Host- and Domain-Level Web Graphs September, October, and November 2025

CC-MAIN-2025-38, CC-MAIN-2025-43, and CC-MAIN-2025-47. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior Web Graph Releases.

Common Crawl - Erratum - WAT data: repeated WARC and HTTP headers are not preserved

CC-MAIN-2024-51, see ia-web-commons#18. All WAT files from CC-MAIN-2013-20 until CC-MAIN-2024-46 are affected.

Common Crawl - Blog - Host- and Domain-Level Web Graphs Jul/Aug/Sep 2020

You may also visit the projects cc-webgraph and cc-pyspark, which host all scripts and tools required to construct the graphs. Host-level graph.

Common Crawl - Blog - Host- and Domain-Level Web Graphs Feb/Mar/May 2020

You may also visit the projects cc-webgraph and cc-pyspark, which host all scripts and tools required to construct the graphs. What's new? The host-level graph now includes hosts visited by the crawler but not linking to any other host.