Search results

Common Crawl - Blog - Introducing cc-downloader

Today we are happy to announce cc-downloader, an experimental command-line tool for downloading Common Crawl data via https. cc-downloader is intended to be a user-friendly and polite downloader.

Common Crawl - Erratum - Columnar Index Subsets with Fewer than 900 Partitions per Crawl

The columnar index is partitioned using Hive partitioning (column=value), leading to the following structure:
s3://commoncrawl/cc-index/table/cc-main/warc/
|-- crawl=CC-MAIN-2025-46
|   |-- subset=crawldiagnostics
|   |   `--
|   |-- subset=robotstxt
|   |   `-
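As a minimal sketch of how the column=value convention can be read back out of a path, the helper below splits a path on "/" and keeps only the segments that contain "=". The function name parse_hive_partitions and the example path (assembled from the prefix and partition names in the snippet above) are illustrative assumptions, not part of any Common Crawl tooling.

```python
def parse_hive_partitions(path: str) -> dict[str, str]:
    """Extract column=value pairs from a Hive-partitioned path."""
    partitions = {}
    for segment in path.split("/"):
        # Only segments shaped like column=value carry partition information.
        if "=" in segment:
            column, _, value = segment.partition("=")
            partitions[column] = value
    return partitions


# Hypothetical example path built from the structure shown in the snippet.
example = "s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2025-46/subset=robotstxt/"
print(parse_hive_partitions(example))
```

Query engines that understand Hive partitioning do this mapping themselves; the sketch only illustrates why filtering on crawl or subset can skip whole directories.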

Common Crawl - Blog - Host- and Domain-Level Web Graphs April, May, and June 2024

The crawls used to generate the graphs were CC-MAIN-2024-18, CC-MAIN-2024-22, and CC-MAIN-2024-26. Thom Vaughan. Thom is Principal Engineer at the Common Crawl Foundation.

Common Crawl - Blog - Host- and Domain-Level Web Graphs September, October, November 2024

The crawls used to generate the graphs were CC-MAIN-2024-46, CC-MAIN-2024-42, and CC-MAIN-2024-38. Thom Vaughan. Thom is Principal Engineer at the Common Crawl Foundation.

Common Crawl - Blog - Host- and Domain-Level Web Graphs July, August, and September 2024

The crawls used to generate the graphs were CC-MAIN-2024-30, CC-MAIN-2024-33, and CC-MAIN-2024-38. Thom Vaughan. Thom is Principal Engineer at the Common Crawl Foundation.

Common Crawl - Blog - Host- and Domain-Level Web Graphs October, November, and December 2024

The crawls used to generate the graphs were CC-MAIN-2024-42, CC-MAIN-2024-46, and CC-MAIN-2024-51. Thom Vaughan. Thom is Principal Engineer at the Common Crawl Foundation.

Common Crawl - Blog - Host- and Domain-Level Web Graphs August, September, and October 2024

The crawls used to generate the graphs were CC-MAIN-2024-33, CC-MAIN-2024-38, and CC-MAIN-2024-42. Thom Vaughan. Thom is Principal Engineer at the Common Crawl Foundation.

Common Crawl - Blog - Host- and Domain-Level Web Graphs June, July, and August 2024

The crawls used to generate the graphs were CC-MAIN-2024-33, CC-MAIN-2024-30, and CC-MAIN-2024-26. Thom Vaughan. Thom is Principal Engineer at the Common Crawl Foundation.

Common Crawl - Blog - March/April 2024 Newsletter

New Staff Members. New Board Member. Discord Server. Updated Legal Information. Crawl & Graph Errata. Improved Cadence. Acknowledgements. Web Graphs. Our.

Common Crawl - Blog - CC-Citations: A Visualization of Research Papers Referencing Common Crawl

CC-Citations: A Visualization of Research Papers Referencing Common Crawl. We are proud to release an interactive visualization of thousands of research papers using or citing Common Crawl data. Malte Ostendorff.

Common Crawl - Terms of Use

CC cannot guarantee the truthfulness, authenticity, quality, lawfulness or accuracy of the Crawled Content.

Common Crawl - Erratum - Truncated WAT Files

Four WAT files of the March 2017 crawl (CC-MAIN-2017-13) are truncated, potentially causing an error when processing them.

Common Crawl - Erratum - Incorrect fetch_time metadata

CC-MAIN-2016-36 to CC-MAIN-2016-50, and CC-MAIN-2018-34 to CC-MAIN-2019-47 the fetch_time metadata for robots.txt might be incorrect. The correct times can be found in collinfo.json.

Common Crawl - Erratum - WARC revisit metadata records

CC-MAIN-2018-34 to CC-MAIN-2024-46 (since Aug 2018) lack the metadata record which is attached to all response records. Fixed with CC-MAIN-2024-51, see commoncrawl/nutch#33. Note: before.

Common Crawl - Blog - June 2016 Crawl Archive Now Available

The archive located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-26/ contains more than 1.23 billion web pages. To assist with exploring and using the dataset, we provide gzipped files that list: all segments.

Common Crawl - Erratum - Missing WARC File

One WARC and WET is missing in June 2017 Crawl (CC-MAIN-2017-26). The corresponding WAT file is present, as well as the URL index entries contained in the missing WARC file. For more details, see the release announcement in the Common Crawl Google Group.

Common Crawl - Blog - August 2014 Crawl Data Available

crawl-data/CC-MAIN-2014-35/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2014-35/segment.paths.gz), all WARC files (CC-MAIN-2014-35/warc.paths.gz), all WAT files.
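The path listings mentioned in these announcements follow a regular naming pattern, so a small helper can build their locations and read a downloaded listing. This is a sketch under stated assumptions: the base URL https://data.commoncrawl.org/ is not given in the snippets, the function names listing_url and read_listing are hypothetical, and only the two listing names quoted in the snippets (segment.paths.gz and warc.paths.gz) are supported.

```python
import gzip

# Assumption: public HTTPS endpoint; the snippets only give bucket-relative paths.
BASE_URL = "https://data.commoncrawl.org/"


def listing_url(crawl_id: str, kind: str) -> str:
    """Build the URL of a gzipped path listing, e.g. kind="warc" -> warc.paths.gz."""
    if kind not in ("segment", "warc"):
        raise ValueError(f"unsupported listing kind: {kind!r}")
    return f"{BASE_URL}crawl-data/{crawl_id}/{kind}.paths.gz"


def read_listing(gz_bytes: bytes) -> list[str]:
    """Decompress a downloaded listing and return its non-empty lines."""
    text = gzip.decompress(gz_bytes).decode("utf-8")
    return [line for line in text.splitlines() if line]


print(listing_url("CC-MAIN-2014-35", "warc"))
```

The download step is deliberately left out; any HTTP client can fetch the URL and pass the response body to read_listing.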

Common Crawl - Blog - December 2014 Crawl Archive Available

crawl-data/CC-MAIN-2014-52/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2014-52/segment.paths.gz), all WARC files (CC-MAIN-2014-52/warc.paths.gz), all WAT files.

Common Crawl - Blog - May 2016 Crawl Archive Now Available

More than 1.46 billion web pages are in the archive, which is located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-22/. To assist with exploring and using the dataset, we provide gzipped files that list: all segments.

Common Crawl - Blog - April 2014 Crawl Data Available

crawl-data/CC-MAIN-2014-15/. To assist with exploring and using the dataset, we've provided gzipped files that list: all segments (CC-MAIN-2014-15/segment.paths.gz), all WARC files (CC-MAIN-2014-15/warc.paths.gz), all WAT files.

Common Crawl - Blog - Host- and Domain-Level Web Graphs September/October, November/December 2023 and February/March 2024

CC-MAIN-2023-40, CC-MAIN-2023-50, and CC-MAIN-2024-10. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases.

Common Crawl - Blog - Host- and Domain-Level Web Graphs November/December 2023, February/March 2024, and April 2024

CC-MAIN-2023-50, CC-MAIN-2024-10, and CC-MAIN-2024-18. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases.

Common Crawl - Blog - September 2016 Crawl Archive Now Available

The archive located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-40/ contains more than 1.72 billion web pages.

CDXJ Index

CC-MAIN-2025-43 and query for commoncrawl.org/get-started. This should bring back two records: {
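A lookup like the one described can be expressed as a URL. The sketch below assumes the public index server at index.commoncrawl.org, an endpoint named after the crawl with an "-index" suffix, and the url and output=json query parameters; none of these appear in the snippet, and the function name cdx_query_url is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: host of the public CDX index server (not stated in the snippet).
INDEX_SERVER = "https://index.commoncrawl.org"


def cdx_query_url(crawl_id: str, target: str) -> str:
    """Build an index query URL for one crawl and one target URL."""
    # urlencode percent-escapes the target so it is safe inside a query string.
    query = urlencode({"url": target, "output": "json"})
    return f"{INDEX_SERVER}/{crawl_id}-index?{query}"


print(cdx_query_url("CC-MAIN-2025-43", "commoncrawl.org/get-started"))
```

Fetching the URL and interpreting the returned records is left to the caller, since the record format is cut off in the snippet.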

Common Crawl - Blog - December 2016 Crawl Archive Now Available

The archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-50/. It contains more than 2.85 billion web pages. Similar to the preceding September and

Common Crawl - Blog - October 2016 Crawl Archive Now Available

The archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-44/. It contains more than 3.25 billion web pages. Similar to the

Common Crawl - Blog - Host- and Domain-Level Web Graphs August, September, and October 2025

CC-MAIN-2025-33, CC-MAIN-2025-38, and CC-MAIN-2025-43. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior Web Graph Releases.

Common Crawl - Blog - Host- and Domain-Level Web Graphs November/December 2025 and January 2026

CC-MAIN-2025-47, CC-MAIN-2025-51, and CC-MAIN-2026-04. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior Web Graph Releases.

Common Crawl - Blog - March 2015 Crawl Archive Available

crawl-data/CC-MAIN-2015-14/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2015-14/segment.paths.gz), all WARC files (CC-MAIN-2015-14/warc.paths.gz), all WAT files.

Common Crawl - Blog - June 2015 Crawl Archive Available

crawl-data/CC-MAIN-2015-27/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2015-27/segment.paths.gz), all WARC files (CC-MAIN-2015-27/warc.paths.gz), all WAT files.

Common Crawl - Blog - May 2015 Crawl Archive Available

crawl-data/CC-MAIN-2015-22/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2015-22/segment.paths.gz), all WARC files (CC-MAIN-2015-22/warc.paths.gz), all WAT files.

Common Crawl - Blog - Host- and Domain-Level Web Graphs Mar/May/Oct 2023

CC-MAIN-2023-14, CC-MAIN-2023-23, and CC-MAIN-2023-40. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior web graph releases.

Common Crawl - Blog - Host- and Domain-Level Web Graphs March, April, and May 2025

CC-MAIN-2025-13, CC-MAIN-2025-18, and CC-MAIN-2025-21. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior Web Graph releases.

Common Crawl - Blog - Host- and Domain-Level Web Graphs February, March, and April 2025

CC-MAIN-2025-08, CC-MAIN-2025-13, and CC-MAIN-2025-18. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior Web Graph releases.

Common Crawl - Blog - Host- and Domain-Level Web Graphs October, November, December 2025

CC-MAIN-2025-43, CC-MAIN-2025-47, and CC-MAIN-2025-51. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior Web Graph Releases.

Common Crawl - Blog - July 2016 Crawl Archive Now Available

The archive located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-30/ contains more than 1.73 billion web pages. To assist with exploring and using the dataset, we provide gzipped files that list: all segments.

Common Crawl - Blog - August 2015 Crawl Archive Available

crawl-data/CC-MAIN-2015-35/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2015-35/segment.paths.gz), all WARC files (CC-MAIN-2015-35/warc.paths.gz), all WAT files.

Common Crawl - Blog - November 2014 Crawl Archive Available

crawl-data/CC-MAIN-2014-49/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2014-49/segment.paths.gz), all WARC files (CC-MAIN-2014-49/warc.paths.gz), all WAT files.

Common Crawl - Blog - September 2015 Crawl Archive Now Available

crawl-data/CC-MAIN-2015-40/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2015-40/segment.paths.gz), all WARC files (CC-MAIN-2015-40/warc.paths.gz), all WAT files.

Common Crawl - Blog - November 2015 Crawl Archive Now Available

crawl-data/CC-MAIN-2015-48/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2015-48/segment.paths.gz), all WARC files (CC-MAIN-2015-48/warc.paths.gz), all WAT files.

Common Crawl - Blog - September 2014 Crawl Archive Available

crawl-data/CC-MAIN-2014-41/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2014-41/segment.paths.gz), all WARC files (CC-MAIN-2014-41/warc.paths.gz), all WAT files.

Common Crawl - Blog - October 2014 Crawl Archive Available

crawl-data/CC-MAIN-2014-42/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2014-42/segment.paths.gz), all WARC files (CC-MAIN-2014-42/warc.paths.gz), all WAT files.

Common Crawl - Blog - Host- and Domain-Level Web Graphs February/March, April, and May 2024

CC-MAIN-2024-10, CC-MAIN-2024-18, and CC-MAIN-2024-22. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases.

Common Crawl - Blog - Host- and Domain-Level Web Graphs May/Sep/Nov 2023

CC-MAIN-2023-23, CC-MAIN-2023-40, and CC-MAIN-2023-50. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases.

Common Crawl - Blog - Host- and Domain-Level Web Graphs May, June, and July 2024

CC-MAIN-2024-22, CC-MAIN-2024-26, and CC-MAIN-2024-30. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior Web Graph Releases.

Common Crawl - Blog - August 2016 Crawl Archive Now Available

The archive located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-36/ contains more than 1.61 billion web pages. To extend the seed list, we've added 50 million hosts from the Common Search host-level pagerank data set.

Common Crawl - Blog - July 2014 Crawl Data Available

crawl-data/CC-MAIN-2014-23/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2014-23/segment.paths.gz), all WARC files (CC-MAIN-2014-23/warc.paths.gz), all WAT files.

Common Crawl - Blog - Host- and Domain-Level Web Graphs December 2024 and January/February 2025

CC-MAIN-2025-08, CC-MAIN-2025-05, and CC-MAIN-2024-51. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior Web Graph releases.

Common Crawl - Blog - Host- and Domain-Level Web Graphs January, February, and March 2025

CC-MAIN-2025-13, CC-MAIN-2025-08, and CC-MAIN-2025-05. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior Web Graph releases.

Common Crawl - Blog - Host- and Domain-Level Web Graphs May, June/July and August 2022

You may also visit the projects cc-webgraph and cc-pyspark, which include all scripts and tools required to construct the graphs. Instructions to explore the graphs in the webgraph format are given in our collection of webgraph notebooks.

Common Crawl - Blog - April 2015 Crawl Archive Available

crawl-data/CC-MAIN-2015-18/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2015-18/segment.paths.gz), all WARC files (CC-MAIN-2015-18/warc.paths.gz), all WAT files.

Common Crawl - Blog - July 2015 Crawl Archive Available

crawl-data/CC-MAIN-2015-32/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2015-32/segment.paths.gz), all WARC files (CC-MAIN-2015-32/warc.paths.gz), all WAT files.

Common Crawl - Blog - February 2016 Crawl Archive Now Available

crawl-data/CC-MAIN-2016-07/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2016-07/segment.paths.gz), all WARC files (CC-MAIN-2016-07/warc.paths.gz), all WAT files.

Common Crawl - Blog - Host- and Domain-Level Web Graphs July, August, and September 2025

CC-MAIN-2025-30, CC-MAIN-2025-33, and CC-MAIN-2025-38. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior Web Graph Releases.

Common Crawl - Blog - Host- and Domain-Level Web Graphs September, October, and November 2025

CC-MAIN-2025-38, CC-MAIN-2025-43, and CC-MAIN-2025-47. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior Web Graph Releases.

Common Crawl - Blog - Host- and Domain-Level Web Graphs April, May, and June 2025

CC-MAIN-2025-26, CC-MAIN-2025-21, and CC-MAIN-2025-18. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior Web Graph Releases.

Common Crawl - Blog - Host- and Domain-Level Web Graphs November/December 2024 and January 2025

CC-MAIN-2025-05, CC-MAIN-2024-51, and CC-MAIN-2024-46. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior Web Graph releases.

Common Crawl - Blog - January 2015 Crawl Archive Available

crawl-data/CC-MAIN-2015-06/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2015-06/segment.paths.gz), all WARC files (CC-MAIN-2015-06/warc.paths.gz), all WAT files.

Common Crawl - Blog - April 2016 Crawl Archive Now Available

More than 1.33 billion web pages are in the archive, which is located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-18/. To assist with exploring and using the dataset, we provide gzipped files that list: all segments.

Common Crawl - Blog - Host- and Domain-Level Web Graphs Nov/Dec/Jan 2018 - 2019

You may also visit the projects cc-webgraph and cc-pyspark, which host all scripts and tools required to construct the graphs. Host-level graph.