Search results
Today we are happy to announce. cc-downloader. , an experimental command-line tool for downloading Common Crawl data via. https. cc-downloader. is intended to be a user-friendly and polite downloader.…
The crawls used to generate the graphs were CC-MAIN-2024-33, CC-MAIN-2024-30, and CC-MAIN-2024-26. Thom Vaughan. Thom is Principal Technologist at the Common Crawl Foundation.…
The crawls used to generate the graphs were CC-MAIN-2024-33, CC-MAIN-2024-38, and CC-MAIN-2024-42. Thom Vaughan. Thom is Principal Technologist at the Common Crawl Foundation.…
The crawls used to generate the graphs were CC-MAIN-2024-30, CC-MAIN-2024-33, and CC-MAIN-2024-38. Thom Vaughan. Thom is Principal Technologist at the Common Crawl Foundation.…
The crawls used to generate the graphs were CC-MAIN-2024-42, CC-MAIN-2024-46, and CC-MAIN-2024-51. Thom Vaughan. Thom is Principal Technologist at the Common Crawl Foundation.…
The crawls used to generate the graphs were CC-MAIN-2024-18, CC-MAIN-2024-22, and CC-MAIN-2024-26. Thom Vaughan. Thom is Principal Technologist at the Common Crawl Foundation.…
The crawls used to generate the graphs were CC-MAIN-2024-46, CC-MAIN-2024-42, and CC-MAIN-2024-38. Thom Vaughan. Thom is Principal Technologist at the Common Crawl Foundation.…
New Staff Members. New Board Member. Discord Server. Updated Legal Information. Crawl & Graph Errata. Improved Cadence. Acknowledgements. Web Graphs. Our.…
CC cannot guarantee the truthfulness, authenticity, quality, lawfulness or accuracy of the Crawled Content.…
CC-MAIN-2016-36. to. CC-MAIN-2016-50. , and. CC-MAIN-2018-34. to. CC-MAIN-2019-47. the fetch_time metadata for. robots.txt. might be incorrect. The correct times can be found in. collinfo.json.…
CC-MAIN-2018-34. to. CC-MAIN-2024-46. (since. Aug 2018. ) lack the metadata record which is attached to all response records. Fixed with. CC-MAIN-2024-51. , see. commoncrawl/nutch#33. Note: before.…
crawl-data/CC-MAIN-2014-35/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments. (CC-MAIN-2014-35/segment.paths.gz). all WARC files. (CC-MAIN-2014-35/warc.paths.gz). all WAT files.…
crawl-data/CC-MAIN-2014-52/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments. (CC-MAIN-2014-52/segment.paths.gz). all WARC files. (CC-MAIN-2014-52/warc.paths.gz). all WAT files.…
More than 1.46 billion web pages are in the archive, which is located in the. commoncrawl. bucket at. crawl-data/CC-MAIN-2016-22/. To assist with exploring and using the dataset, we provide gzipped files that list: all segments.…
The archive is located in the. commoncrawl. bucket at. crawl-data/CC-MAIN-2016-44/. It contains more than 3.25 billion web pages. Similar to the.…
The archive is located in the. commoncrawl. bucket at. crawl-data/CC-MAIN-2016-50/. It contains more than 2.85 billion web pages. Similar to the preceding. September. and.…
The archive located in the. commoncrawl. bucket at. crawl-data/CC-MAIN-2016-30/. contains more than 1.73 billion web pages. To assist with exploring and using the dataset, we provide gzipped files that list: all segments.…
The archive located in the. commoncrawl. bucket at. crawl-data/CC-MAIN-2016-40/. contains more than 1.72 billion web pages.…
The archive located in the. commoncrawl. bucket at. crawl-data/CC-MAIN-2016-26/. contains more than 1.23 billion web pages. To assist with exploring and using the dataset, we provide gzipped files that list: all segments.…
crawl-data/CC-MAIN-2014-42/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments. (CC-MAIN-2014-42/segment.paths.gz). all WARC files. (CC-MAIN-2014-42/warc.paths.gz). all WAT files.…
crawl-data/CC-MAIN-2014-41/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments. (CC-MAIN-2014-41/segment.paths.gz). all WARC files. (CC-MAIN-2014-41/warc.paths.gz). all WAT files.…
crawl-data/CC-MAIN-2015-14/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments. (CC-MAIN-2015-14/segment.paths.gz). all WARC files. (CC-MAIN-2015-14/warc.paths.gz). all WAT files.…
crawl-data/CC-MAIN-2015-27/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments. (CC-MAIN-2015-27/segment.paths.gz). all WARC files. (CC-MAIN-2015-27/warc.paths.gz). all WAT files.…
crawl-data/CC-MAIN-2015-22/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments. (CC-MAIN-2015-22/segment.paths.gz). all WARC files. (CC-MAIN-2015-22/warc.paths.gz). all WAT files.…
crawl-data/CC-MAIN-2016-07/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments. (CC-MAIN-2016-07/segment.paths.gz). all WARC files. (CC-MAIN-2016-07/warc.paths.gz). all WAT files.…
crawl-data/CC-MAIN-2014-15/. To assist with exploring and using the dataset, we've provided gzipped files that list: all segments. (CC-MAIN-2014-15/segment.paths.gz). all WARC files. (CC-MAIN-2014-15/warc.paths.gz). all WAT files.…
CC-MAIN-2024-22. , CC-MAIN-2024-26. , and. CC-MAIN-2024-30. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior. Web Graph Releases.…
CC-MAIN-2025-26. , CC-MAIN-2025-21. , and. CC-MAIN-2025-18. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior. Web Graph Releases.…
crawl-data/CC-MAIN-2014-49/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments. (CC-MAIN-2014-49/segment.paths.gz). all WARC files. (CC-MAIN-2014-49/warc.paths.gz). all WAT files.…
crawl-data/CC-MAIN-2015-06/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments. (CC-MAIN-2015-06/segment.paths.gz). all WARC files. (CC-MAIN-2015-06/warc.paths.gz). all WAT files.…
crawl-data/CC-MAIN-2014-23/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments. (CC-MAIN-2014-23/segment.paths.gz). all WARC files. (CC-MAIN-2014-23/warc.paths.gz). all WAT files.…
More than 1.33 billion webpages are in the archive, which islocated in the. commoncrawl. bucket at. crawl-data/CC-MAIN-2016-18/. To assist with exploring and using the dataset, we provide gzipped files that list: all segments.…
crawl-data/CC-MAIN-2015-40/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments. (CC-MAIN-2015-40/segment.paths.gz). all WARC files. (CC-MAIN-2015-40/warc.paths.gz). all WAT files.…
crawl-data/CC-MAIN-2015-48/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments. (CC-MAIN-2015-48/segment.paths.gz). all WARC files. (CC-MAIN-2015-48/warc.paths.gz). all WAT files.…
CC-MAIN-2024-10. , CC-MAIN-2024-18. , and. CC-MAIN-2024-22. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior. webgraph releases.…
CC-MAIN-2023-23. , CC-MAIN-2023-40. , and. CC-MAIN-2023-50. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior. webgraph releases.…
CC-MAIN-2025-13. , CC-MAIN-2025-08. , and. CC-MAIN-2025-05. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior. Web Graph releases.…
CC-MAIN-2023-40. , CC-MAIN-2023-50. , and. CC-MAIN-2024-10. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior. webgraph releases.…
CC-MAIN-2023-50. , CC-MAIN-2024-10. , and. CC-MAIN-2024-18. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior. webgraph releases.…
CC-MAIN-2025-05. , CC-MAIN-2024-51. , and. CC-MAIN-2024-46. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior. Web Graph releases.…
The archive located in the. commoncrawl. bucket at. crawl-data/CC-MAIN-2016-36/. contains more than 1.61 billion web pages. To extend the seed list, we've added 50 million hosts from the. Common Search host-level pagerank data set.…
You may also visit the projects. cc-webgraph. and. cc-pyspark. which include all scripts and tools required to construct the graphs. Instructions to explore the graphs in the webgraph format are given in our collection of. webgraph notebooks.…
CC-MAIN-2025-08. , CC-MAIN-2025-05. , and. CC-MAIN-2024-51. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior. Web Graph releases.…
CC-MAIN-2025-08. , CC-MAIN-2025-13. , and. CC-MAIN-2025-18. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior. Web Graph releases.…
CC-MAIN-2025-13. , CC-MAIN-2025-18. , and. CC-MAIN-2025-21. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior. Web Graph releases.…
CC-MAIN-2023-14. , CC-MAIN-2023-23. , and. CC-MAIN-2023-40. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior. web graph releases.…
crawl-data/CC-MAIN-2015-18/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments. (CC-MAIN-2015-18/segment.paths.gz). all WARC files. (CC-MAIN-2015-18/warc.paths.gz). all WAT files.…
crawl-data/CC-MAIN-2015-32/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments. (CC-MAIN-2015-32/segment.paths.gz). all WARC files. (CC-MAIN-2015-32/warc.paths.gz). all WAT files.…
crawl-data/CC-MAIN-2015-35/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments. (CC-MAIN-2015-35/segment.paths.gz). all WARC files. (CC-MAIN-2015-35/warc.paths.gz). all WAT files.…
crawl-data/CC-MAIN-2015-11/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments. (CC-MAIN-2015-11/segment.paths.gz). all WARC files. (CC-MAIN-2015-11/warc.paths.gz). all WAT files.…
CC-MAIN-2024-51. , see. ia-web-commons#18. All. WAT. files from. CC-MAIN-2013-20. until. CC-MAIN-2024-46. are affected. Affected Crawls. The Data. Overview. Web Graphs. Latest Crawl. Crawl Stats. Graph Stats. Errata. Resources. Get Started. AI Agent.…
You may also visit the projects. cc-webgraph. and. cc-pyspark. which host all scripts and tools required to construct the graphs. Host-level graph.…
You may also visit the projects. cc-webgraph. and. cc-pyspark. which host all scripts and tools required to construct the graphs. What's new? The host-level graph now includes hosts visited by the crawler but not linking to any other host.…
You may also visit the projects. cc-webgraph. and. cc-pyspark. which host all scripts and tools required to construct the graphs. Host-level graph.…
You may also visit the projects. cc-webgraph. and. cc-pyspark. which host all scripts and tools required to construct the graphs. Host-level graph.…
You may also visit the projects. cc-webgraph. and. cc-pyspark. which include all scripts and tools required to construct the graphs. Instructions to explore the graphs in the webgraph format are given in our collection of. webgraph notebooks.…
You may also visit the projects. cc-webgraph. and. cc-pyspark. which host all scripts and tools required to construct the graphs. What's new?…
Web Data Commons Extraction Framework for the Distributed Processing of CC Data.…
You may also visit the projects. cc-webgraph. and. cc-pyspark. which host all scripts and tools required to construct the graphs. What's new?…
You may also visit the projects. cc-webgraph. and. cc-pyspark. on GitHub which host all scripts and tools required to construct the graphs. What's new? Links from Content-Location and Link HTTP headers are now also used to span up the web graphs.…