Search results
Today we are happy to announce cc-downloader, an experimental command-line tool for downloading Common Crawl data via https. cc-downloader is intended to be a user-friendly and polite downloader.…
The crawls used to generate the graphs were CC-MAIN-2024-33, CC-MAIN-2024-30, and CC-MAIN-2024-26. Thom Vaughan. Thom is Principal Technologist at the Common Crawl Foundation.…
The crawls used to generate the graphs were CC-MAIN-2024-33, CC-MAIN-2024-38, and CC-MAIN-2024-42. Thom Vaughan. Thom is Principal Technologist at the Common Crawl Foundation.…
The crawls used to generate the graphs were CC-MAIN-2024-30, CC-MAIN-2024-33, and CC-MAIN-2024-38. Thom Vaughan. Thom is Principal Technologist at the Common Crawl Foundation.…
The crawls used to generate the graphs were CC-MAIN-2024-42, CC-MAIN-2024-46, and CC-MAIN-2024-51. Thom Vaughan. Thom is Principal Technologist at the Common Crawl Foundation.…
The crawls used to generate the graphs were CC-MAIN-2024-18, CC-MAIN-2024-22, and CC-MAIN-2024-26. Thom Vaughan. Thom is Principal Technologist at the Common Crawl Foundation.…
The crawls used to generate the graphs were CC-MAIN-2024-46, CC-MAIN-2024-42, and CC-MAIN-2024-38. Thom Vaughan. Thom is Principal Technologist at the Common Crawl Foundation.…
New Staff Members. New Board Member. Discord Server. Updated Legal Information. Crawl & Graph Errata. Improved Cadence. Acknowledgements. Web Graphs. Our.…
CC cannot guarantee the truthfulness, authenticity, quality, lawfulness or accuracy of the Crawled Content.…
CC-MAIN-2016-36 to CC-MAIN-2016-50, and CC-MAIN-2018-34 to CC-MAIN-2019-47, the fetch_time metadata for robots.txt might be incorrect. The correct times can be found in collinfo.json.…
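A minimal sketch of checking whether a given crawl falls inside the ranges named in the snippet above. It assumes that crawl IDs follow the pattern CC-MAIN-YYYY-WW, that IDs order by year and then week, and that the stated ranges are inclusive at both ends; the names `crawl_key`, `is_affected`, and `AFFECTED_RANGES` are hypothetical helpers for illustration, not part of any Common Crawl tool.

```python
import re

# Assumed crawl ID shape: CC-MAIN-<4-digit year>-<2-digit week>.
CRAWL_ID = re.compile(r"^CC-MAIN-(\d{4})-(\d{2})$")

# Ranges copied from the erratum snippet; treated as inclusive (assumption).
AFFECTED_RANGES = [
    ("CC-MAIN-2016-36", "CC-MAIN-2016-50"),
    ("CC-MAIN-2018-34", "CC-MAIN-2019-47"),
]


def crawl_key(crawl_id):
    """Turn a crawl ID into a sortable (year, week) tuple."""
    match = CRAWL_ID.match(crawl_id)
    if match is None:
        raise ValueError(f"not a CC-MAIN crawl ID: {crawl_id!r}")
    return int(match.group(1)), int(match.group(2))


def is_affected(crawl_id, ranges=AFFECTED_RANGES):
    """Return True if crawl_id lies within any of the inclusive ranges."""
    key = crawl_key(crawl_id)
    return any(crawl_key(lo) <= key <= crawl_key(hi) for lo, hi in ranges)


if __name__ == "__main__":
    for cid in ("CC-MAIN-2016-44", "CC-MAIN-2017-04", "CC-MAIN-2019-47"):
        print(cid, is_affected(cid))
```

Comparing (year, week) tuples rather than raw strings keeps the ordering explicit, though for zero-padded IDs of this shape a plain string comparison would give the same result.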
The archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-44/. It contains more than 3.25 billion web pages. Similar to the…
The archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-50/. It contains more than 2.85 billion web pages. Similar to the preceding September and…
The archive located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-30/ contains more than 1.73 billion web pages. To assist with exploring and using the dataset, we provide gzipped files that list: all segments…
CC-MAIN-2018-34 to CC-MAIN-2024-46 (since Aug 2018) lack the metadata record which is attached to all response records. Fixed with CC-MAIN-2024-51, see commoncrawl/nutch#33. Note: before…
crawl-data/CC-MAIN-2016-07/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2016-07/segment.paths.gz), all WARC files (CC-MAIN-2016-07/warc.paths.gz), all WAT files…
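The gzipped listing files mentioned in the snippet above can be handled roughly as follows. This is a sketch under stated assumptions: the listing key is taken to be the crawl directory plus the file name shown in the snippet, each listing is assumed to hold one path per line, and `listing_key` and `read_paths` are hypothetical helper names. Downloading is left out; the example only decodes bytes that have already been fetched.

```python
import gzip


def listing_key(crawl_id, kind):
    """Build the key of a paths listing, e.g. kind='warc' or kind='segment'.

    Assumes the layout crawl-data/<crawl_id>/<kind>.paths.gz, as suggested
    by the directory and file names in the snippet.
    """
    return f"crawl-data/{crawl_id}/{kind}.paths.gz"


def read_paths(gz_bytes):
    """Decompress a gzipped listing and return its non-empty lines.

    Assumes one path per line, UTF-8 encoded.
    """
    text = gzip.decompress(gz_bytes).decode("utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    print(listing_key("CC-MAIN-2016-07", "warc"))
    # Stand-in data with made-up paths, used instead of a real download.
    sample = gzip.compress(b"example/segment-a/file-1.warc.gz\n"
                           b"example/segment-b/file-2.warc.gz\n")
    for path in read_paths(sample):
        print(path)
```

Keeping the decoding step separate from any network access makes it easy to test against small in-memory samples before pointing it at a real listing.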
crawl-data/CC-MAIN-2014-35/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2014-35/segment.paths.gz), all WARC files (CC-MAIN-2014-35/warc.paths.gz), all WAT files…
The archive located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-40/ contains more than 1.72 billion web pages.…
crawl-data/CC-MAIN-2014-52/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2014-52/segment.paths.gz), all WARC files (CC-MAIN-2014-52/warc.paths.gz), all WAT files…
The archive located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-26/ contains more than 1.23 billion web pages. To assist with exploring and using the dataset, we provide gzipped files that list: all segments…
crawl-data/CC-MAIN-2015-06/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2015-06/segment.paths.gz), all WARC files (CC-MAIN-2015-06/warc.paths.gz), all WAT files…
More than 1.33 billion web pages are in the archive, which is located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-18/. To assist with exploring and using the dataset, we provide gzipped files that list: all segments…
crawl-data/CC-MAIN-2014-41/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2014-41/segment.paths.gz), all WARC files (CC-MAIN-2014-41/warc.paths.gz), all WAT files…
crawl-data/CC-MAIN-2014-42/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2014-42/segment.paths.gz), all WARC files (CC-MAIN-2014-42/warc.paths.gz), all WAT files…
More than 1.46 billion web pages are in the archive, which is located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-22/. To assist with exploring and using the dataset, we provide gzipped files that list: all segments…
crawl-data/CC-MAIN-2014-15/. To assist with exploring and using the dataset, we've provided gzipped files that list: all segments (CC-MAIN-2014-15/segment.paths.gz), all WARC files (CC-MAIN-2014-15/warc.paths.gz), all WAT files…
CC-MAIN-2024-22, CC-MAIN-2024-26, and CC-MAIN-2024-30. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior Web Graph Releases.…
crawl-data/CC-MAIN-2014-49/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2014-49/segment.paths.gz), all WARC files (CC-MAIN-2014-49/warc.paths.gz), all WAT files…
crawl-data/CC-MAIN-2014-23/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2014-23/segment.paths.gz), all WARC files (CC-MAIN-2014-23/warc.paths.gz), all WAT files…
crawl-data/CC-MAIN-2015-40/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2015-40/segment.paths.gz), all WARC files (CC-MAIN-2015-40/warc.paths.gz), all WAT files…
crawl-data/CC-MAIN-2015-48/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2015-48/segment.paths.gz), all WARC files (CC-MAIN-2015-48/warc.paths.gz), all WAT files…
CC-MAIN-2023-23, CC-MAIN-2023-40, and CC-MAIN-2023-50. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases.…
CC-MAIN-2024-10, CC-MAIN-2024-18, and CC-MAIN-2024-22. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases.…
CC-MAIN-2025-13, CC-MAIN-2025-08, and CC-MAIN-2025-05. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior Web Graph releases.…
CC-MAIN-2023-40, CC-MAIN-2023-50, and CC-MAIN-2024-10. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases.…
CC-MAIN-2023-50, CC-MAIN-2024-10, and CC-MAIN-2024-18. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases.…
crawl-data/CC-MAIN-2015-27/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2015-27/segment.paths.gz), all WARC files (CC-MAIN-2015-27/warc.paths.gz), all WAT files…
crawl-data/CC-MAIN-2015-14/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2015-14/segment.paths.gz), all WARC files (CC-MAIN-2015-14/warc.paths.gz), all WAT files…
crawl-data/CC-MAIN-2015-22/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2015-22/segment.paths.gz), all WARC files (CC-MAIN-2015-22/warc.paths.gz), all WAT files…
CC-MAIN-2025-05, CC-MAIN-2024-51, and CC-MAIN-2024-46. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior Web Graph releases.…
The archive located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-36/ contains more than 1.61 billion web pages. To extend the seed list, we've added 50 million hosts from the Common Search host-level pagerank data set.…
CC-MAIN-2025-08, CC-MAIN-2025-05, and CC-MAIN-2024-51. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior Web Graph releases.…
CC-MAIN-2023-14, CC-MAIN-2023-23, and CC-MAIN-2023-40. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior web graph releases.…
CC-MAIN-2025-08, CC-MAIN-2025-13, and CC-MAIN-2025-18. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior Web Graph releases.…
crawl-data/CC-MAIN-2015-35/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2015-35/segment.paths.gz), all WARC files (CC-MAIN-2015-35/warc.paths.gz), all WAT files…
crawl-data/CC-MAIN-2015-32/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2015-32/segment.paths.gz), all WARC files (CC-MAIN-2015-32/warc.paths.gz), all WAT files…
crawl-data/CC-MAIN-2015-18/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2015-18/segment.paths.gz), all WARC files (CC-MAIN-2015-18/warc.paths.gz), all WAT files…
crawl-data/CC-MAIN-2015-11/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2015-11/segment.paths.gz), all WARC files (CC-MAIN-2015-11/warc.paths.gz), all WAT files…
CC-MAIN-2024-51, see ia-web-commons#18. All WAT files from CC-MAIN-2013-20 until CC-MAIN-2024-46 are affected. Affected Crawls.…
You may also visit the projects cc-webgraph and cc-pyspark, which host all scripts and tools required to construct the graphs. Host-level graph.…
You may also visit the projects cc-webgraph and cc-pyspark, which host all scripts and tools required to construct the graphs. What's new? The host-level graph now includes hosts visited by the crawler but not linking to any other host.…
You may also visit the projects cc-webgraph and cc-pyspark, which host all scripts and tools required to construct the graphs. Host-level graph.…
You may also visit the projects cc-webgraph and cc-pyspark, which include all scripts and tools required to construct the graphs. Instructions to explore the graphs in the webgraph format are given in our collection of webgraph notebooks.…
Web Data Commons Extraction Framework for the Distributed Processing of CC Data.…
You may also visit the projects cc-webgraph and cc-pyspark, which host all scripts and tools required to construct the graphs. Host-level graph.…
You may also visit the projects cc-webgraph and cc-pyspark, which host all scripts and tools required to construct the graphs. Host-level graph.…
You may also visit the projects cc-webgraph and cc-pyspark, which host all scripts and tools required to construct the graphs. What's new?…
You may also visit the projects cc-webgraph and cc-pyspark on GitHub, which host all scripts and tools required to construct the graphs. What's new? Links from Content-Location and Link HTTP headers are now also used to span up the web graphs.…
Annotation for Language Identification. cc-downloader Command Line Tool. Citations Updates. Common Crawl at SXSW 2025. Software Heritage Symposium at UNESCO. NeurIPS 2024 Social with Wikimedia. Annotation for Language Identification.…
CC-MAIN-2024-42 by commoncrawl/ia-web-commons#37. This erratum affects all crawls from CC-MAIN-2013-20 until CC-MAIN-2024-38. Affected Crawls.…
We will refer to them as seed-crawl/CC-MAIN-2023-40 and seed-crawl/CC-MAIN-2023-50 respectively. Using two iterations is important because it gives us more insight into how the prevalence of each opt-out protocol may have changed over time.…