Search results
Today we are happy to announce cc-downloader, an experimental command-line tool for downloading Common Crawl data via https. cc-downloader is intended to be a user-friendly and polite downloader.…
You shall indemnify, defend and hold harmless CC and its officers, directors, shareholders, members, managers, employees and agents from all out-of-pocket costs, damages, losses, judgments, fines, and expenses (including reasonable attorneys' fees) (collectively…
The crawls used to generate the graphs were CC-MAIN-2024-33, CC-MAIN-2024-30, and CC-MAIN-2024-26. Thom Vaughan. Thom is Principal Technologist at the Common Crawl Foundation.…
The crawls used to generate the graphs were CC-MAIN-2024-33, CC-MAIN-2024-38, and CC-MAIN-2024-42. Thom Vaughan. Thom is Principal Technologist at the Common Crawl Foundation.…
The crawls used to generate the graphs were CC-MAIN-2024-18, CC-MAIN-2024-22, and CC-MAIN-2024-26. Thom Vaughan. Thom is Principal Technologist at the Common Crawl Foundation.…
The crawls used to generate the graphs were CC-MAIN-2024-46, CC-MAIN-2024-42, and CC-MAIN-2024-38. Thom Vaughan. Thom is Principal Technologist at the Common Crawl Foundation.…
The crawls used to generate the graphs were CC-MAIN-2024-30, CC-MAIN-2024-33, and CC-MAIN-2024-38. Thom Vaughan. Thom is Principal Technologist at the Common Crawl Foundation.…
The crawls used to generate the graphs were CC-MAIN-2024-42, CC-MAIN-2024-46, and CC-MAIN-2024-51. Thom Vaughan. Thom is Principal Technologist at the Common Crawl Foundation.…
The connection to S3 should be faster and you avoid the minimal fees for inter-region data transfer (you have to send requests which are charged as outgoing traffic).…
New Staff Members. New Board Member. Discord Server. Updated Legal Information. Crawl & Graph Errata. Improved Cadence. Acknowledgements. Web Graphs. Our.…
The December 2024 crawl archive is located in the commoncrawl bucket with the prefix: crawl-data/CC-MAIN-2024-51/. To assist with exploring and using the dataset, we provide gzip-compressed files which list all segments, WARC, WAT and WET files.…
Nov/Dec/Jan 2018/2019 webgraph data set from the following sources: sitemaps, RSS and Atom feeds; a breadth-first side crawl within a maximum of 6 links (“hops”) away from the homepages of the top 60 million hosts and domains; a random sample of outlinks…
The archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2018-17/. It contains 3.1 billion web pages and 230 TiB of uncompressed content, crawled between April 19th and 27th.…
New URLs stem from: extracting and sampling URLs from sitemaps, RSS and Atom feeds if provided by hosts visited in prior crawls. Hosts are selected from the highest-ranking 60 million domains of the…
Aug/Sep/Oct 2018 webgraph data set from the following sources: sitemaps, RSS and Atom feeds; a breadth-first side crawl within a maximum of 6 links (“hops”) away from the homepages of the top 50 million hosts and domains; a random sample of outlinks taken…
The archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2018-22/. It contains 2.75 billion web pages and 215 TiB of uncompressed content, crawled between May 20th and 28th.…
Feb/Mar/Apr 2019 webgraph data set from the following sources: sitemaps, RSS and Atom feeds; a breadth-first side crawl within a maximum of 4 links (“hops”) away from the homepages of the top 60 million hosts and domains and a random sample of 1 million…
The archive located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-40/ contains more than 1.72 billion web pages.…
The archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2018-30/. It contains 3.25 billion web pages and 255 TiB of uncompressed content, crawled between July 15th and 23rd.…
CC-MAIN-2016-36 to CC-MAIN-2016-50, and CC-MAIN-2018-34 to CC-MAIN-2019-47 the fetch_time metadata for robots.txt might be incorrect. The correct times can be found in collinfo.json.…
The archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-44/. It contains more than 3.25 billion web pages. Similar to the…
The archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-50/. It contains more than 2.85 billion web pages. Similar to the preceding September and…
The archive located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-30/ contains more than 1.73 billion web pages. To assist with exploring and using the dataset, we provide gzipped files that list: all segments…
CC-MAIN-2018-34 to CC-MAIN-2024-46 (since Aug 2018) lack the metadata record which is attached to all response records. Fixed with CC-MAIN-2024-51, see commoncrawl/nutch#33. Note: before…
crawl-data/CC-MAIN-2014-35/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2014-35/segment.paths.gz), all WARC files (CC-MAIN-2014-35/warc.paths.gz), all WAT files…
crawl-data/CC-MAIN-2014-49/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2014-49/segment.paths.gz), all WARC files (CC-MAIN-2014-49/warc.paths.gz), all WAT files…
randomly selected samples of 2 million human-readable sitemap pages (HTML format); 2 million URLs of pages written in 130 less-represented languages (cf. language distributions); 900 million URLs extracted and sampled from 20 million sitemaps, RSS and Atom feeds…
crawl-data/CC-MAIN-2015-40/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2015-40/segment.paths.gz), all WARC files (CC-MAIN-2015-40/warc.paths.gz), all WAT files…
crawl-data/CC-MAIN-2015-48/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2015-48/segment.paths.gz), all WARC files (CC-MAIN-2015-48/warc.paths.gz), all WAT files…
crawl-data/CC-MAIN-2014-52/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2014-52/segment.paths.gz), all WARC files (CC-MAIN-2014-52/warc.paths.gz), all WAT files…
Feb/Mar/Apr 2019 webgraph data set from the following sources: sitemaps, RSS and Atom feeds; a breadth-first side crawl within a maximum of 6 links (“hops”) away from the homepages of the top 60 million hosts and domains and a random sample of 1 million…
The archive located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-26/ contains more than 1.23 billion web pages. To assist with exploring and using the dataset, we provide gzipped files that list: all segments…
crawl-data/CC-MAIN-2014-41/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2014-41/segment.paths.gz), all WARC files (CC-MAIN-2014-41/warc.paths.gz), all WAT files…
crawl-data/CC-MAIN-2014-42/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2014-42/segment.paths.gz), all WARC files (CC-MAIN-2014-42/warc.paths.gz), all WAT files…
More than 1.46 billion web pages are in the archive, which is located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-22/. To assist with exploring and using the dataset, we provide gzipped files that list: all segments…
CC-MAIN-2024-10, CC-MAIN-2024-18, and CC-MAIN-2024-22. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases.…
CC-MAIN-2025-13, CC-MAIN-2025-08, and CC-MAIN-2025-05. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior Web Graph releases.…
CC-MAIN-2023-23, CC-MAIN-2023-40, and CC-MAIN-2023-50. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases.…
crawl-data/CC-MAIN-2016-07/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2016-07/segment.paths.gz), all WARC files (CC-MAIN-2016-07/warc.paths.gz), all WAT files…
crawl-data/CC-MAIN-2014-15/. To assist with exploring and using the dataset, we've provided gzipped files that list: all segments (CC-MAIN-2014-15/segment.paths.gz), all WARC files (CC-MAIN-2014-15/warc.paths.gz), all WAT files…
The archive located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-36/ contains more than 1.61 billion web pages. To extend the seed list, we've added 50 million hosts from the Common Search host-level pagerank data set.…
Nov/Dec/Jan 2018/2019 webgraph data set from the following sources: sitemaps, RSS and Atom feeds; a breadth-first side crawl within a maximum of 3 links (“hops”) away from the homepages of the top 60 million hosts and domains and a random sample of 1 million…
CC-MAIN-2024-22, CC-MAIN-2024-26, and CC-MAIN-2024-30. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior Web Graph Releases.…
crawl-data/CC-MAIN-2015-06/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2015-06/segment.paths.gz), all WARC files (CC-MAIN-2015-06/warc.paths.gz), all WAT files…
crawl-data/CC-MAIN-2014-23/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2014-23/segment.paths.gz), all WARC files (CC-MAIN-2014-23/warc.paths.gz), all WAT files…
More than 1.33 billion webpages are in the archive, which is located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-18/. To assist with exploring and using the dataset, we provide gzipped files that list: all segments…
CC-MAIN-2025-08, CC-MAIN-2025-05, and CC-MAIN-2024-51. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior Web Graph releases.…
crawl-data/CC-MAIN-2015-32/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2015-32/segment.paths.gz), all WARC files (CC-MAIN-2015-32/warc.paths.gz), all WAT files…
crawl-data/CC-MAIN-2015-18/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2015-18/segment.paths.gz), all WARC files (CC-MAIN-2015-18/warc.paths.gz), all WAT files…
randomly selected samples of 2 million human-readable sitemap pages (HTML format); 3 million URLs of pages written in 130 less-represented languages (cf. language distributions); 1 billion URLs extracted and sampled from 20 million sitemaps, RSS and Atom feeds…
CC-MAIN-2023-40, CC-MAIN-2023-50, and CC-MAIN-2024-10. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases.…
CC-MAIN-2023-50, CC-MAIN-2024-10, and CC-MAIN-2024-18. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases.…
crawl-data/CC-MAIN-2015-27/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2015-27/segment.paths.gz), all WARC files (CC-MAIN-2015-27/warc.paths.gz), all WAT files…
crawl-data/CC-MAIN-2015-14/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2015-14/segment.paths.gz), all WARC files (CC-MAIN-2015-14/warc.paths.gz), all WAT files…
crawl-data/CC-MAIN-2015-22/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2015-22/segment.paths.gz), all WARC files (CC-MAIN-2015-22/warc.paths.gz), all WAT files…
CC-MAIN-2025-05, CC-MAIN-2024-51, and CC-MAIN-2024-46. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior Web Graph releases.…
CC-MAIN-2025-08, CC-MAIN-2025-13, and CC-MAIN-2025-18. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior Web Graph releases.…
CC-MAIN-2023-14, CC-MAIN-2023-23, and CC-MAIN-2023-40. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior web graph releases.…
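Several of the archive snippets above give a bucket prefix of the form crawl-data/<crawl ID>/ together with listing files such as segment.paths.gz and warc.paths.gz. A minimal sketch of how those keys could be assembled for a given crawl ID follows. The function name listing_keys is hypothetical, and the sketch assumes that each listing file sits directly under the crawl prefix; only the two listing names that appear in full in the snippets are used.

```python
def listing_keys(crawl_id, kinds=("segment", "warc")):
    """Build bucket keys for the gzipped path listings of one crawl.

    Assumption: each listing lives at crawl-data/<crawl_id>/<kind>.paths.gz,
    as suggested by the snippets; this helper is illustrative only.
    """
    prefix = f"crawl-data/{crawl_id}/"
    return {kind: f"{prefix}{kind}.paths.gz" for kind in kinds}


if __name__ == "__main__":
    # Print the listing keys for one of the crawls named in the results.
    for kind, key in listing_keys("CC-MAIN-2014-35").items():
        print(kind, key)
```

The helper only builds key strings; fetching the listings from the commoncrawl bucket is left to whatever download tool is in use.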