Search results

Common Crawl - Blog - Introducing cc-downloader

Today we are happy to announce cc-downloader, an experimental command-line tool for downloading Common Crawl data via https. cc-downloader is intended to be a user-friendly and polite downloader.

Common Crawl - Terms of Use

You shall indemnify, defend and hold harmless CC and its officers, directors, shareholders, members, managers, employees and agents from all out-of-pocket costs, damages, losses, judgments, fines, and expenses (including reasonable attorneys' fees) (collectively

Common Crawl - Blog - Host- and Domain-Level Web Graphs June, July, and August 2024

The crawls used to generate the graphs were CC-MAIN-2024-33, CC-MAIN-2024-30, and CC-MAIN-2024-26. Thom Vaughan. Thom is Principal Technologist at the Common Crawl Foundation.

Common Crawl - Blog - Host- and Domain-Level Web Graphs August, September, and October 2024

The crawls used to generate the graphs were CC-MAIN-2024-33, CC-MAIN-2024-38, and CC-MAIN-2024-42. Thom Vaughan. Thom is Principal Technologist at the Common Crawl Foundation.

Common Crawl - Blog - Host- and Domain-Level Web Graphs April, May, and June 2024

The crawls used to generate the graphs were CC-MAIN-2024-18, CC-MAIN-2024-22, and CC-MAIN-2024-26. Thom Vaughan. Thom is Principal Technologist at the Common Crawl Foundation.

Common Crawl - Blog - Host- and Domain-Level Web Graphs September, October, November 2024

The crawls used to generate the graphs were CC-MAIN-2024-46, CC-MAIN-2024-42, and CC-MAIN-2024-38. Thom Vaughan. Thom is Principal Technologist at the Common Crawl Foundation.

Common Crawl - Blog - Host- and Domain-Level Web Graphs July, August, and September 2024

The crawls used to generate the graphs were CC-MAIN-2024-30, CC-MAIN-2024-33, and CC-MAIN-2024-38. Thom Vaughan. Thom is Principal Technologist at the Common Crawl Foundation.

Common Crawl - Blog - Host- and Domain-Level Web Graphs October, November, and December 2024

The crawls used to generate the graphs were CC-MAIN-2024-42, CC-MAIN-2024-46, and CC-MAIN-2024-51. Thom Vaughan. Thom is Principal Technologist at the Common Crawl Foundation.

Common Crawl - Get Started

The connection to S3 should be faster and you avoid the minimal fees for inter-region data transfer (you have to send requests which are charged as outgoing traffic).

Common Crawl - Blog - March/April 2024 Newsletter

New Staff Members. New Board Member. Discord Server. Updated Legal Information. Crawl & Graph Errata. Improved Cadence. Acknowledgements. Web Graphs. Our.

Common Crawl - Blog - December 2024 Crawl Archive Now Available

The December 2024 crawl archive is located in the commoncrawl bucket with the prefix: crawl-data/CC-MAIN-2024-51/. To assist with exploring and using the dataset, we provide gzip-compressed files which list all segments, WARC, WAT and WET files.
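
As a rough illustration of how the bucket, prefix, and path-listing files named in these announcements fit together, here is a minimal Python sketch. The `<kind>.paths.gz` naming follows the `segment.paths.gz` and `warc.paths.gz` files quoted in the older announcements; applying it to WAT and WET listings, the HTTPS base URL `https://data.commoncrawl.org`, and the helper names are assumptions made for illustration, not something these snippets state.

```python
# Sketch only: builds object keys and URLs, performs no network access.
BUCKET = "commoncrawl"  # bucket name as given in the announcements
BASE_URL = "https://data.commoncrawl.org"  # assumed HTTPS endpoint (not in the snippets)


def path_listing_keys(crawl_id, kinds=("segment", "warc", "wat", "wet")):
    """Return the assumed key of each gzipped path listing for one crawl."""
    prefix = f"crawl-data/{crawl_id}/"
    return {kind: f"{prefix}{kind}.paths.gz" for kind in kinds}


def to_s3_uri(key, bucket=BUCKET):
    """Turn an object key into an s3:// URI."""
    return f"s3://{bucket}/{key}"


def to_https_url(key, base_url=BASE_URL):
    """Turn an object key into an HTTPS URL under the assumed base URL."""
    return f"{base_url}/{key}"


if __name__ == "__main__":
    for kind, key in path_listing_keys("CC-MAIN-2024-51").items():
        print(kind, to_s3_uri(key), to_https_url(key))
```

Each listing, once downloaded and decompressed, is expected to contain one object key per line, which can be passed through the same two helpers.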

Common Crawl - Blog - March 2019 crawl archive now available

Nov/Dec/Jan 2018/2019 webgraph data set from the following sources: sitemaps, RSS and Atom feeds; a breadth-first side crawl within a maximum of 6 links (“hops”) away from the homepages of the top 60 million hosts and domains; a random sample of outlinks

Common Crawl - Blog - April 2018 Crawl Archive Now Available

The archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2018-17/. It contains 3.1 billion web pages and 230 TiB of uncompressed content, crawled between April 19th and 27th.

Common Crawl - Blog - November 2018 crawl archive now available

New URLs stem from: extracting and sampling URLs from sitemaps, RSS and Atom feeds if provided by hosts visited in prior crawls. Hosts are selected from the highest-ranking 60 million domains of the

Common Crawl - Blog - December 2018 crawl archive now available

New URLs stem from: extracting and sampling URLs from sitemaps, RSS and Atom feeds if provided by hosts visited in prior crawls. Hosts are selected from the highest-ranking 60 million domains of the

Common Crawl - Blog - January 2019 crawl archive now available

Aug/Sep/Oct 2018 webgraph data set from the following sources: sitemaps, RSS and Atom feeds; a breadth-first side crawl within a maximum of 6 links (“hops”) away from the homepages of the top 50 million hosts and domains; a random sample of outlinks taken

Common Crawl - Blog - May 2018 Crawl Archive Now Available

The archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2018-22/. It contains 2.75 billion web pages and 215 TiB of uncompressed content, crawled between May 20th and 28th.

Common Crawl - Blog - May 2019 crawl archive now available

Feb/Mar/Apr 2019 webgraph data set from the following sources: sitemaps, RSS and Atom feeds; a breadth-first side crawl within a maximum of 4 links (“hops”) away from the homepages of the top 60 million hosts and domains and a random sample of 1 million

Common Crawl - Blog - September 2016 Crawl Archive Now Available

The archive located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-40/ contains more than 1.72 billion web pages.

Common Crawl - Blog - 3.25 Billion Pages Crawled in July 2018

The archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2018-30/. It contains 3.25 billion web pages and 255 TiB of uncompressed content, crawled between July 15th and 23rd.

Common Crawl - Erratum - Incorrect fetch_time metadata

CC-MAIN-2016-36 to CC-MAIN-2016-50, and CC-MAIN-2018-34 to CC-MAIN-2019-47, the fetch_time metadata for robots.txt might be incorrect. The correct times can be found in collinfo.json.

Common Crawl - Blog - October 2016 Crawl Archive Now Available

The archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-44/. It contains more than 3.25 billion web pages. Similar to the

Common Crawl - Blog - December 2016 Crawl Archive Now Available

The archive is located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-50/. It contains more than 2.85 billion web pages. Similar to the preceding September and

Common Crawl - Blog - July 2016 Crawl Archive Now Available

The archive located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-30/ contains more than 1.73 billion web pages. To assist with exploring and using the dataset, we provide gzipped files that list: all segments

Common Crawl - Erratum - WARC revisit metadata records

CC-MAIN-2018-34 to CC-MAIN-2024-46 (since Aug 2018) lack the metadata record which is attached to all response records. Fixed with CC-MAIN-2024-51, see commoncrawl/nutch#33. Note: before

Common Crawl - Blog - August 2014 Crawl Data Available

crawl-data/CC-MAIN-2014-35/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2014-35/segment.paths.gz), all WARC files (CC-MAIN-2014-35/warc.paths.gz), all WAT files

Common Crawl - Blog - October 2018 crawl archive now available

New URLs stem from: extracting and sampling URLs from sitemaps, RSS and Atom feeds if provided by hosts visited in prior crawls. Hosts are selected from the highest-ranking 60 million domains of the

Common Crawl - Blog - November 2014 Crawl Archive Available

crawl-data/CC-MAIN-2014-49/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2014-49/segment.paths.gz), all WARC files (CC-MAIN-2014-49/warc.paths.gz), all WAT files

Common Crawl - Blog - July 2019 crawl archive now available

randomly selected samples of 2 million human-readable sitemap pages (HTML format); 2 million URLs of pages written in 130 less-represented languages (cf. language distributions); 900 million URLs extracted and sampled from 20 million sitemaps, RSS and Atom feeds

Common Crawl - Blog - September 2015 Crawl Archive Now Available

crawl-data/CC-MAIN-2015-40/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2015-40/segment.paths.gz), all WARC files (CC-MAIN-2015-40/warc.paths.gz), all WAT files

Common Crawl - Blog - November 2015 Crawl Archive Now Available

crawl-data/CC-MAIN-2015-48/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2015-48/segment.paths.gz), all WARC files (CC-MAIN-2015-48/warc.paths.gz), all WAT files

Common Crawl - Blog - December 2014 Crawl Archive Available

crawl-data/CC-MAIN-2014-52/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2014-52/segment.paths.gz), all WARC files (CC-MAIN-2014-52/warc.paths.gz), all WAT files

Common Crawl - Blog - June 2019 crawl archive now available

Feb/Mar/Apr 2019 webgraph data set from the following sources: sitemaps, RSS and Atom feeds; a breadth-first side crawl within a maximum of 6 links (“hops”) away from the homepages of the top 60 million hosts and domains and a random sample of 1 million

Common Crawl - Blog - June 2016 Crawl Archive Now Available

The archive located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-26/ contains more than 1.23 billion web pages. To assist with exploring and using the dataset, we provide gzipped files that list: all segments

Common Crawl - Blog - September 2014 Crawl Archive Available

crawl-data/CC-MAIN-2014-41/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2014-41/segment.paths.gz), all WARC files (CC-MAIN-2014-41/warc.paths.gz), all WAT files

Common Crawl - Blog - October 2014 Crawl Archive Available

crawl-data/CC-MAIN-2014-42/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2014-42/segment.paths.gz), all WARC files (CC-MAIN-2014-42/warc.paths.gz), all WAT files

Common Crawl - Blog - May 2016 Crawl Archive Now Available

More than 1.46 billion web pages are in the archive, which is located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-22/. To assist with exploring and using the dataset, we provide gzipped files that list: all segments

Common Crawl - Blog - Host- and Domain-Level Web Graphs February/March, April, and May 2024

CC-MAIN-2024-10, CC-MAIN-2024-18, and CC-MAIN-2024-22. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases.

Common Crawl - Blog - Host- and Domain-Level Web Graphs January, February, and March 2025

CC-MAIN-2025-13, CC-MAIN-2025-08, and CC-MAIN-2025-05. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior Web Graph releases.

Common Crawl - Blog - Host- and Domain-Level Web Graphs May/Sep/Nov 2023

CC-MAIN-2023-23, CC-MAIN-2023-40, and CC-MAIN-2023-50. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases.

Common Crawl - Blog - February 2016 Crawl Archive Now Available

crawl-data/CC-MAIN-2016-07/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2016-07/segment.paths.gz), all WARC files (CC-MAIN-2016-07/warc.paths.gz), all WAT files

Common Crawl - Blog - April 2014 Crawl Data Available

crawl-data/CC-MAIN-2014-15/. To assist with exploring and using the dataset, we've provided gzipped files that list: all segments (CC-MAIN-2014-15/segment.paths.gz), all WARC files (CC-MAIN-2014-15/warc.paths.gz), all WAT files

Common Crawl - Blog - August 2016 Crawl Archive Now Available

The archive located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-36/ contains more than 1.61 billion web pages. To extend the seed list, we've added 50 million hosts from the Common Search host-level pagerank data set.

Common Crawl - Blog - April 2019 crawl archive now available

Nov/Dec/Jan 2018/2019 webgraph data set from the following sources: sitemaps, RSS and Atom feeds; a breadth-first side crawl within a maximum of 3 links (“hops”) away from the homepages of the top 60 million hosts and domains and a random sample of 1 million

Common Crawl - Blog - Host- and Domain-Level Web Graphs May, June, and July 2024

CC-MAIN-2024-22, CC-MAIN-2024-26, and CC-MAIN-2024-30. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior Web Graph Releases.

Common Crawl - Blog - January 2015 Crawl Archive Available

crawl-data/CC-MAIN-2015-06/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2015-06/segment.paths.gz), all WARC files (CC-MAIN-2015-06/warc.paths.gz), all WAT files

Common Crawl - Blog - July 2014 Crawl Data Available

crawl-data/CC-MAIN-2014-23/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2014-23/segment.paths.gz), all WARC files (CC-MAIN-2014-23/warc.paths.gz), all WAT files

Common Crawl - Blog - April 2016 Crawl Archive Now Available

More than 1.33 billion webpages are in the archive, which is located in the commoncrawl bucket at crawl-data/CC-MAIN-2016-18/. To assist with exploring and using the dataset, we provide gzipped files that list: all segments

Common Crawl - Blog - Host- and Domain-Level Web Graphs December 2024 and January/February 2025

CC-MAIN-2025-08, CC-MAIN-2025-05, and CC-MAIN-2024-51. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior Web Graph releases.

Common Crawl - Blog - July 2015 Crawl Archive Available

crawl-data/CC-MAIN-2015-32/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2015-32/segment.paths.gz), all WARC files (CC-MAIN-2015-32/warc.paths.gz), all WAT files

Common Crawl - Blog - April 2015 Crawl Archive Available

crawl-data/CC-MAIN-2015-18/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2015-18/segment.paths.gz), all WARC files (CC-MAIN-2015-18/warc.paths.gz), all WAT files

Common Crawl - Blog - August 2019 crawl archive now available

randomly selected samples of 2 million human-readable sitemap pages (HTML format); 3 million URLs of pages written in 130 less-represented languages (cf. language distributions); 1 billion URLs extracted and sampled from 20 million sitemaps, RSS and Atom feeds

Common Crawl - Blog - Host- and Domain-Level Web Graphs September/October, November/December 2023 and February/March 2024

CC-MAIN-2023-40, CC-MAIN-2023-50, and CC-MAIN-2024-10. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases.

Common Crawl - Blog - Host- and Domain-Level Web Graphs November/December 2023, February/March 2024, and April 2024

CC-MAIN-2023-50, CC-MAIN-2024-10, and CC-MAIN-2024-18. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior webgraph releases.

Common Crawl - Blog - June 2015 Crawl Archive Available

crawl-data/CC-MAIN-2015-27/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2015-27/segment.paths.gz), all WARC files (CC-MAIN-2015-27/warc.paths.gz), all WAT files

Common Crawl - Blog - March 2015 Crawl Archive Available

crawl-data/CC-MAIN-2015-14/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2015-14/segment.paths.gz), all WARC files (CC-MAIN-2015-14/warc.paths.gz), all WAT files

Common Crawl - Blog - May 2015 Crawl Archive Available

crawl-data/CC-MAIN-2015-22/. To assist with exploring and using the dataset, we’ve provided gzipped files that list: all segments (CC-MAIN-2015-22/segment.paths.gz), all WARC files (CC-MAIN-2015-22/warc.paths.gz), all WAT files

Common Crawl - Blog - Host- and Domain-Level Web Graphs November/December 2024 and January 2025

CC-MAIN-2025-05, CC-MAIN-2024-51, and CC-MAIN-2024-46. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior Web Graph releases.

Common Crawl - Blog - Host- and Domain-Level Web Graphs February, March, and April 2025

CC-MAIN-2025-08, CC-MAIN-2025-13, and CC-MAIN-2025-18. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior Web Graph releases.

Common Crawl - Blog - Host- and Domain-Level Web Graphs Mar/May/Oct 2023

CC-MAIN-2023-14, CC-MAIN-2023-23, and CC-MAIN-2023-40. Additional information about the data formats, the processing pipeline, our objectives, and credits can be found in the announcements of prior web graph releases.