Search results

Common Crawl - Blog - May 2021 crawl archive now available

May 23, 2021. May 2021 crawl archive now available. The crawl archive for May 2021 is now available! The data was crawled May 5 – 19 and contains 2.6 billion web pages or 280 TiB of uncompressed content.

Common Crawl - Blog - January 2021 crawl archive now available

February 2, 2021. January 2021 crawl archive now available. The crawl archive for January 2021 is now available! The data was crawled between January 15th and 28th and contains 3.4 billion web pages or 350 TiB of uncompressed content.

Common Crawl - Blog - June 2021 crawl archive now available

June 28, 2021. June 2021 crawl archive now available. The crawl archive for June 2021 is now available! The data was crawled June 12 – 25 and contains 2.45 billion web pages or 260 TiB of uncompressed content.

Common Crawl - Blog - April 2021 crawl archive now available

April 27, 2021. April 2021 crawl archive now available. The crawl archive for April 2021 is now available! The data was crawled April 10 – 23 and contains 3.1 billion web pages or 320 TiB of uncompressed content.

Common Crawl - Blog - October 2021 crawl archive now available

November 1, 2021. October 2021 crawl archive now available. The crawl archive for October 2021 is now available! The data was crawled Oct 15 – 28 and contains 3.3 billion web pages or 360 TiB of uncompressed content.

Common Crawl - Blog - September 2021 crawl archive now available

October 4, 2021. September 2021 crawl archive now available. The crawl archive for September 2021 is now available! The data was crawled Sept 16 – 29 and contains 2.95 billion web pages or 310 TiB of uncompressed content.

Common Crawl - Blog - February/March 2021 crawl archive now available

March 14, 2021. February/March 2021 crawl archive now available. The crawl archive for February/March 2021 is now available! The data was crawled between February 24th and March 9th and contains 2.7 billion web pages or 280 TiB of uncompressed content.

Common Crawl - Blog - November/December 2021 crawl archive now available

December 14, 2021. November/December 2021 crawl archive now available. The crawl archive for November/December 2021 is now available! The data was crawled Nov 26 – Dec 9 and contains 2.5 billion web pages or 280 TiB of uncompressed content.

Common Crawl - Blog - July/August 2021 crawl archive now available

August 9, 2021. July/August 2021 crawl archive now available. The crawl archive for July/August 2021 is now available! The data was crawled July 23 – August 6 and contains 3.15 billion web pages or 360 TiB of uncompressed content.

Common Crawl - Blog - Host- and Domain-Level Web Graphs October, November/December 2020 and January 2021

February 10, 2021. Host- and Domain-Level Web Graphs October, November/December 2020 and January 2021.

Common Crawl - Blog - Host- and Domain-Level Web Graphs October, November/December 2021 and January 2022

March 16, 2022. Host- and Domain-Level Web Graphs October, November/December 2021 and January 2022. We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of October, November/December 2021 and January 2022.

Common Crawl - Blog - Host- and Domain-Level Web Graphs February/March, April and May 2021

May 31, 2021. Host- and Domain-Level Web Graphs February/March, April and May 2021. We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of February/March, April and May 2021.

Common Crawl - Blog - Host- and Domain-Level Web Graphs June, July/August and September 2021

October 8, 2021. Host- and Domain-Level Web Graphs June, July/August and September 2021. We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of June, July/August and September 2021.

Common Crawl - Blog - Host- and Domain-Level Web Graphs September/October, November/December 2022 and January/February 2023

March 15, 2023. Host- and Domain-Level Web Graphs September/October, November/December 2022 and January/February 2023.

Common Crawl - Blog - Host- and Domain-Level Web Graphs September/October, November/December 2023 and February/March 2024

March 14, 2024. Host- and Domain-Level Web Graphs September/October, November/December 2023 and February/March 2024.

Common Crawl - Blog - March/April 2024 Newsletter

March 26, 2024. March/April 2024 Newsletter. We're excited to share an update on some of our recent projects and initiatives in this newsletter!

Common Crawl - Blog - Oct/Nov 2023 Performance Issues

November 15, 2023. Oct/Nov 2023 Performance Issues. Our datasets have become very popular over time, with downloads doubling every 6 months for several years in a row. This post details some steps to take if you are impacted by performance issues.

Common Crawl - Blog - October 2020 crawl archive now available

November 7, 2020. October 2020 crawl archive now available. The crawl archive for October 2020 is now available! The data was crawled between October 19th and November 1st and contains 2.71 billion web pages or 280 TiB of uncompressed content.

Common Crawl - Blog - August 2020 crawl archive now available

August 19, 2020. August 2020 crawl archive now available. The crawl archive for August 2020 is now available! It contains 2.45 billion web pages or 235 TiB of uncompressed content, crawled between August 2nd and 15th.

Common Crawl - Blog - January 2022 crawl archive now available

February 2, 2022. January 2022 crawl archive now available. The crawl archive for January 2022 is now available! The data was crawled January 16 – 29 and contains 2.95 billion web pages or 320 TiB of uncompressed content.

Common Crawl - Blog - August 2022 crawl archive now available

August 22, 2022. August 2022 crawl archive now available. The crawl archive for August 2022 is now available! The data was crawled August 7 – 20 and contains 2.55 billion web pages or 295 TiB of uncompressed content.

Common Crawl - Blog - July 2020 crawl archive now available

July 20, 2020. July 2020 crawl archive now available. The crawl archive for July 2020 is now available! It contains 3.14 billion web pages or 300 TiB of uncompressed content, crawled between July 2nd and 16th.

Common Crawl - Blog - February 2020 crawl archive now available

March 4, 2020. February 2020 crawl archive now available. The crawl archive for February 2020 is now available! It contains 2.6 billion web pages or 240 TiB of uncompressed content, crawled between February 16th and 29th.

Common Crawl - Blog - May 2022 crawl archive now available

June 2, 2022. May 2022 crawl archive now available. The crawl archive for May 2022 is now available! The data was crawled May 16 – 29 and contains 3.45 billion web pages or 420 TiB of uncompressed content.

Common Crawl - Blog - September 2020 crawl archive now available

October 7, 2020. September 2020 crawl archive now available. The crawl archive for September 2020 is now available! The data was crawled between September 18th and October 2nd and contains 3.45 billion web pages or 345 TiB of uncompressed content.

Common Crawl - Blog - January 2020 crawl archive now available

February 3, 2020. January 2020 crawl archive now available. The crawl archive for January 2020 is now available! It contains 3.1 billion web pages or 300 TiB of uncompressed content, crawled between January 17th and 29th.

Common Crawl - Blog - May/June 2023 crawl archive now available

June 21, 2023. May/June 2023 crawl archive now available. The crawl archive for May/June 2023 is now available! The data was crawled May 27 – June 11 and contains 3.1 billion web pages or 390 TiB of uncompressed content.

Common Crawl - Blog - September/October 2023 crawl archive now available

October 12, 2023. September/October 2023 crawl archive now available. The crawl archive for September/October 2023 is now available! The data was crawled Sept 21 – October 5 and contains 3.4 billion web pages or 456 TiB of uncompressed content.

Common Crawl - Blog - March/April 2023 crawl archive now available

April 6, 2023. March/April 2023 crawl archive now available. The crawl archive for March/April 2023 is now available! The data was crawled March 20 – April 2 and contains 3.1 billion web pages or 400 TiB of uncompressed content.

Common Crawl - Blog - February/March 2024 Crawl Archive Now Available

March 11, 2024. February/March 2024 Crawl Archive Now Available. The crawl archive for February/March 2024 is now available.

Common Crawl - Blog - November/December 2022 crawl archive now available

December 14, 2022. November/December 2022 crawl archive now available. The crawl archive for November/December 2022 is now available! The data was crawled November 26 – December 10 and contains 3.35 billion web pages or 420 TiB of uncompressed content.

Common Crawl - Blog - September/October 2022 crawl archive now available

October 11, 2022. September/October 2022 crawl archive now available. The crawl archive for September/October 2022 is now available! The data was crawled September 24 – October 8 and contains 3.15 billion web pages or 380 TiB of uncompressed content.

Common Crawl - Blog - January/February 2023 crawl archive now available

February 16, 2023. January/February 2023 crawl archive now available. The crawl archive for January/February 2023 is now available! The data was crawled January 26 – February 9 and contains 3.15 billion web pages or 400 TiB of uncompressed content.

Common Crawl - Erratum - Some 2–Level CCTLDs Excluded

A bad configuration was checked into our exclusion list on Sep 22, 2022 and was fixed on Oct 27, 2023. The configuration blocked a number of 2–level domains, meaning they were not included in certain crawls.

Common Crawl - Blog - November/December 2023 Crawl Archive Now Available

December 15, 2023. November/December 2023 Crawl Archive Now Available. The crawl archive for November/December 2023 is now available.

Common Crawl - Blog - June/July 2022 crawl archive now available

July 13, 2022. June/July 2022 crawl archive now available. The crawl archive for June/July 2022 is now available! The data was crawled June 24 – July 7 and contains 3.1 billion web pages or 370 TiB of uncompressed content.

Common Crawl - Blog - March/April 2020 crawl archive now available

April 14, 2020. March/April 2020 crawl archive now available. The crawl archive for March/April 2020 is now available! It contains 2.85 billion web pages or 280 TiB of uncompressed content, crawled between March 28th and April 10th.

Common Crawl - Blog - November/December 2020 crawl archive now available

December 10, 2020. November/December 2020 crawl archive now available. The crawl archive for November/December 2020 is now available!

Common Crawl - Blog - May/June 2020 crawl archive now available

June 10, 2020. May/June 2020 crawl archive now available. The crawl archive for May/June 2020 is now available! It contains 2.75 billion web pages or 255 TiB of uncompressed content, crawled between May 24th and June 7th.

Common Crawl - Blog - Host- and Domain-Level Web Graphs Feb/Mar/May 2020

June 16, 2020. Host- and Domain-Level Web Graphs Feb/Mar/May 2020. We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of February, March/April and May/June 2020.

Common Crawl - Blog - Host- and Domain-Level Web Graphs Jul/Aug/Sep 2020

October 16, 2020. Host- and Domain-Level Web Graphs Jul/Aug/Sep 2020. We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of July, August and September 2020.

Common Crawl - Blog - Host- and Domain-Level Web Graphs Mar/May/Oct 2023

October 18, 2023. Host- and Domain-Level Web Graphs Mar/May/Oct 2023. We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of March, May, and October 2023.

Common Crawl - Blog - Host- and Domain-Level Web Graphs May/Sep/Nov 2023

December 22, 2023. Host- and Domain-Level Web Graphs May/Sep/Nov 2023. We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of May, September, and November of 2023. Thom Vaughan.

Common Crawl - Blog - Host- and Domain-Level Web Graphs Nov/Dec/Jan 2019 – 2020

February 10, 2020. Host- and Domain-Level Web Graphs Nov/Dec/Jan 2019 – 2020. We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of November, December 2019 and January 2020.

Common Crawl - Blog - Host- and Domain-Level Web Graphs May, June/July and August 2022

September 23, 2022. Host- and Domain-Level Web Graphs May, June/July and August 2022. We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of May, June/July and August 2022.

Common Crawl - Blog - A Further Look Into the Prevalence of Various ML Opt–Out Protocols

February 22, 2024. A Further Look Into the Prevalence of Various ML Opt–Out Protocols. This post details some experiments that we have done regarding Machine Learning Opt–Out protocols.

Common Crawl - Team - Gil Elbaz

In 2020, Factual merged with Foursquare and today Gil is Co-Chairman of the board of a combined entity which generated $150m in combined revenue at the time of the merger.

Common Crawl - Get Started

WARC/1.0. content-type: application/http; msgtype=response. content-length: 583626. warc-ip-address: 208.80.154.224. warc-identified-payload-type: text/html. warc-payload-digest: sha1:A2QAZF3MHWNIQMX4YAGEY4LZX7Z5IVKE. warc-date: 2023-09-29T08:25:05Z. warc-concurrent-to

Common Crawl - Team - Greg Lindahl

Before joining Common Crawl full-time in 2023, Greg was a member of the Event Horizon Telescope Collaboration, working at the Center for Astrophysics - Harvard & Smithsonian. He has also contributed to the Wayback Machine at the Internet Archive.

Common Crawl - Blog - Introducing CloudFront as a new way to access Common Crawl data as part of Amazon Web Services’ registry of open data

March 1, 2022. Introducing CloudFront as a new way to access Common Crawl data as part of Amazon Web Services’ registry of open data. Ten years ago(!) Common Crawl joined AWS’s Open Data Sponsorships program, hosted on S3, with free access to everyone.

Common Crawl - Blog - Bridging Digital Exploration and Scientific Frontiers

October 10, 2023. Bridging Digital Exploration and Scientific Frontiers. This month Common Crawl Foundation members had the privilege of attending the 5th International Open Search Symposium at CERN in Geneva, Switzerland. Thom Vaughan.

Common Crawl - Blog - Interactive Webgraph Statistics Notebook Released

October 28, 2020. Interactive Webgraph Statistics Notebook Released. We are pleased to announce the release of an interactive Jupyter notebook that is used to provide visualization of webgraph statistics, and a way to interact with the webgraph. Alex Xue.

Common Crawl - Privacy Policy

LAST UPDATED: 23 March 2023. INTERPRETATION AND DEFINITIONS. INTERPRETATION. The words of which the initial letter is capitalized have meanings defined under the following conditions.

Common Crawl - Blog - Balancing Discovery and Privacy: A Look Into Opt–Out Protocols

February 13, 2024. Balancing Discovery and Privacy: A Look Into Opt–Out Protocols. What opt–out protocols are, their importance, how you can use them, how we respect them, and what the emerging initiatives are that surround them. Alex Xue.

Common Crawl - Terms of Use

LAST UPDATED: March 7, 2024. Welcome to the commoncrawl.org website (the "Site").