Search results

Common Crawl - Blog - March/April 2023 crawl archive now available

March/April 2023 crawl archive now available. The crawl archive for March/April 2023 is now available! The data was crawled March 20 – April 2 and contains 3.1 billion web pages or 400 TiB of uncompressed content.

Common Crawl - Blog - January 2022 crawl archive now available

February 2, 2022. January 2022 crawl archive now available. The crawl archive for January 2022 is now available! The data was crawled January 16 – 29 and contains 2.95 billion web pages or 320 TiB of uncompressed content.

Common Crawl - Blog - May 2022 crawl archive now available

June 2, 2022. May 2022 crawl archive now available. The crawl archive for May 2022 is now available! The data was crawled May 16 – 29 and contains 3.45 billion web pages or 420 TiB of uncompressed content.

Common Crawl - Blog - January 2021 crawl archive now available

February 2, 2021. January 2021 crawl archive now available. The crawl archive for January 2021 is now available! The data was crawled between January 15th and 28th and contains 3.4 billion web pages or 350 TiB of uncompressed content.

Common Crawl - Erratum - ARC Format (Legacy) Crawls

ARC Format (Legacy) Crawls. Our early crawls were archived using the ARC (Archive) format, not the WARC (Web ARChive) format. The ARC format, which predates WARC, was the initial format used for storing web crawl data.

Common Crawl - Blog - September 2019 crawl archive now available

September 2019 crawl archive now available. The crawl archive for September 2019 is now available! It contains 2.55 billion web pages or 240 TiB of uncompressed content, crawled between September 15th and 24th.

Common Crawl - Blog - September 2021 crawl archive now available

September 2021 crawl archive now available. The crawl archive for September 2021 is now available! The data was crawled Sept 16 – 29 and contains 2.95 billion web pages or 310 TiB of uncompressed content.

Common Crawl - Blog - October 2021 crawl archive now available

October 2021 crawl archive now available. The crawl archive for October 2021 is now available! The data was crawled Oct 15 – 28 and contains 3.3 billion web pages or 360 TiB of uncompressed content.

Common Crawl - Blog - May 2021 crawl archive now available

May 2021 crawl archive now available. The crawl archive for May 2021 is now available! The data was crawled May 5 – 19 and contains 2.6 billion web pages or 280 TiB of uncompressed content.

Common Crawl - Blog - November/December 2020 crawl archive now available

November/December 2020 crawl archive now available. The crawl archive for November/December 2020 is now available! The data was crawled between November 23 and December 6 and contains 2.64 billion web pages or 270 TiB of uncompressed content.

Common Crawl - Blog - June/July 2022 crawl archive now available

June/July 2022 crawl archive now available. The crawl archive for June/July 2022 is now available! The data was crawled June 24 – July 7 and contains 3.1 billion web pages or 370 TiB of uncompressed content.

Common Crawl - Blog - September 2020 crawl archive now available

September 2020 crawl archive now available. The crawl archive for September 2020 is now available! The data was crawled between September 18th and October 2nd and contains 3.45 billion web pages or 345 TiB of uncompressed content.

Common Crawl - Blog - June 2021 crawl archive now available

June 2021 crawl archive now available. The crawl archive for June 2021 is now available! The data was crawled June 12 – 25 and contains 2.45 billion web pages or 260 TiB of uncompressed content.

Common Crawl - Blog - April 2021 crawl archive now available

April 2021 crawl archive now available. The crawl archive for April 2021 is now available! The data was crawled April 10 – 23 and contains 3.1 billion web pages or 320 TiB of uncompressed content.

Common Crawl - Blog - January/February 2023 crawl archive now available

January/February 2023 crawl archive now available. The crawl archive for January/February 2023 is now available! The data was crawled January 26 – February 9 and contains 3.15 billion web pages or 400 TiB of uncompressed content.

Common Crawl - Blog - October 2016 Crawl Archive Now Available

October 2016 Crawl Archive Now Available. The crawl archive for October 2016 is now available! The archive contains more than 3.25 billion web pages. Sebastian Nagel. Sebastian is a Distinguished Engineer with Common Crawl.

Common Crawl - Blog - May/June 2023 crawl archive now available

May/June 2023 crawl archive now available. The crawl archive for May/June 2023 is now available! The data was crawled May 27 – June 11 and contains 3.1 billion web pages or 390 TiB of uncompressed content.

Common Crawl - Blog - August 2022 crawl archive now available

August 2022 crawl archive now available. The crawl archive for August 2022 is now available! The data was crawled August 7 – 20 and contains 2.55 billion web pages or 295 TiB of uncompressed content.

Common Crawl - Blog - October 2020 crawl archive now available

October 2020 crawl archive now available. The crawl archive for October 2020 is now available! The data was crawled between October 19th and November 1st and contains 2.71 billion web pages or 280 TiB of uncompressed content.

Common Crawl - Blog - November/December 2022 crawl archive now available

November/December 2022 crawl archive now available. The crawl archive for November/December 2022 is now available! The data was crawled November 26 – December 10 and contains 3.35 billion web pages or 420 TiB of uncompressed content.

Common Crawl - Blog - February/March 2021 crawl archive now available

February/March 2021 crawl archive now available. The crawl archive for February/March 2021 is now available! The data was crawled between February 24th and March 9th and contains 2.7 billion web pages or 280 TiB of uncompressed content.

Common Crawl - Blog - September/October 2022 crawl archive now available

September/October 2022 crawl archive now available. The crawl archive for September/October 2022 is now available! The data was crawled September 24 – October 8 and contains 3.15 billion web pages or 380 TiB of uncompressed content.

Common Crawl - Blog - July/August 2021 crawl archive now available

July/August 2021 crawl archive now available. The crawl archive for July/August 2021 is now available! The data was crawled July 23 – August 6 and contains 3.15 billion web pages or 360 TiB of uncompressed content.

Common Crawl - Blog - November/December 2021 crawl archive now available

November/December 2021 crawl archive now available. The crawl archive for November/December 2021 is now available! The data was crawled Nov 26 – Dec 9 and contains 2.5 billion web pages or 280 TiB of uncompressed content.

Common Crawl - News Crawl

News Crawl. News is a text genre that is often discussed on our user and developer mailing list. Yet our monthly crawl and release schedule is not well-adapted to this type of content, which is based on developing and current events.

Common Crawl - Overview

The Common Crawl corpus contains petabytes of data, regularly collected since 2008. The corpus contains raw web page data, metadata extracts, and text extracts.

Common Crawl - Blog - November/December 2023 Crawl Archive Now Available

November/December 2023 Crawl Archive Now Available. The crawl archive for November/December 2023 is now available. The data was crawled between November 28th and December 12th, and contains 3.35 billion web pages (or 454 TiB of uncompressed content).

Common Crawl - Blog - June 2018 Crawl Archive Now Available

July 2, 2018. June 2018 Crawl Archive Now Available. The crawl archive for June 2018 is now available! The archive contains 3.05 billion web pages and 235 TiB of uncompressed content, crawled between June 18th and 25th. Sebastian Nagel.

Common Crawl - Blog - September/October 2023 crawl archive now available

September/October 2023 crawl archive now available. The crawl archive for September/October 2023 is now available! The data was crawled Sept 21 – October 5 and contains 3.4 billion web pages or 456 TiB of uncompressed content. Julien Nioche.

Common Crawl - Blog - February/March 2024 Crawl Archive Now Available

February/March 2024 Crawl Archive Now Available. The crawl archive for February/March 2024 is now available. The data was crawled between February 20th and March 5th, and contains 3.16 billion web pages (or 424.7 TiB of uncompressed content).

Common Crawl - Blog - December 2018 crawl archive now available

December 2018 crawl archive now available. The crawl archive for December 2018 is now available! It contains 3.1 billion web pages or 250 TiB of uncompressed content, crawled between December 9th and 19th. Sebastian Nagel.

Common Crawl - Blog - Common Crawl's First In-House Web Graph

Sebastian is a Distinguished Engineer with Common Crawl. We are pleased to announce the release of a host-level web graph of recent monthly crawls (February, March, April 2017). The graph consists of 385 million nodes and 2.5 billion edges.

Common Crawl - Blog - October 2018 crawl archive now available

October 2018 crawl archive now available. The crawl archive for October 2018 is now available! It contains 3.0 billion web pages and 240 TiB of uncompressed content, crawled between October 15th and 24th. Sebastian Nagel.

Common Crawl - Blog - Host- and Domain-Level Web Graphs Aug/Sept/Oct 2017

We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of August, September, and October 2017.

Common Crawl - Blog - February 2019 crawl archive now available

February 2019 crawl archive now available. The crawl archive for February 2019 is now available! It contains 2.9 billion web pages or 225 TiB of uncompressed content, crawled between February 15th and 24th. Sebastian Nagel.

Common Crawl - Blog - Common Crawl's Move to Nutch

Common Crawl Foundation. Last year we transitioned from our custom crawler to the Apache Nutch crawler to run our 2013 crawls as part of our migration from our old data center to the cloud.

Common Crawl - Blog - September 2018 crawl archive now available

September 2018 crawl archive now available. The crawl archive for September 2018 is now available! It contains 2.8 billion web pages and 220 TiB of uncompressed content, crawled between September 17th and 26th. Sebastian Nagel.

Common Crawl
