Search results

Common Crawl - Blog - Common Crawl's First In-House Web Graph

May 22, 2017. Common Crawl's First In-House Web Graph. We are pleased to announce the release of a host-level web graph of recent monthly crawls (February, March, April 2017). The graph consists of 385 million nodes and 2.5 billion edges. Sebastian Nagel.

Common Crawl - Blog - Host- and Domain-Level Web Graphs Aug/Sept/Oct 2017

November 27, 2017. Host- and Domain-Level Web Graphs Aug/Sept/Oct 2017. We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of August, September, and October 2017.

Common Crawl - Blog - Host- and Domain-Level Web Graphs Nov/Dec/Jan 2018 - 2019

February 20, 2019. Host- and Domain-Level Web Graphs Nov/Dec/Jan 2018 - 2019. We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of November, December 2018 and January 2019.

Common Crawl - Blog - March 2019 crawl archive now available

April 1, 2019. March 2019 crawl archive now available. The crawl archive for March 2019 is now available! It contains 2.55 billion web pages or 210 TiB of uncompressed content, crawled between March 18th and 27th. Sebastian Nagel.

Common Crawl - Blog - July 2019 crawl archive now available

July 30, 2019. July 2019 crawl archive now available. The crawl archive for July 2019 is now available! It contains 2.6 billion web pages or 220 TiB of uncompressed content, crawled between July 15th and 24th. Sebastian Nagel.

Common Crawl - Blog - April 2019 crawl archive now available

April 30, 2019. April 2019 crawl archive now available. The crawl archive for April 2019 is now available! It contains 2.5 billion web pages or 198 TiB of uncompressed content, crawled between April 18th and 26th. Sebastian Nagel.

Common Crawl - Blog - February 2019 crawl archive now available

March 1, 2019. February 2019 crawl archive now available. The crawl archive for February 2019 is now available! It contains 2.9 billion web pages or 225 TiB of uncompressed content, crawled between February 15th and 24th. Sebastian Nagel.

Common Crawl - Blog - January 2019 crawl archive now available

January 28, 2019. January 2019 crawl archive now available. The crawl archive for January 2019 is now available! It contains 2.85 billion web pages or 240 TiB of uncompressed content, crawled between January 15th and 24th. Sebastian Nagel.

Common Crawl - Blog - 3.25 Billion Pages Crawled in July 2018

July 28, 2018. 3.25 Billion Pages Crawled in July 2018. The crawl archive for July 2018 is now available! The archive contains 3.25 billion web pages and 255 TiB of uncompressed content, crawled between July 15th and 23rd. Sebastian Nagel.

Common Crawl - Blog - Host- and Domain-Level Web Graphs Aug/Sep/Oct 2019

November 12, 2019. Host- and Domain-Level Web Graphs Aug/Sep/Oct 2019. We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of August, September and October 2019.

Common Crawl - Blog - August 2019 crawl archive now available

August 30, 2019. August 2019 crawl archive now available. The crawl archive for August 2019 is now available! It contains 2.95 billion web pages or 260 TiB of uncompressed content, crawled between August 17th and 26th. Sebastian Nagel.

Common Crawl - Blog - June 2019 crawl archive now available

July 2, 2019. June 2019 crawl archive now available. The crawl archive for June 2019 is now available! It contains 2.6 billion web pages or 220 TiB of uncompressed content, crawled between June 16th and 27th with an operational break from 21st to 24th.

Common Crawl - Blog - September 2019 crawl archive now available

September 28, 2019. September 2019 crawl archive now available. The crawl archive for September 2019 is now available! It contains 2.55 billion web pages or 240 TiB of uncompressed content, crawled between September 15th and 24th.

Common Crawl - Blog - Host- and Domain-Level Web Graphs Feb/Mar/Apr 2018

May 7, 2018. Host- and Domain-Level Web Graphs Feb/Mar/Apr 2018. We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of February, March and April 2018.

Common Crawl - Blog - Host- and Domain-Level Web Graphs Nov/Dec/Jan 2019 – 2020

Host- and Domain-Level Web Graphs Nov/Dec/Jan 2019 – 2020. We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of November, December 2019 and January 2020.

Common Crawl - Blog - August Crawl Archive Introduces Language Annotations

August 26, 2018. August Crawl Archive Introduces Language Annotations. The crawl archive for August 2018 is now available! It contains 2.65 billion web pages and 220 TiB of uncompressed content, crawled between August 14th and 22nd.

Common Crawl - Blog - October 2019 crawl archive now available

October 29, 2019. October 2019 crawl archive now available. The crawl archive for October 2019 is now available! It contains 3.0 billion web pages or 280 TiB of uncompressed content, crawled between October 13th and 24th.

Common Crawl - Blog - December 2019 crawl archive now available

December 19, 2019. December 2019 crawl archive now available. The crawl archive for December 2019 is now available! It contains 2.45 billion web pages or 234 TiB of uncompressed content, crawled between December 5th and 16th.

Common Crawl - Blog - Host- and Domain-Level Web Graphs May/June/July 2019

August 8, 2019. Host- and Domain-Level Web Graphs May/June/July 2019. We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of May, June and July 2019.

Common Crawl - Blog - Host- and Domain-Level Web Graphs Feb/Mar/Apr 2019

May 9, 2019. Host- and Domain-Level Web Graphs Feb/Mar/Apr 2019. We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of February, March and April 2019.

Common Crawl - Blog - April 2018 Crawl Archive Now Available

May 2, 2018. April 2018 Crawl Archive Now Available. The crawl archive for April 2018 is now available! The archive contains 3.1 billion web pages and 230 TiB of uncompressed content, crawled between April 19th and 27th. Sebastian Nagel.

Common Crawl - Blog - November 2019 crawl archive now available

November 27, 2019. November 2019 crawl archive now available. The crawl archive for November 2019 is now available!

Common Crawl - Blog - May 2019 crawl archive now available

May 31, 2019. May 2019 crawl archive now available. The crawl archive for May 2019 is now available! It contains 2.65 billion web pages or 220 TiB of uncompressed content, crawled between May 19th and 27th. Sebastian Nagel.

Common Crawl - Blog - June 2018 Crawl Archive Now Available

July 2, 2018. June 2018 Crawl Archive Now Available. The crawl archive for June 2018 is now available! The archive contains 3.05 billion web pages and 235 TiB of uncompressed content, crawled between June 18th and 25th. Sebastian Nagel.

Common Crawl - Blog - October 2016 Crawl Archive Now Available

November 7, 2016. October 2016 Crawl Archive Now Available. The crawl archive for October 2016 is now available! The archive contains more than 3.25 billion web pages. Sebastian Nagel. Sebastian is a Distinguished Engineer with Common Crawl.

Common Crawl - Use Cases

Need Billions of Web Pages? Don’t Bother Crawling. Julien Nioche. AWS Public Data Sets: How to Stage Petabytes of Data for Analysis in AWS, AWS re:Invent 2018. Jed Sundwall, Sebastian Nagel, Dave Rocamora.

Common Crawl - Blog - Host- and Domain-Level Web Graphs Nov/Dec/Jan 2017-2018

February 8, 2018. Host- and Domain-Level Web Graphs Nov/Dec/Jan 2017-2018. We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of November, December 2017 and January 2018.

Common Crawl - Blog - 5 Good Reads in Big Open Data: March 26 2015

March 26, 2015. 5 Good Reads in Big Open Data: March 26 2015.

Common Crawl - Blog - New Crawl Data Available!

November 27, 2013. New Crawl Data Available! We are very pleased to announce that new crawl data is now available! The data was collected in 2013, contains approximately 2 billion web pages and is 102TB in size (uncompressed). Common Crawl Foundation.

Common Crawl - Blog - September 2018 crawl archive now available

October 3, 2018. September 2018 crawl archive now available. The crawl archive for September 2018 is now available! It contains 2.8 billion web pages and 220 TiB of uncompressed content, crawled between September 17th and 26th. Sebastian Nagel.

Common Crawl - Blog - Announcing the Common Crawl Index!

There is now an index for the Jan 2015 and Feb 2015 crawls. Going forward, a new index will be available at the same time as each new crawl.

Common Crawl - Blog - May/June 2020 crawl archive now available

It contains 2.75 billion web pages or 255 TiB of uncompressed content, crawled between May 24th and June 7th. It includes page captures of 1.2 billion URLs unknown in any of our prior crawl archives. Sebastian Nagel.

Common Crawl - Blog - Answers to Recent Community Questions

We agree that many of the pages on the web are junk, and we have no inclination to crawl a larger number of pages just for the sake of having a larger number.

Common Crawl - Blog - March 2014 Crawl Data Now Available

March 26, 2014. March 2014 Crawl Data Now Available. The March crawl of 2014 is now available! The new dataset contains approximately 2.8 billion webpages and is about 223TB in size. Common Crawl Foundation.

Common Crawl - Blog - WikiReverse- Visualizing Reverse Links with the Common Crawl Archive

WikiReverse [1] is an application that highlights web pages and the Wikipedia articles they link to. The project is based on Common Crawl’s July 2014 web crawl, which contains 3.6 billion pages.

Common Crawl - Blog - December 2018 crawl archive now available

December 22, 2018. December 2018 crawl archive now available. The crawl archive for December 2018 is now available! It contains 3.1 billion web pages or 250 TiB of uncompressed content, crawled between December 9th and 19th. Sebastian Nagel.

Common Crawl - Blog - 2012 Crawl Data Now Available

July 16, 2012. 2012 Crawl Data Now Available. I am very happy to announce that Common Crawl has released 2012 crawl data as well as a number of significant enhancements to our example library and help pages. Common Crawl Foundation.

Common Crawl - Blog - October 2018 crawl archive now available

October 30, 2018. October 2018 crawl archive now available. The crawl archive for October 2018 is now available! It contains 3.0 billion web pages and 240 TiB of uncompressed content, crawled between October 15th and 24th. Sebastian Nagel.

Common Crawl - Blog - Hyperlink Graph from Web Data Commons

November 13, 2013. Hyperlink Graph from Web Data Commons. The talented team at Web Data Commons recently extracted and analyzed the hyperlink graph within the Common Crawl 2012 corpus.

Common Crawl - Blog - March 2018 Crawl Archive Now Available

March 29, 2018. March 2018 Crawl Archive Now Available. The crawl archive for March 2018 is now available! The archive contains 3.2 billion web pages and 250+ TiB of uncompressed content, crawled between March 17th and 25th. Sebastian Nagel.

Common Crawl - Blog - September 2016 Crawl Archive Now Available

October 7, 2016. September 2016 Crawl Archive Now Available. The crawl archive for September 2016 is now available! The archive contains more than 1.72 billion web pages. Sebastian Nagel.

Common Crawl - Blog - OSCON 2012

June 19, 2012. OSCON 2012. We're just one month away from one of the biggest and most exciting events of the year, O'Reilly's Open Source Convention (OSCON). This year's conference will be held July 16th-20th in Portland, Oregon. Allison Domicone.

Common Crawl - Blog - Web Data Commons

Up till now, we have extracted data from two Common Crawl web corpora: one corpus consisting of 2.5 billion HTML pages dating from 2009/2010 and a second corpus consisting of 1.4 billion HTML pages dating from February 2012.

Common Crawl - Blog - January 2018 Crawl Archive Now Available

January 29, 2018. January 2018 Crawl Archive Now Available. The crawl archive for January 2018 is now available! The archive contains 3.4 billion web pages and 270 TiB of uncompressed content, crawled between January 16th and Jan 24th. Sebastian Nagel.

Common Crawl - Blog - Common Crawl's Move to Nutch

February 20, 2014. Common Crawl's Move to Nutch. Last year we transitioned from our custom crawler to the Apache Nutch crawler to run our 2013 crawls as part of our migration from our old data center to the cloud.

Common Crawl - Blog - April 2017 Crawl Archive Now Available

May 9, 2017. April 2017 Crawl Archive Now Available. The crawl archive for April 2017 is now available! The archive contains 2.94 billion+ web pages and over 250 TiB of uncompressed content. Sebastian Nagel.

Common Crawl - Blog - Host- and Domain-Level Web Graphs May/June/July 2018

August 12, 2018. Host- and Domain-Level Web Graphs May/June/July 2018. We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of May, June and July 2018.

Common Crawl - Blog - March 2017 Crawl Archive Now Available

April 4, 2017. March 2017 Crawl Archive Now Available. The crawl archive for March 2017 is now available! The archive contains 3.07 billion+ web pages and over 250 TiB of uncompressed content. Sebastian Nagel.

Common Crawl - Blog - MapReduce for the Masses: Zero to Hadoop in Five Minutes with Common Crawl

December 16, 2011. MapReduce for the Masses: Zero to Hadoop in Five Minutes with Common Crawl.

Common Crawl - Blog - March/April 2023 crawl archive now available

The data was crawled March 20 – April 2 and contains 3.1 billion web pages or 400 TiB of uncompressed content.

Common Crawl - Blog - July/August 2021 crawl archive now available

The data was crawled July 23 – August 6 and contains 3.15 billion web pages or 360 TiB of uncompressed content. It includes page captures of 1 billion new URLs, not visited in any of our prior crawls. Sebastian Nagel.

Common Crawl - Blog - Host- and Domain-Level Web Graphs Aug/Sep/Oct 2018

November 13, 2018. Host- and Domain-Level Web Graphs Aug/Sep/Oct 2018. We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of August, September and October 2018.

Common Crawl - Blog - August 2016 Crawl Archive Now Available

September 16, 2016. August 2016 Crawl Archive Now Available. The crawl archive for August 2016 is now available! The archive contains more than 1.61 billion web pages. Sebastian Nagel.

Common Crawl - Blog - July 2016 Crawl Archive Now Available

August 9, 2016. July 2016 Crawl Archive Now Available. The crawl archive for July 2016 is now available! The archive contains more than 1.73 billion web pages. Sebastian Nagel.

Common Crawl - Blog - December 2016 Crawl Archive Now Available

December 16, 2016. December 2016 Crawl Archive Now Available. The crawl archive for December 2016 is now available! The archive contains more than 2.85 billion web pages. Sebastian Nagel.

Common Crawl - Blog - June 2016 Crawl Archive Now Available

July 14, 2016. June 2016 Crawl Archive Now Available. The crawl archive for June 2016 is now available! The archive contains more than 1.23 billion web pages. Sebastian Nagel.

Common Crawl - Blog - February 2018 Crawl Archive Now Available

March 2, 2018. February 2018 Crawl Archive Now Available. The crawl archive for February 2018 is now available! The archive contains 3.4 billion web pages and 270+ TiB of uncompressed content, crawled between February 17th and Feb 26th. Sebastian Nagel.

Common Crawl - Blog - February 2017 Crawl Archive Now Available

March 10, 2017. February 2017 Crawl Archive Now Available. The crawl archive for February 2017 is now available! The archive contains 3.08 billion+ web pages and over 250 TiB of uncompressed content. Sebastian Nagel.

Common Crawl - Blog - May 2016 Crawl Archive Now Available

June 19, 2016. May 2016 Crawl Archive Now Available. The crawl archive for May 2016 is now available! More than 1.46 billion web pages are in the archive. Sebastian Nagel.

Common Crawl - Blog - January 2017 Crawl Archive Now Available

February 1, 2017. January 2017 Crawl Archive Now Available. The crawl archive for January 2017 is now available! The archive contains more than 3.14 billion web pages and about 250 TiB of uncompressed content. Sebastian Nagel.