Search results

Common Crawl - Blog - February 2016 Crawl Archive Now Available

February 29, 2016. February 2016 Crawl Archive Now Available. As an interim crawl engineer for CommonCrawl, I am pleased to announce that the crawl archive for February 2016 is now available! This crawl archive holds more than 1.73 billion URLs.

Common Crawl - Blog - April 2016 Crawl Archive Now Available

May 24, 2016. April 2016 Crawl Archive Now Available. The crawl archive for April 2016 is now available! More than 1.33 billion webpages are in the archive. Sebastian Nagel. Sebastian is a Distinguished Engineer with Common Crawl.

Common Crawl - Blog - August 2016 Crawl Archive Now Available

September 16, 2016. August 2016 Crawl Archive Now Available. The crawl archive for August 2016 is now available! The archive contains more than 1.61 billion web pages. Sebastian Nagel. Sebastian is a Distinguished Engineer with Common Crawl.

Common Crawl - Blog - September 2016 Crawl Archive Now Available

October 7, 2016. September 2016 Crawl Archive Now Available. The crawl archive for September 2016 is now available! The archive contains more than 1.72 billion web pages. Sebastian Nagel. Sebastian is a Distinguished Engineer with Common Crawl.

Common Crawl - Blog - June 2016 Crawl Archive Now Available

July 14, 2016. June 2016 Crawl Archive Now Available. The crawl archive for June 2016 is now available! The archive contains more than 1.23 billion web pages. Sebastian Nagel. Sebastian is a Distinguished Engineer with Common Crawl.

Common Crawl - Blog - May 2016 Crawl Archive Now Available

June 19, 2016. May 2016 Crawl Archive Now Available. The crawl archive for May 2016 is now available! More than 1.46 billion web pages are in the archive. Sebastian Nagel. Sebastian is a Distinguished Engineer with Common Crawl.

Common Crawl - Blog - July 2016 Crawl Archive Now Available

August 9, 2016. July 2016 Crawl Archive Now Available. The crawl archive for July 2016 is now available! The archive contains more than 1.73 billion web pages. Sebastian Nagel. Sebastian is a Distinguished Engineer with Common Crawl.

Common Crawl - Blog - October 2016 Crawl Archive Now Available

November 7, 2016. October 2016 Crawl Archive Now Available. The crawl archive for October 2016 is now available! The archive contains more than 3.25 billion web pages. Sebastian Nagel. Sebastian is a Distinguished Engineer with Common Crawl.

Common Crawl - Blog - December 2016 Crawl Archive Now Available

December 16, 2016. December 2016 Crawl Archive Now Available. The crawl archive for December 2016 is now available! The archive contains more than 2.85 billion web pages. Sebastian Nagel. Sebastian is a Distinguished Engineer with Common Crawl.

Common Crawl - Erratum - Charset Detection Bug in WET Records

The charset detection required to properly transform non-UTF-8 HTML pages in WARC files into WET records didn't work before November 2016 due to a bug in IIPC Web Archive Commons (see the related issue in the CC fork of Apache Nutch).

Common Crawl - Blog - Data Sets Containing Robots.txt Files and Non-200 Responses

September 16, 2016. Data Sets Containing Robots.txt Files and Non-200 Responses.

Common Crawl - Erratum - Incorrect fetch_time metadata

In CC-MAIN-2016-36 to CC-MAIN-2016-50, and CC-MAIN-2018-34 to CC-MAIN-2019-47, the fetch_time metadata for robots.txt might be incorrect. The correct times can be found in collinfo.json.
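The two crawl ranges named in this erratum can be checked mechanically. The following is a minimal sketch, assuming crawl IDs always have the form CC-MAIN-YYYY-WW and that both range boundaries are inclusive. The function names are hypothetical and are not part of any Common Crawl tooling.

```python
import re

# Affected crawl ranges as (year, week) pairs, taken from the erratum snippet.
# Treating both ends as inclusive is an assumption.
AFFECTED_RANGES = [
    ((2016, 36), (2016, 50)),
    ((2018, 34), (2019, 47)),
]

# Assumed crawl ID layout: CC-MAIN-<4-digit year>-<2-digit week>.
_CRAWL_ID = re.compile(r"^CC-MAIN-(\d{4})-(\d{2})$")


def parse_crawl_id(crawl_id):
    """Split a crawl ID such as 'CC-MAIN-2016-36' into (year, week)."""
    match = _CRAWL_ID.match(crawl_id)
    if match is None:
        raise ValueError(f"unexpected crawl ID format: {crawl_id!r}")
    return int(match.group(1)), int(match.group(2))


def fetch_time_may_be_incorrect(crawl_id):
    """Return True if the crawl falls inside one of the affected ranges."""
    key = parse_crawl_id(crawl_id)
    # Tuples compare element by element, so (year, week) orders chronologically.
    return any(low <= key <= high for low, high in AFFECTED_RANGES)
```

For a crawl inside either range, the erratum suggests treating the robots.txt fetch_time as unreliable and consulting collinfo.json instead.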

Common Crawl - Blog - News Dataset Available

October 4, 2016. News Dataset Available. We are pleased to announce the release of a new dataset containing news articles from news sites all over the world. Sebastian Nagel. Sebastian is a Distinguished Engineer with Common Crawl.

Common Crawl - Blog - Common Crawl's First In-House Web Graph

May 22, 2017. Common Crawl's First In-House Web Graph. We are pleased to announce the release of a host-level web graph of recent monthly crawls (February, March, April 2017). The graph consists of 385 million nodes and 2.5 billion edges. Sebastian Nagel.

Common Crawl - Blog - Welcome, Sebastian!

May 13, 2016. Welcome, Sebastian! It is a pleasure to officially announce that Sebastian Nagel joined Common Crawl as Crawl Engineer in April. Sebastian brings to Common Crawl a unique blend of experience, skills, knowledge (and enthusiasm!)

Common Crawl - Blog - July/August 2021 crawl archive now available

The change reduces the size of the robots.txt subset (since August 2016) by removing content which should not be contained in this dataset.

Common Crawl - Blog - Host- and Domain-Level Web Graphs Nov/Dec/Jan 2017-2018

February 8, 2018. Host- and Domain-Level Web Graphs Nov/Dec/Jan 2017-2018. We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of November, December 2017 and January 2018.

Common Crawl - Blog - Host- and Domain-Level Web Graphs Nov/Dec/Jan 2018 - 2019

February 20, 2019. Host- and Domain-Level Web Graphs Nov/Dec/Jan 2018 - 2019. We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of November, December 2018 and January 2019.

Common Crawl - Blog - OSCON 2012

June 19, 2012. OSCON 2012. We're just one month away from one of the biggest and most exciting events of the year, O'Reilly's Open Source Convention (OSCON). This year's conference will be held July 16th-20th in Portland, Oregon. Allison Domicone.

Common Crawl - Blog - Winter 2013 Crawl Data Now Available

January 8, 2014. Winter 2013 Crawl Data Now Available. The second crawl of 2013 is now available! In late November, we published the data from the first crawl of 2013.

Common Crawl - Blog - December 2014 Crawl Archive Available

January 9, 2015. December 2014 Crawl Archive Available. The crawl archive for December 2014 is now available! This crawl archive is over 160TB in size and contains 2.08 billion webpages. Stephen Merity.

Common Crawl - Blog - March 2014 Crawl Data Now Available

March 26, 2014. March 2014 Crawl Data Now Available. The March crawl of 2014 is now available! The new dataset contains approximately 2.8 billion webpages and is about 223TB in size. Common Crawl Foundation.

Common Crawl - Blog - The Norvig Web Data Science Award

November 15, 2012. The Norvig Web Data Science Award. We are very excited to announce the Norvig Web Data Science Award! Common Crawl and SARA created the award to encourage research in web data science. Common Crawl Foundation.

Common Crawl - Blog - A Look Inside Our 210TB 2012 Web Corpus

August 13, 2013. A Look Inside Our 210TB 2012 Web Corpus. Want to know more detail about what data is in the 2012 Common Crawl corpus without running a job? Now you can thanks to Sebastian Spiegler! Common Crawl Foundation.

Common Crawl - Use Cases

Overview of the original Common Crawl crawler (in use 2008-2013) discussing the Hadoop data processing pipeline, PageRank implementation, and the techniques used to optimize Hadoop. The Web of Data and Web Data Commons.

Common Crawl - Blog - Web Data Commons

March 22, 2012. Web Data Commons. For the last few months, we have been talking with Chris Bizer and Hannes Mühleisen at the Freie Universität Berlin about their work and we have been greatly looking forward to the announcement of the Web Data Commons.

Common Crawl - Blog - Hyperlink Graph from Web Data Commons

November 13, 2013. Hyperlink Graph from Web Data Commons. The talented team at Web Data Commons recently extracted and analyzed the hyperlink graph within the Common Crawl 2012 corpus.

Common Crawl - Blog - Startup Profile: SwiftKey’s Head Data Scientist on the Value of Common Crawl’s Open Data

August 14, 2013. Startup Profile: SwiftKey’s Head Data Scientist on the Value of Common Crawl’s Open Data. Sebastian Spiegler is the head of the data team at SwiftKey and a volunteer at Common Crawl.

Common Crawl - Blog - February 2015 Crawl Archive Available

March 31, 2015. February 2015 Crawl Archive Available. The crawl archive for February 2015 is now available! This crawl archive is over 145TB in size and holds over 1.9 billion webpages. Stephen Merity.

Common Crawl - Blog - July 2014 Crawl Data Available

August 7, 2014. July 2014 Crawl Data Available. The July crawl of 2014 is now available! The new dataset is over 266TB in size containing approximately 3.6 billion webpages. Stephen Merity.

Common Crawl - Blog - September 2014 Crawl Archive Available

November 12, 2014. September 2014 Crawl Archive Available. The crawl archive for September 2014 is now available! This crawl archive is over 220TB in size and contains 2.98 billion webpages. Stephen Merity.

Common Crawl - Blog - October 2014 Crawl Archive Available

November 20, 2014. October 2014 Crawl Archive Available. The crawl archive for October 2014 is now available! This crawl archive is over 254TB in size and contains 3.72 billion webpages. Stephen Merity.

Common Crawl - Blog - January 2015 Crawl Archive Available

March 4, 2015. January 2015 Crawl Archive Available. The crawl archive for January 2015 is now available! This crawl archive is over 139TB in size and contains 1.82 billion webpages. Stephen Merity.

Common Crawl - Blog - May 2015 Crawl Archive Available

July 8, 2015. May 2015 Crawl Archive Available. The crawl archive for May 2015 is now available! This crawl archive is over 159TB in size and holds more than 2.05 billion webpages. Stephen Merity.

Common Crawl - Blog - August 2014 Crawl Data Available

September 22, 2014. August 2014 Crawl Data Available. The August crawl of 2014 is now available! The new dataset is over 200TB in size containing approximately 2.8 billion webpages. Stephen Merity.

Common Crawl - Blog - 2012 Crawl Data Now Available

July 16, 2012. 2012 Crawl Data Now Available. I am very happy to announce that Common Crawl has released 2012 crawl data as well as a number of significant enhancements to our example library and help pages. Common Crawl Foundation.

Common Crawl - Blog - June 2015 Crawl Archive Available

July 23, 2015. June 2015 Crawl Archive Available. The crawl archive for June 2015 is now available! This crawl archive is over 131TB in size and holds more than 1.67 billion webpages. Stephen Merity.

Common Crawl - Blog - March 2015 Crawl Archive Available

May 20, 2015. March 2015 Crawl Archive Available. The crawl archive for March 2015 is now available! This crawl archive is over 124TB in size and holds more than 1.64 billion webpages. Stephen Merity.

Common Crawl - Blog - November 2014 Crawl Archive Available

December 24, 2014. November 2014 Crawl Archive Available. The crawl archive for November 2014 is now available! This crawl archive is over 135TB in size and contains 1.95 billion webpages. Stephen Merity.

Common Crawl - Blog - July 2015 Crawl Archive Available

August 15, 2015. July 2015 Crawl Archive Available. The crawl archive for July 2015 is now available! This crawl archive is over 145TB in size and holds more than 1.81 billion webpages. Stephen Merity.

Common Crawl - Blog - April 2015 Crawl Archive Available

May 28, 2015. April 2015 Crawl Archive Available. The crawl archive for April 2015 is now available! This crawl archive is over 168TB in size and holds more than 2.11 billion webpages. Stephen Merity.

Common Crawl - Blog - April 2014 Crawl Data Available

July 16, 2014. April 2014 Crawl Data Available. The April crawl of 2014 is now available! The new dataset is over 183TB in size containing approximately 2.6 billion webpages. Stephen Merity.

Common Crawl - Blog - August 2015 Crawl Archive Available

October 10, 2015. August 2015 Crawl Archive Available. The crawl archive for August 2015 is now available! This crawl archive is over 149TB in size and holds more than 1.84 billion webpages. Stephen Merity.

Common Crawl - Team - Lisa Green

Lisa was Chief of Staff at Creative Commons and served as the director of Common Crawl from 2011 to 2015. She holds a PhD in physical chemistry from the University of California Berkeley.

Common Crawl - Blog - January 2017 Crawl Archive Now Available

February 1, 2017. January 2017 Crawl Archive Now Available. The crawl archive for January 2017 is now available! The archive contains more than 3.14 billion web pages and about 250 TiB of uncompressed content. Sebastian Nagel.

Common Crawl - Blog - March 2019 crawl archive now available

April 1, 2019. March 2019 crawl archive now available. The crawl archive for March 2019 is now available! It contains 2.55 billion web pages or 210 TiB of uncompressed content, crawled between March 18th and 27th. Sebastian Nagel.

Common Crawl - Blog - January 2019 crawl archive now available

January 28, 2019. January 2019 crawl archive now available. The crawl archive for January 2019 is now available! It contains 2.85 billion web pages or 240 TiB of uncompressed content, crawled between January 15th and 24th. Sebastian Nagel.

Common Crawl - Blog - May 2018 Crawl Archive Now Available

June 1, 2018. May 2018 Crawl Archive Now Available. The crawl archive for May 2018 is now available! The archive contains 2.75 billion web pages and 215 TiB of uncompressed content, crawled between May 20th and 28th. Sebastian Nagel.

Common Crawl - Blog - May 2019 crawl archive now available

May 31, 2019. May 2019 crawl archive now available. The crawl archive for May 2019 is now available! It contains 2.65 billion web pages or 220 TiB of uncompressed content, crawled between May 19th and 27th. Sebastian Nagel.

Common Crawl - Blog - June 2019 crawl archive now available

July 2, 2019. June 2019 crawl archive now available. The crawl archive for June 2019 is now available! It contains 2.6 billion web pages or 220 TiB of uncompressed content, crawled between June 16th and 27th with an operational break from 21st to 24th.

Common Crawl - Blog - July 2017 Crawl Archive Now Available

July 31, 2017. July 2017 Crawl Archive Now Available. The crawl archive for July 2017 is now available! The archive contains 2.89 billion+ web pages and over 240 TiB of uncompressed content. Sebastian Nagel.

Common Crawl - Blog - December 2019 crawl archive now available

December 19, 2019. December 2019 crawl archive now available. The crawl archive for December 2019 is now available! It contains 2.45 billion web pages or 234 TiB of uncompressed content, crawled between December 5th and 16th.

Common Crawl - Blog - June 2017 Crawl Archive Now Available

July 4, 2017. June 2017 Crawl Archive Now Available. The crawl archive for June 2017 is now available! The archive contains 3.16 billion+ web pages and over 260 TiB of uncompressed content. Sebastian Nagel.

Common Crawl - Blog - November 2017 Crawl Archive Now Available

November 29, 2017. November 2017 Crawl Archive Now Available. The crawl archive for November 2017 is now available! The archive contains 3.2 billion web pages and 260 TiB of uncompressed content. Sebastian Nagel.

Common Crawl - Blog - December 2018 crawl archive now available

December 22, 2018. December 2018 crawl archive now available. The crawl archive for December 2018 is now available! It contains 3.1 billion web pages or 250 TiB of uncompressed content, crawled between December 9th and 19th. Sebastian Nagel.

Common Crawl - Blog - 5 Good Reads in Big Open Data: March 13 2015

March 13, 2015. 5 Good Reads in Big Open Data: March 13 2015. Jürgen Schmidhuber - Ask Me Anything - via Reddit: Jürgen has pioneered self-improving general problem solvers and Deep Learning Neural Networks for decades.

Common Crawl - Blog - 5 Good Reads in Big Open Data: March 26 2015

March 26, 2015. 5 Good Reads in Big Open Data: March 26 2015.

Common Crawl - Blog - Please Donate To Common Crawl!

December 10, 2014. Please Donate To Common Crawl! Big data has the potential to change the world. The talent exists and the tools are already there. What’s lacking is access to data.

Common Crawl - Blog - July 2019 crawl archive now available

July 30, 2019. July 2019 crawl archive now available. The crawl archive for July 2019 is now available! It contains 2.6 billion web pages or 220 TiB of uncompressed content, crawled between July 15th and 24th. Sebastian Nagel.

Common Crawl - Blog - April 2017 Crawl Archive Now Available

May 9, 2017. April 2017 Crawl Archive Now Available. The crawl archive for April 2017 is now available! The archive contains 2.94 billion+ web pages and over 250 TiB of uncompressed content. Sebastian Nagel.