Search results
December 24, 2014. November 2014 Crawl Archive Available. The crawl archive for November 2014 is now available! This crawl archive is over 135TB in size and contains 1.95 billion webpages. Stephen Merity.…
August 7, 2014. July 2014 Crawl Data Available. The July crawl of 2014 is now available! The new dataset is over 266TB in size containing approximately 3.6 billion webpages. Stephen Merity.…
November 12, 2014. September 2014 Crawl Archive Available. The crawl archive for September 2014 is now available! This crawl archive is over 220TB in size and contains 2.98 billion webpages. Stephen Merity.…
November 20, 2014. October 2014 Crawl Archive Available. The crawl archive for October 2014 is now available! This crawl archive is over 254TB in size and contains 3.72 billion webpages. Stephen Merity.…
September 22, 2014. August 2014 Crawl Data Available. The August crawl of 2014 is now available! The new dataset is over 200TB in size containing approximately 2.8 billion webpages. Stephen Merity.…
January 9, 2015. December 2014 Crawl Archive Available. The crawl archive for December 2014 is now available! This crawl archive is over 160TB in size and contains 2.08 billion webpages. Stephen Merity.…
July 16, 2014. April 2014 Crawl Data Available. The April crawl of 2014 is now available! The new dataset is over 183TB in size containing approximately 2.6 billion webpages. Stephen Merity.…
March 26, 2014. March 2014 Crawl Data Now Available. The March crawl of 2014 is now available! The new dataset contains approximately 2.8 billion webpages and is about 223TB in size. Common Crawl Foundation.…
January 8, 2014. Winter 2013 Crawl Data Now Available. The second crawl of 2013 is now available! In late November, we published the data from the first crawl of 2013.…
December 10, 2014. Please Donate To Common Crawl! Big data has the potential to change the world. The talent exists and the tools are already there. What’s lacking is access to data.…
March 26, 2015. 5 Good Reads in Big Open Data: March 26 2015.…
May 22, 2017. Common Crawl's First In-House Web Graph. We are pleased to announce the release of a host-level web graph of recent monthly crawls (February, March, April 2017). The graph consists of 385 million nodes and 2.5 billion edges. Sebastian Nagel.…
With David Waxman, Gil launched TenOneTen Ventures in 2013 and raised their first true fund in 2014. He joined TenOneTen on a full time basis upon relinquishing his CEO role in 2020. Gil has also been very active in the non-profit arena.…
Pete Warden is CEO at Useful Sensors, was previously technical lead of the TensorFlow Micro team at Google, and founder of Jetpac, a deep learning technology startup acquired by Google in 2014.…
February 20, 2014. Common Crawl's Move to Nutch. Last year we transitioned from our custom crawler to the Apache Nutch crawler to run our 2013 crawls as part of our migration from our old data center to the cloud.…
February 18, 2015. WikiReverse- Visualizing Reverse Links with the Common Crawl Archive. This is a guest blog post by Ross Fairbanks, a software developer based in Barcelona. He mainly develops in Ruby and is interested in open data and cloud computing.…
August 29, 2014. Web Data Commons Extraction Framework for the Distributed Processing of CC Data.…
February 8, 2018. Host- and Domain-Level Web Graphs Nov/Dec/Jan 2017-2018. We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of November, December 2017 and January 2018.…
February 20, 2019. Host- and Domain-Level Web Graphs Nov/Dec/Jan 2018 - 2019. We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of November, December 2018 and January 2019.…
CC-MAIN-2016-36 to CC-MAIN-2016-50, and CC-MAIN-2018-34 to CC-MAIN-2019-47, the fetch_time metadata for robots.txt might be incorrect. The correct times can be found in collinfo.json.…
August 20, 2015. Web Image Size Prediction for Efficient Focused Image Crawling. This is a guest blog post by Katerina Andreadou, a research assistant at CERTH, specializing in multimedia analysis and web crawling.…
June 19, 2012. OSCON 2012. We're just one month away from one of the biggest and most exciting events of the year, O'Reilly's Open Source Convention (OSCON). This year's conference will be held July 16th-20th in Portland, Oregon. Allison Domicone.…
April 2, 2014. Navigating the WARC file format. Wait, what's WAT, WET and WARC? Recently CommonCrawl has switched to the Web ARChive (WARC) format.…
November 12, 2019. Host- and Domain-Level Web Graphs Aug/Sep/Oct 2019. We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of August, September and October 2019.…
February 4, 2014. Lexalytics Text Analysis Work with Common Crawl Data. This is a guest blog post by Oskar Singer, a Software Developer and Computer Science student at University of Massachusetts Amherst.…
Host- and Domain-Level Web Graphs Nov/Dec/Jan 2019 – 2020. We are pleased to announce a new release of host-level and domain-level web graphs based on the crawls of November, December 2019 and January 2020.…
August 8, 2019. Host- and Domain-Level Web Graphs May/June/July 2019. We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of May, June and July 2019.…
May 9, 2019. Host- and Domain-Level Web Graphs Feb/Mar/Apr 2019. We are pleased to announce a new release of host-level and domain-level web graphs based on the published crawls of February, March and April 2019.…
Overview of the original Common Crawl crawler (in use 2008-2013) discussing the Hadoop data processing pipeline, PageRank implementation, and the techniques used to optimize Hadoop. The Web of Data and Web Data Commons.…
April 1, 2015. Evaluating graph computation systems. This is a guest blog post by Frank McSherry, a computer science researcher active in the area of large scale data analysis.…
August 15, 2015. July 2015 Crawl Archive Available. The crawl archive for July 2015 is now available! This crawl archive is over 145TB in size and holds more than 1.81 billion webpages. Stephen Merity.…
May 28, 2015. April 2015 Crawl Archive Available. The crawl archive for April 2015 is now available! This crawl archive is over 168TB in size and holds more than 2.11 billion webpages. Stephen Merity.…
March 31, 2015. February 2015 Crawl Archive Available. The crawl archive for February 2015 is now available! This crawl archive is over 145TB in size and contains over 1.9 billion webpages. Stephen Merity.…
March 4, 2015. January 2015 Crawl Archive Available. The crawl archive for January 2015 is now available! This crawl archive is over 139TB in size and contains 1.82 billion webpages. Stephen Merity.…
July 8, 2015. May 2015 Crawl Archive Available. The crawl archive for May 2015 is now available! This crawl archive is over 159TB in size and holds more than 2.05 billion webpages. Stephen Merity.…
July 16, 2012. 2012 Crawl Data Now Available. I am very happy to announce that Common Crawl has released 2012 crawl data as well as a number of significant enhancements to our example library and help pages. Common Crawl Foundation.…
July 23, 2015. June 2015 Crawl Archive Available. The crawl archive for June 2015 is now available! This crawl archive is over 131TB in size and holds more than 1.67 billion webpages. Stephen Merity.…
May 20, 2015. March 2015 Crawl Archive Available. The crawl archive for March 2015 is now available! This crawl archive is over 124TB in size and holds more than 1.64 billion webpages. Stephen Merity.…
October 10, 2015. August 2015 Crawl Archive Available. The crawl archive for August 2015 is now available! This crawl archive is over 149TB in size and holds more than 1.84 billion webpages. Stephen Merity.…
November 16, 2015. September 2015 Crawl Archive Now Available. As an interim crawl engineer for CommonCrawl, I am pleased to announce that the crawl archive for September 2015 is now available!…
December 18, 2015. November 2015 Crawl Archive Now Available. As an interim crawl engineer for CommonCrawl, I am pleased to announce that the crawl archive for November 2015 is now available!…
September 16, 2016. August 2016 Crawl Archive Now Available. The crawl archive for August 2016 is now available! The archive contains more than 1.61 billion web pages. Sebastian Nagel. Sebastian is a Distinguished Engineer with Common Crawl.…
March 1, 2019. February 2019 crawl archive now available. The crawl archive for February 2019 is now available! It contains 2.9 billion web pages or 225 TiB of uncompressed content, crawled between February 15th and 24th. Sebastian Nagel.…
September 28, 2019. September 2019 crawl archive now available. The crawl archive for September 2019 is now available! It contains 2.55 billion web pages or 240 TiB of uncompressed content, crawled between September 15th and 24th.…
October 7, 2016. September 2016 Crawl Archive Now Available. The crawl archive for September 2016 is now available! The archive contains more than 1.72 billion web pages. Sebastian Nagel. Sebastian is a Distinguished Engineer with Common Crawl.…
October 3, 2018. September 2018 crawl archive now available. The crawl archive for September 2018 is now available! It contains 2.8 billion web pages and 220 TiB of uncompressed content, crawled between September 17th and 26th. Sebastian Nagel.…
September 29, 2017. September 2017 Crawl Archive Now Available. The crawl archive for September 2017 is now available! The archive contains 3.01 billion web pages and over 250 TiB of uncompressed content. Sebastian Nagel.…
January 29, 2018. January 2018 Crawl Archive Now Available. The crawl archive for January 2018 is now available! The archive contains 3.4 billion web pages and 270 TiB of uncompressed content, crawled between January 16th and 24th. Sebastian Nagel.…
July 2, 2018. June 2018 Crawl Archive Now Available. The crawl archive for June 2018 is now available! The archive contains 3.05 billion web pages and 235 TiB of uncompressed content, crawled between June 18th and 25th. Sebastian Nagel.…
August 9, 2016. July 2016 Crawl Archive Now Available. The crawl archive for July 2016 is now available! The archive contains more than 1.73 billion web pages. Sebastian Nagel. Sebastian is a Distinguished Engineer with Common Crawl.…
February 1, 2017. January 2017 Crawl Archive Now Available. The crawl archive for January 2017 is now available! The archive contains more than 3.14 billion web pages and about 250 TiB of uncompressed content. Sebastian Nagel.…
April 1, 2019. March 2019 crawl archive now available. The crawl archive for March 2019 is now available! It contains 2.55 billion web pages or 210 TiB of uncompressed content, crawled between March 18th and 27th. Sebastian Nagel.…
January 28, 2019. January 2019 crawl archive now available. The crawl archive for January 2019 is now available! It contains 2.85 billion web pages or 240 TiB of uncompressed content, crawled between January 15th and 24th. Sebastian Nagel.…
June 1, 2018. May 2018 Crawl Archive Now Available. The crawl archive for May 2018 is now available! The archive contains 2.75 billion web pages and 215 TiB of uncompressed content, crawled between May 20th and 28th. Sebastian Nagel.…
May 31, 2019. May 2019 crawl archive now available. The crawl archive for May 2019 is now available! It contains 2.65 billion web pages or 220 TiB of uncompressed content, crawled between May 19th and 27th. Sebastian Nagel.…
July 2, 2019. June 2019 crawl archive now available. The crawl archive for June 2019 is now available! It contains 2.6 billion web pages or 220 TiB of uncompressed content, crawled between June 16th and 27th with an operational break from 21st to 24th.…
November 7, 2016. October 2016 Crawl Archive Now Available. The crawl archive for October 2016 is now available! The archive contains more than 3.25 billion web pages. Sebastian Nagel. Sebastian is a Distinguished Engineer with Common Crawl.…
December 16, 2016. December 2016 Crawl Archive Now Available. The crawl archive for December 2016 is now available! The archive contains more than 2.85 billion web pages. Sebastian Nagel. Sebastian is a Distinguished Engineer with Common Crawl.…
July 31, 2017. July 2017 Crawl Archive Now Available. The crawl archive for July 2017 is now available! The archive contains 2.89 billion+ web pages and over 240 TiB of uncompressed content. Sebastian Nagel.…
December 19, 2019. December 2019 crawl archive now available. The crawl archive for December 2019 is now available! It contains 2.45 billion web pages or 234 TiB of uncompressed content, crawled between December 5th and 16th.…