Search results

Common Crawl - Blog - Common Crawl's First In-House Web Graph

May 22, 2017. Common Crawl's First In-House Web Graph. We are pleased to announce the release of a host-level web graph of recent monthly crawls (February, March, April 2017). The graph consists of 385 million nodes and 2.5 billion edges.

Common Crawl - Blog - 3.25 Billion Pages Crawled in July 2018

July 28, 2018. 3.25 Billion Pages Crawled in July 2018. The crawl archive for July 2018 is now available! The archive contains 3.25 billion web pages and 255 TiB of uncompressed content, crawled between July 15th and 23rd. Sebastian Nagel.

Common Crawl - Blog

Common Crawl Blog. The latest news, interviews, technologies, and resources.

Common Crawl - Blog - Index to WARC Files and URLs in Columnar Format

March 1, 2018. Index to WARC Files and URLs in Columnar Format. We're happy to announce the release of an index to WARC files and URLs in a columnar format.

Common Crawl - Blog - 5 Good Reads in Big Open Data: March 26 2015

March 26, 2015. 5 Good Reads in Big Open Data: March 26 2015.

Common Crawl - Blog - 5 Good Reads in Big Open Data: March 13 2015

March 13, 2015. 5 Good Reads in Big Open Data: March 13 2015. Jürgen Schmidhuber - Ask Me Anything - via Reddit: Jürgen has pioneered self-improving general problem solvers and Deep Learning Neural Networks for decades.

Common Crawl - Blog - 5 Good Reads in Big Open Data: Feb 6 2015

February 6, 2015. 5 Good Reads in Big Open Data: Feb 6 2015.

Common Crawl - Blog - Still time to participate in the Common Crawl code contest

August 7, 2012. Still time to participate in the Common Crawl code contest. There is still plenty of time left to participate in the Common Crawl code contest!

Common Crawl - Blog - 5 Good Reads in Big Open Data: Feb 13 2015

February 13, 2015. 5 Good Reads in Big Open Data: Feb 13 2015. What does it mean for the Open Web if users don't know they're on the internet? Via QUARTZ: “This is more than a matter of semantics.

Common Crawl - Blog - 5 Good Reads in Big Open Data: March 20 2015

March 20, 2015. 5 Good Reads in Big Open Data: March 20 2015.

Common Crawl - Blog - Big Data Week: meetups in SF and around the world

April 13, 2012. Big Data Week: meetups in SF and around the world. Big Data Week aims to connect data enthusiasts, technologists, and professionals across the globe through a series of meet-ups.

Common Crawl - Blog - 5 Good Reads in Big Open Data: Feb 20 2015

February 20, 2015. 5 Good Reads in Big Open Data: Feb 20 2015. A thriving ecosystem is the key for real viability of any technology.

Common Crawl - Blog - Gil Elbaz and Nova Spivack on This Week in Startups

January 12, 2012. Gil Elbaz and Nova Spivack on This Week in Startups.

Common Crawl - Blog - Video: This Week in Startups - Gil Elbaz and Nova Spivack

January 10, 2012. Video: This Week in Startups - Gil Elbaz and Nova Spivack.

Common Crawl - Blog - 5 Good Reads in Big Open Data: February 27 2015

February 27, 2015. 5 Good Reads in Big Open Data: February 27 2015.

Common Crawl - Blog - 5 Good Reads in Big Open Data: March 6 2015

March 6, 2015. 5 Good Reads in Big Open Data: March 6 2015. 2015: What do you think about Machines that think?

Common Crawl - Blog - Welcome, Sebastian!

May 13, 2016. Welcome, Sebastian! It is a pleasure to officially announce that Sebastian Nagel joined Common Crawl as Crawl Engineer in April.

Common Crawl - Blog - OSCON 2012

June 19, 2012. OSCON 2012. We're just one month away from one of the biggest and most exciting events of the year, O'Reilly's Open Source Convention (OSCON). This year's conference will be held July 16th-20th in Portland, Oregon.

Common Crawl - Blog - Analysis of the NCSU Library URLs in the Common Crawl Index

January 16, 2013. Analysis of the NCSU Library URLs in the Common Crawl Index. Last week we announced the Common Crawl URL Index.

Common Crawl - Blog - Amazon Web Services sponsoring $50 in credit to all contest entrants!

August 10, 2012. Amazon Web Services sponsoring $50 in credit to all contest entrants! Did you know that every entry to the First Ever Common Crawl Code Contest gets $50 in Amazon Web Services (AWS) credits?

Common Crawl - Blog - URL Search Tool!

March 5, 2013. URL Search Tool! A couple months ago we announced the creation of the Common Crawl URL Index and followed it up with a guest post by Jason Ronallo describing how he had used the URL Index.

Common Crawl - Blog - Web Data Commons

March 22, 2012. Web Data Commons.

Common Crawl - Blog - Data 2.0 Summit

March 28, 2012. Data 2.0 Summit. Next week a few members of the Common Crawl team are going to the Data 2.0 Summit in San Francisco. Common Crawl Foundation.

Common Crawl - Blog - News Dataset Available

October 4, 2016. News Dataset Available. We are pleased to announce the release of a new dataset containing news articles from news sites all over the world. Sebastian Nagel. Sebastian is a Distinguished Engineer with Common Crawl.

Common Crawl - Blog - MapReduce for the Masses: Zero to Hadoop in Five Minutes with Common Crawl

December 16, 2011. MapReduce for the Masses: Zero to Hadoop in Five Minutes with Common Crawl.

Common Crawl - Blog - Common Crawl's Advisory Board

February 15, 2012. Common Crawl's Advisory Board. As part of our ongoing effort to grow Common Crawl into a truly useful and innovative tool, we recently formed an Advisory Board to guide us in our efforts.

Common Crawl - Blog - Common Crawl Discussion List

November 29, 2011. Common Crawl Discussion List.

Common Crawl - Blog - New Crawl Data Available!

November 27, 2013. New Crawl Data Available! We are very pleased to announce that new crawl data is now available! The data was collected in 2013, contains approximately 2 billion web pages and is 102TB in size (uncompressed).

Common Crawl - Blog - Evaluating graph computation systems

April 1, 2015. Evaluating graph computation systems. This is a guest blog post by Frank McSherry, a computer science researcher active in the area of large scale data analysis.

Common Crawl - Blog - March/April 2024 Newsletter

March 26, 2024. March/April 2024 Newsletter. We're excited to share an update on some of our recent projects and initiatives in this newsletter! Common Crawl Foundation.

Common Crawl - Blog - Common Crawl URL Index

January 8, 2013. Common Crawl URL Index. We are thrilled to announce that Common Crawl now has a URL index! Scott Robertson, founder of triv.io, graciously donated his time and skills to creating this valuable tool. Scott Robertson.

Common Crawl - Blog - Strata Conference + Hadoop World

August 3, 2012. Strata Conference + Hadoop World. This year's Strata Conference teams up with Hadoop World for what promises to be a powerhouse convening in NYC from October 23-25.

Common Crawl - Blog - Interactive Webgraph Statistics Notebook Released

October 28, 2020. Interactive Webgraph Statistics Notebook Released.

Common Crawl - Blog - November 2014 Crawl Archive Available

December 24, 2014. November 2014 Crawl Archive Available. The crawl archive for November 2014 is now available! This crawl archive is over 135TB in size and contains 1.95 billion webpages. Stephen Merity.

Common Crawl - Blog - Navigating the WARC file format

April 2, 2014. Navigating the WARC file format. Wait, what's WAT, WET and WARC? Recently CommonCrawl has switched to the Web ARChive (WARC) format.

Common Crawl - Blog - July 2015 Crawl Archive Available

August 15, 2015. July 2015 Crawl Archive Available. The crawl archive for July 2015 is now available! This crawl archive is over 145TB in size and holds more than 1.81 billion webpages. Stephen Merity.

Common Crawl - Blog - April 2015 Crawl Archive Available

May 28, 2015. April 2015 Crawl Archive Available. The crawl archive for April 2015 is now available! This crawl archive is over 168TB in size and holds more than 2.11 billion webpages. Stephen Merity.

Common Crawl - Blog - June 2015 Crawl Archive Available

July 23, 2015. June 2015 Crawl Archive Available. The crawl archive for June 2015 is now available! This crawl archive is over 131TB in size and holds more than 1.67 billion webpages. Stephen Merity.

Common Crawl - Blog - December 2014 Crawl Archive Available

January 9, 2015. December 2014 Crawl Archive Available. The crawl archive for December 2014 is now available! This crawl archive is over 160TB in size and contains 2.08 billion webpages. Stephen Merity.

Common Crawl - Blog - Please Donate To Common Crawl!

December 10, 2014. Please Donate To Common Crawl! Big data has the potential to change the world. The talent exists and the tools are already there. What’s lacking is access to data.

Common Crawl - Blog - August 2014 Crawl Data Available

September 22, 2014. August 2014 Crawl Data Available. The August crawl of 2014 is now available! The new dataset is over 200TB in size containing approximately 2.8 billion webpages. Stephen Merity.

Common Crawl - Blog - March 2015 Crawl Archive Available

May 20, 2015. March 2015 Crawl Archive Available. The crawl archive for March 2015 is now available! This crawl archive is over 124TB in size and holds more than 1.64 billion webpages. Stephen Merity.

Common Crawl - Blog - 2012 Crawl Data Now Available

July 16, 2012. 2012 Crawl Data Now Available. I am very happy to announce that Common Crawl has released 2012 crawl data as well as a number of significant enhancements to our example library and help pages. Common Crawl Foundation.

Common Crawl - Blog - May 2015 Crawl Archive Available

July 8, 2015. May 2015 Crawl Archive Available. The crawl archive for May 2015 is now available! This crawl archive is over 159TB in size and holds more than 2.05 billion webpages. Stephen Merity.

Common Crawl - Blog - Announcing the Common Crawl Index!

April 8, 2015. Announcing the Common Crawl Index! This is a guest post by Ilya Kreymer, a dedicated volunteer who has gifted large amounts of time, effort and talent to Common Crawl.

Common Crawl - Blog - January 2015 Crawl Archive Available

March 4, 2015. January 2015 Crawl Archive Available. The crawl archive for January 2015 is now available! This crawl archive is over 139TB in size and contains 1.82 billion webpages. Stephen Merity.

Common Crawl - Blog - Answers to Recent Community Questions

November 16, 2011. Answers to Recent Community Questions. In this post we respond to the most common questions. Thanks for all the support and please keep the questions coming! Common Crawl Foundation.

Common Crawl - Blog - August 2015 Crawl Archive Available

October 10, 2015. August 2015 Crawl Archive Available. The crawl archive for August 2015 is now available! This crawl archive is over 149TB in size and holds more than 1.84 billion webpages. Stephen Merity.

Common Crawl - Blog - April 2014 Crawl Data Available

July 16, 2014. April 2014 Crawl Data Available. The April crawl of 2014 is now available! The new dataset is over 183TB in size containing approximately 2.6 billion webpages. Stephen Merity.

Common Crawl - Blog - Winners of the Code Contest!

September 18, 2012. Winners of the Code Contest! We’re very excited to announce the winners of the First Ever Common Crawl Code Contest! We were thrilled by the response to the contest and the many great entries.

Common Crawl - Blog - Oct/Nov 2023 Performance Issues

November 15, 2023. Oct/Nov 2023 Performance Issues. Our datasets have become very popular over time, with downloads doubling every 6 months for several years in a row.

Common Crawl - Blog - Web Archiving File Formats Explained

March 1, 2024. Web Archiving File Formats Explained. In the ever-evolving landscape of digital archiving and data analysis, it is helpful to understand the various file formats used for web crawling.

Common Crawl - Blog - July 2014 Crawl Data Available

August 7, 2014. July 2014 Crawl Data Available. The July crawl of 2014 is now available! The new dataset is over 266TB in size containing approximately 3.6 billion webpages. Stephen Merity.

Common Crawl - Blog - February 2015 Crawl Archive Available

March 31, 2015. February 2015 Crawl Archive Available. The crawl archive for February 2015 is now available! This crawl archive is over 145TB in size and contains over 1.9 billion webpages. Stephen Merity.

Common Crawl - Blog - Common Crawl's Move to Nutch

February 20, 2014. Common Crawl's Move to Nutch. Last year we transitioned from our custom crawler to the Apache Nutch crawler to run our 2013 crawls as part of our migration from our old data center to the cloud.

Common Crawl - Blog - October 2014 Crawl Archive Available

November 20, 2014. October 2014 Crawl Archive Available. The crawl archive for October 2014 is now available! This crawl archive is over 254TB in size and contains 3.72 billion webpages. Stephen Merity.

Common Crawl - Blog - September 2014 Crawl Archive Available

November 12, 2014. September 2014 Crawl Archive Available. The crawl archive for September 2014 is now available! This crawl archive is over 220TB in size and contains 2.98 billion webpages. Stephen Merity.

Common Crawl - Blog - September 2017 Crawl Archive Now Available

September 29, 2017. September 2017 Crawl Archive Now Available. The crawl archive for September 2017 is now available! The archive contains 3.01 billion web pages and over 250 TiB of uncompressed content. Sebastian Nagel.

Common Crawl - Blog - August 2016 Crawl Archive Now Available

September 16, 2016. August 2016 Crawl Archive Now Available. The crawl archive for August 2016 is now available! The archive contains more than 1.61 billion web pages. Sebastian Nagel.

Common Crawl - Blog - September 2016 Crawl Archive Now Available

October 7, 2016. September 2016 Crawl Archive Now Available. The crawl archive for September 2016 is now available! The archive contains more than 1.72 billion web pages. Sebastian Nagel.