Common Crawl Enters A New Phase

A little under four years ago, Gil Elbaz formed the Common Crawl Foundation. He was driven by a desire to ensure a truly open web. He knew that decreasing storage and bandwidth costs, along with the increasing ease of crunching big data, made building and maintaining an open repository of web crawl data feasible. More important than the fact that it could be built was his powerful belief that it should be built. The web is the largest collection of information in human history, and web crawl data provides an immensely rich corpus for scientific research, technological advancement, and business innovation. Gil started the Common Crawl Foundation to act on the belief that it is crucial to our information-based society that web crawl data be open and accessible to anyone who wants to use it.

That was the inspiration phase of Common Crawl – one person with a passion for openness forming a new foundation to work towards democratizing access to web information, thereby driving a new wave of innovation. Common Crawl quickly moved into the building phase, as Gil found others who shared his belief in the open web. In 2008, Carl Malamud and Nova Spivack joined Gil to form the Common Crawl board of directors. Talented engineer Ahad Rana began developing the technology for our crawler and processing pipeline. Today, thanks to the robust system that Ahad has built, we have an open repository of crawl data that covers approximately 5 billion pages and includes valuable metadata, such as page rank and link graphs. All of our data is stored on Amazon’s S3 and is accessible to anyone via EC2.

Common Crawl is now entering the next phase – spreading the word about the open system we have built and how people can use it. We are actively seeking partners who share our vision of the open web. We want to collaborate with individuals, academic groups, small start-ups, big companies, governments and nonprofits.

Over the next several months, we will be expanding our website and using this blog to describe our technology and data, communicate our philosophy, share ideas, and report on the products of our collaborations. We will also be working to build up a GitHub repository of code that has been and can be used to work with Common Crawl data. Most important, we will be talking with the community of people who share our interests. Thinking about an application you’d like to see built on Common Crawl data? Have Hadoop scripts that could be adapted to find insightful information in the crawl data? Know of a stimulating meetup, conference or hackathon we should attend? We want to hear from you!

This is the phase where the original vision truly comes to life, and the ideas Gil Elbaz had years ago will be converted to new products and insights. To say it is an exciting time is a tremendous understatement.

Video: Gil Elbaz at Web 2.0 Summit 2011

Hear Common Crawl founder Gil Elbaz discuss why data accessibility is crucial to increasing rates of innovation, and share ideas on how to facilitate increased access to data.