Announcing Common Crawl

Several years ago my friend Gil Elbaz (CEO of Factual; forefather of Google AdWords) approached me with an ambitious vision – he wanted to create an open not-for-profit crawl of the Web to ensure that everyone would have equal access to a Web-scale search index to build on and experiment with.

Search giants like Google and Microsoft were not likely to provide open access to their search indices because they couldn’t risk giving their crown jewels to potential competitors, and furthermore they were bound by the constraints of for-profit business models.

Gil felt that in the future it would be important to provide a truly open Web-scale search index that was not controlled by a for-profit company or bound by profit motives. Such an index would make it possible for startups to innovate in search, and for researchers and students to explore Web Science at scale. It would also level the playing field in search and distribute the index, preventing any one company from monopolizing the index of humanity’s knowledge.

As a longtime advocate of the open Web, I was excited by the vision Gil shared with me, and agreed to join the board of directors of what became The Common Crawl Foundation, along with Carl Malamud. Gil and lead engineer Ahad Rana then went to work actually building the thing. This was no small undertaking, and it required quite a bit of innovation and ingenuity. You can read about the cloud-based solution that was developed here.

Several years later, after a lot of work, it’s getting ready for prime time, and so we’re happy to announce the Web’s first truly open, non-profit, 5-billion-page search index!

With the recent addition of our director, Lisa Green, from Creative Commons, Common Crawl is now beginning a new phase in its rollout, and a new phase for the open Web. You can read our inaugural blog post announcing the project here.

We hope you will come in and take a look around, and we look forward to seeing what you dream up and build with this data set.