Get Started

Amazon Machine Image

Start an instance of the Common Crawl Amazon Machine Image (AMI) on Amazon EC2. The instance walks you through submitting Common Crawl data processing jobs to your own Hadoop cluster or to Amazon's Elastic MapReduce service.

GitHub

If you already have your own Hadoop cluster or Java application, you can build the examples yourself. The Common Crawl User Library and Example Library are available on GitHub for anyone to use.

Discussion Board

Ask or answer questions in the Common Crawl Google Group.

You can also read about the Common Crawl data set on our wiki:

  • where you can get the data
  • what file formats the data is available in

If you have questions, check the FAQ first, then post your question on our discussion board. You can also read our blog to see how the project has progressed to date.

Our inspiration and ideas page shares information about projects that are using Common Crawl data, and we hope it will inspire you to start your own project.