Start an instance of the Common Crawl Amazon Machine Image (AMI) on Amazon EC2. The instance demonstrates how to submit Common Crawl data processing jobs to your own Hadoop cluster or to Amazon’s Elastic MapReduce service.
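As a rough illustration of the first step, the sketch below uses Python with the boto3 library to start a single EC2 instance from an AMI. The AMI ID and instance type are hypothetical placeholders, not the real Common Crawl values, and the function names are invented for this example. Look up the actual AMI ID before running anything.

```python
from typing import Any, Dict

# Hypothetical placeholders: look up the real Common Crawl AMI ID and pick
# an instance type that suits your workload before launching anything.
PLACEHOLDER_AMI_ID = "ami-00000000"
PLACEHOLDER_INSTANCE_TYPE = "t3.micro"


def build_launch_request(ami_id: str, instance_type: str) -> Dict[str, Any]:
    """Build keyword arguments for the boto3 EC2 client's run_instances call."""
    if not ami_id.startswith("ami-"):
        raise ValueError(f"expected an AMI ID starting with 'ami-', got {ami_id!r}")
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": 1,  # launch exactly one instance
        "MaxCount": 1,
    }


def launch(request: Dict[str, Any], region: str) -> str:
    """Start the instance and return its ID (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the request builder works without it

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.run_instances(**request)
    return response["Instances"][0]["InstanceId"]
```

With credentials configured, a call such as `launch(build_launch_request(PLACEHOLDER_AMI_ID, PLACEHOLDER_INSTANCE_TYPE), region="us-east-1")` would start the instance. The region shown is also an assumption. Keeping the request builder separate from the launch call lets you inspect the parameters before any billable resource is created.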
- where you can get the data
- what file formats the data is available in
Our inspiration and ideas page describes projects that use Common Crawl data and may inspire you to start your own.