
Accessing the Data

Common Crawl is an attempt to create an open and accessible crawl of the web. This document describes the steps required to access the latest Common Crawl corpus. You can access the Hadoop classes and other code in our GitHub repository.

Background
Common Crawl is a web-scale crawl, and as such each version of our crawl contains billions of documents from the various sites we are able to crawl successfully. This dataset can be tens of terabytes in size, making transfer of the crawl to interested third parties costly and impractical. In addition, processing a dataset this large requires parallel processing techniques and a potentially large compute cluster.

Fortunately, Amazon's EC2/S3 cloud computing infrastructure provides both effectively unlimited storage capacity and localized access to an elastic compute cloud. This eliminates the need for us to transfer the data to multiple distribution points, and it allows a wide audience of users to draw on Amazon's flexible computing resources to iterate on and extract value from the crawl. The last part of this equation is Hadoop, a technology specifically developed to process large datasets using a cluster of machines. Common Crawl provides the glue code required to launch Hadoop jobs on EC2 that run against versions of our crawl corpus residing in Amazon S3 buckets. By using EC2 compute resources to access the S3-resident data, end users bypass costly network transfer charges and can focus their budget on scaling their compute resources to meet their specific job requirements.

Data Processing Using Hadoop
To access the Common Crawl data, you run a MapReduce job against it; since the corpus resides on S3, you can do so with a Hadoop cluster running on Amazon's EC2 service. This involves building a custom Hadoop JAR that uses our InputFormat class to pull data from the individual ARC files in our S3 bucket.
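As a rough illustration, the sketch below shows how such a job might be configured using the classic org.apache.hadoop.mapred API. The ARCInputFormat class name comes from this document, but its package, the s3n:// path scheme, and the MimeTypeCountMapper class (sketched under "Support For Hadoop" below) are assumptions; substitute the actual classes and paths from our GitHub repository.

    // Hypothetical job setup for reading Common Crawl ARC data from S3.
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class CrawlJob {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(CrawlJob.class);
        conf.setJobName("commoncrawl-example");

        // Pull records straight out of the ARC files in S3.
        // ARCInputFormat's package is omitted here; import it from our GitHub repository.
        conf.setInputFormat(ARCInputFormat.class);
        conf.setMapperClass(MimeTypeCountMapper.class);  // see the Mapper sketch below

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);

        // The bucket is described under "Location of Current Crawl" below;
        // the path inside the bucket is a placeholder.
        FileInputFormat.addInputPath(conf, new Path("s3n://commoncrawl-crawl-002/..."));
        FileOutputFormat.setOutputPath(conf, new Path(args[0]));

        JobClient.runJob(conf);
      }
    }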

Economics of Accessing the Crawl 
The Common Crawl Foundation's goal is to facilitate broad access to the crawl, not to monetize it. We therefore provide unrestricted access to our crawl buckets. As long as you access the bucket contents from within the Amazon infrastructure, you should not incur any per-item access or bandwidth charges. You will, of course, be responsible for any charges associated with spinning up your own EC2 cluster or storing data in your own S3 bucket.

Terms of Use
Please refer to the Common Crawl Terms of Use document for a detailed, authoritative description of the terms under which the data is made available. In general: you cannot republish the data retrieved from the crawl (unless allowed by fair use), you cannot resell access to the service, you cannot use the crawl data for any illegal purpose, and you must respect the Terms of Use of the sites we crawl.

The ARC File Format 
The ARC file format used by Common Crawl was developed by the Internet Archive to store their archived crawl data. It is essentially a multi-part gzip file, in which each entry in the master gzip (ARC) file is an independent gzip stream in itself. The first entry in the ARC file contains the ARC file header itself and can, for all intents and purposes, be ignored. Each subsequent entry in the ARC file is an actual crawled document and consists of the following parts:

1. A metadata line of the format: <Document URI (utf8/urlencoded)><SPACE><IPv4 Address of the Source Server><SPACE><Fetch Time in the format yyyyMMddHHmmss><SPACE><Content Type (mimetype)><SPACE><Remaining bytes in Record (Header + Content)><LF>
2. The HTTP response header bytes, as received from the origin server, in their entirety (each header line delimited by \r\n, with a trailing \r\n terminating the headers).
3. The content bytes in the source encoding of the document.
4. A single line separator <\n>.

Because the entries are concatenated gzip streams, you can also use a tool like zcat to spill the contents of an ARC file to stdout.
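Once a record has been decompressed, the metadata line is simple to parse. The following is a minimal, self-contained sketch (not part of our codebase) that splits one metadata line into the five fields described above; the class name, field names, and example values are illustrative.

    // Parses one ARC metadata line, e.g.:
    //   http://example.com/ 192.0.2.1 20110101120000 text/html 12345
    // (the URL, IP address, and length above are illustrative values)
    public final class ArcMetadataLine {
      public final String uri;          // Document URI (utf8/urlencoded)
      public final String ipAddress;    // IPv4 address of the source server
      public final String fetchTime;    // yyyyMMddHHmmss
      public final String contentType;  // mimetype
      public final long recordLength;   // remaining bytes in record (header + content)

      private ArcMetadataLine(String uri, String ip, String time, String type, long length) {
        this.uri = uri;
        this.ipAddress = ip;
        this.fetchTime = time;
        this.contentType = type;
        this.recordLength = length;
      }

      /** Splits a metadata line into its five space-delimited fields. */
      public static ArcMetadataLine parse(String line) {
        String[] parts = line.trim().split(" ");
        if (parts.length != 5) {
          throw new IllegalArgumentException("Unexpected ARC metadata line: " + line);
        }
        return new ArcMetadataLine(parts[0], parts[1], parts[2], parts[3],
                                   Long.parseLong(parts[4]));
      }
    }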

Location of Current Crawl
The latest Common Crawl data is stored in our S3 bucket located at: http://s3.amazonaws.com/commoncrawl-crawl-002/. This bucket is marked with Amazon's Requester-Pays flag, which means that all access to the bucket contents requires an HTTP request signed with your Amazon credentials. The bucket contents are accessible to everyone, but the Requester-Pays restriction ensures that if you access the contents of the bucket from outside the EC2 network, you are responsible for the resulting access charges. You do not pay access charges if you access the bucket from within EC2, for example via a MapReduce job, but you still have to sign your request. Details of the Requester-Pays API can be found here: http://docs.amazonwebservices.com/AmazonS3/latest/dev/index.html?RequesterPaysBuckets.html
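For access from outside a Hadoop job, the sketch below shows one way to issue a signed Requester-Pays GET using the AWS SDK for Java 1.x. The exact SDK calls depend on your SDK version, and the object key shown is a placeholder; list the bucket to find real keys.

    // Hypothetical signed Requester-Pays GET against the crawl bucket.
    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.s3.model.GetObjectRequest;
    import com.amazonaws.services.s3.model.S3Object;

    public class RequesterPaysExample {
      public static void main(String[] args) {
        // args[0] = AWS access key ID, args[1] = AWS secret key
        AmazonS3Client s3 = new AmazonS3Client(new BasicAWSCredentials(args[0], args[1]));

        // The trailing "true" marks the request as Requester-Pays, so the
        // signed request is billed to your account when made outside EC2.
        GetObjectRequest request = new GetObjectRequest(
            "commoncrawl-crawl-002", "path/to/some/file.arc.gz", true);  // key is a placeholder

        S3Object object = s3.getObject(request);
        System.out.println("Content length: " + object.getObjectMetadata().getContentLength());
      }
    }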

Support For Hadoop 
We provide Hadoop support in the form of the ARCInputFormat class, which implements the Hadoop InputFormat interface. You specify and configure our InputFormat class in your Hadoop job config, and you implement a Hadoop Mapper that takes a Hadoop Text object as its key (the document URI) and an ArcFileItem (an abstraction of a single document within an ARC file) as its value.
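As a sketch, a Mapper that counts crawled documents by MIME type might look like the following. The Text key and ArcFileItem value types come from this document, but the getMimeType() accessor is an assumption; check the ArcFileItem class in our GitHub repository for the actual API.

    // Hypothetical Mapper over ARC records (classic org.apache.hadoop.mapred API).
    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class MimeTypeCountMapper extends MapReduceBase
        implements Mapper<Text, ArcFileItem, Text, LongWritable> {

      private static final LongWritable ONE = new LongWritable(1);

      @Override
      public void map(Text url, ArcFileItem item,
                      OutputCollector<Text, LongWritable> output, Reporter reporter)
          throws IOException {
        // Emit (mime type, 1) so a summing reducer can count documents per type.
        output.collect(new Text(item.getMimeType()), ONE);  // getMimeType() is assumed
      }
    }

Pair a Mapper like this with a summing Reducer (for example, Hadoop's built-in LongSumReducer) and the job configuration sketched under "Data Processing Using Hadoop" above.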