Navigating the WARC file format

Wait, what’s WAT, WET and WARC?

Recently CommonCrawl has switched to the Web ARChive (WARC) format. The WARC format allows for more efficient storage and processing of CommonCrawl’s free multi-billion page web archives, which can be hundreds of terabytes in size.

This document aims to give you an introduction to working with the new format, specifically the difference between:

  • WARC files which store the raw crawl data
  • WAT files which store computed metadata for the data stored in the WARC
  • WET files which store extracted plaintext from the data stored in the WARC

If you want all the nitty gritty details, the best source is the ISO standard, for which the final draft is available.

If you’re more interested in diving into code, we’ve provided three introductory examples in Java that use the Hadoop framework to process WAT, WET and WARC.

WARC Format

The WARC format is the raw data from the crawl, providing a direct mapping to the crawl process. Not only does the format store the HTTP response from the websites it contacts (WARC-Type: response), it also stores information about how that information was requested (WARC-Type: request) and metadata on the crawl process itself (WARC-Type: metadata).

For the HTTP responses themselves, the raw response is stored. This not only includes the response itself, what you would get if you downloaded the file, but also the HTTP header information, which can be used to glean a number of interesting insights.

In the example below, we can see the crawler contacted http://102jamzorlando.cbslocal.com/tag/nba/page/2/ and received a HTML page in response. We can also see the page was served from the nginx web server and that a special header has been added, X-hacker, purely for the purposes of advertising to a very specific audience of programmers who might look at the HTTP headers!

WARC/1.0
WARC-Type: response
WARC-Date: 2013-12-04T16:47:32Z
WARC-Record-ID: 
Content-Length: 73873
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: 
WARC-Concurrent-To: 
WARC-IP-Address: 23.0.160.82
WARC-Target-URI: http://102jamzorlando.cbslocal.com/tag/nba/page/2/
WARC-Payload-Digest: sha1:FXV2BZKHT6SQ4RZWNMIMP7KMFUNZMZFB
WARC-Block-Digest: sha1:GMYFZYSACNBEGHVP3YFQNOSTV5LPXNAU

HTTP/1.0 200 OK
Server: nginx
Content-Type: text/html; charset=UTF-8
Vary: Accept-Encoding
Vary: Cookie
X-hacker: If you're reading this, you should visit automattic.com/jobs and apply to join the fun, mention this header.
Content-Encoding: gzip
Date: Wed, 04 Dec 2013 16:47:32 GMT
Content-Length: 18953
Connection: close


...HTML Content...

WAT Response Format

WAT files contain important metadata about the records stored in the WARC format above. This metadata is computed for each of the three types of records (metadata, request, and response). If the information crawled is HTML, the computed metadata includes the HTTP headers returned and the links (including the type of link) listed on the page.

This information is stored as JSON. To keep the file sizes as small as possible, the JSON is stored with all unnecessary whitespace stripped, resulting in a relatively unreadable format for humans. If you want to inspect the JSON file yourself, use one of the many JSON pretty print tools available.

The HTTP response metadata is most likely to be of interest to CommonCrawl users. The skeleton of the JSON format is outlined below.

  • Envelope
    • WARC-Header-Metadata
    • Payload-Metadata
      • HTTP-Response-Metadata
        • Headers
          • HTML-Metadata
            • Head
              • Title
              • Scripts
              • Metas
              • Links
            • Links
    • Container

WET Response Format

As many tasks only require textual information, the CommonCrawl dataset provides WET files that only contain extracted plaintext. The way in which this textual data is stored in the WET format is quite simple. The WARC metadata contains various details, including the URL and the length of the plaintext data, with the plaintext data following immediately afterwards.

WARC/1.0
WARC-Type: conversion
WARC-Target-URI: http://advocatehealth.com/condell/emergencyservices3
WARC-Date: 2013-12-04T15:30:35Z
WARC-Record-ID: 
WARC-Refers-To: 
WARC-Block-Digest: sha1:3SJBHMFPOCUJEHJ7OMGVCRSHQTWLJUUS
Content-Type: text/plain
Content-Length: 5765


...Text Content...

Processing the file format

We’ve provided three introductory examples in Java for the Hadoop framework. The code also contains wrapper tools for making working with the Web Archive Commons library easier in Hadoop.

These introductory examples include:

  • Count the number of times varioustags are used across HTML on the internet using the WARC files
  • Counting the number of different server types found in the HTTP headers using the WAT files
  • Word count over the extracted plaintext found in the WET files

If you’re using a different language, there are a number of open source libraries that handle processing these WARC files and the content they contain. These include:

If in doubt, the tools provided as part of the IIPC’s Web Archive Commons library are the preferred implementation.

Stephen MerityThis is a guest blog post by Stephen Merity.
Stephen Merity is a Computational Science and Engineering master’s candidate at Harvard University. His graduate work centers around machine learning and data analysis on large data sets. Prior to Harvard, Stephen worked as a software engineer for Freelancer.com and as a software engineer for online education start-up Grok Learning. Stephen has a Bachelor of Information Technology (Honours First Class with University Medal) from the University of Sydney in Australia.

Video Tutorial: MapReduce for the Masses

Learn how you can harness the power of MapReduce data analysis against the Common Crawl dataset with nothing more than five minutes of your time, a bit of local configuration, and 25 cents. Check out the full blog post where this video originally appeared.

MapReduce for the Masses: Zero to Hadoop in Five Minutes with Common Crawl

Common Crawl aims to change the big data game with our repository of over 40 terabytes of high-quality web crawl information into the Amazon cloud, the net total of 5 billion crawled pages. In this blog post, we’ll show you how you can harness the power of MapReduce data analysis against the Common Crawl dataset with nothing more than five minutes of your time, a bit of local configuration, and 25 cents.

When Google unveiled its MapReduce algorithm to the world in an academic paper in 2004, it shook the very foundations of data analysis. By establishing a basic pattern for writing data analysis code that can run in parallel against huge datasets, speedy analysis of data at massive scale finally became a reality, turning many orthodox notions of data analysis on their head.

With the advent of the Hadoop project, it became possible for those outside the Googleplex to tap into the power of the MapReduce pattern, but one outstanding question remained: where do we get the source data to feed this unbelievably powerful tool?

This is the very question we hope to answer with this blog post, and the example we’ll use to demonstrate how is a riff on the canonical Hadoop Hello World program, a simple word counter, but the twist is that we’ll be running it against the Internet.

When you’ve got a taste of what’s possible when open source meets open data, we’d like to whet your appetite by asking you to remix this code. Show us what you can do with Common Crawl and stay tuned as we feature some of the results!

Ready to get started?  Watch our screencast and follow along below:

Step 1 – Install Git and Eclipse

We first need to install a few important tools to get started:

Eclipse (for writing Hadoop code)

How to install (Windows and OS X):

Download the “Eclipse IDE for Java developers” installer package located at:

http://www.eclipse.org/downloads/

How to install (Linux):

Run the following command in a terminal:

RHEL/Fedora

 # sudo yum install eclipse

Ubuntu/Debian

 # sudo apt-get install eclipse

Git (for retrieving our sample application)

How to install (Windows)

Install the latest .EXE from:

http://code.google.com/p/msysgit/downloads/list

How to install (OS X)

Install the appropriate .DMG from:

http://code.google.com/p/git-osx-installer/downloads/list

How to install (Linux)

Run the following command in a terminal:

RHEL/Fedora

# sudo yum install git

Ubuntu/Debian

# sudo apt-get install git

Step 2 – Check out the code and compile the HelloWorld JAR

Now that you’ve installed the packages you need to play with our code, run the following command from a terminal/command prompt to pull down the code:

# git clone git://github.com/ssalevan/cc-helloworld.git

Next, start Eclipse.  Open the File menu then select “Project” from the “New” menu.  Open the “Java” folder and select “Java Project from Existing Ant Buildfile”.  Click Browse, then locate the folder containing the code you just checked out (if you didn’t change the directory when you opened the terminal, it should be in your home directory) and select the “build.xml” file.  Eclipse will find the right targets, and tick the “Link to the buildfile in the file system” box, as this will enable you to share the edits you make to it in Eclipse with git.

We now need to tell Eclipse how to build our JAR, so right click on the base project folder (by default it’s named “Hello World”) and select “Properties” from the menu that appears.  Navigate to the Builders tab in the left hand panel of the Properties window, then click “New”.  Select “Ant Builder” from the dialog which appears, then click OK.

To configure our new Ant builder, we need to specify three pieces of information here: where the buildfile is located, where the root directory of the project is, and which ant build target we wish to execute.  To set the buildfile, click the “Browse File System” button under the “Buildfile:” field, and find the build.xml file which you found earlier.  To set the root directory, click the “Browse File System” button under the “Base Directory:” field, and select the folder into which you checked out our code.  To specify the target, enter “dist” without the quotes into the “Arguments” field.  Click OK and close the Properties window.

Finally, right click on the base project folder and select “Build Project”, and Ant will assemble a JAR, ready for use in Elastic MapReduce.

Step 3 – Get an Amazon Web Services account (if you don’t have one already) and find your security credentials

If you don’t already have an account with Amazon Web Services, you can sign up for one at the following URL:

https://aws-portal.amazon.com/gp/aws/developer/registration/index.html

Once you’ve registered, visit the following page and copy down your Access Key ID and Secret Access Key:

https://aws-portal.amazon.com/gp/aws/developer/account/index.html?action=access-key

This information can be used by any Amazon Web Services client to authorize things that cost money, so be sure to keep this information in a safe place.

Step 4 – Upload the HelloWorld JAR to Amazon S3

Uploading the JAR we just built to Amazon S3 is a lot simpler than it sounds. First, visit the following URL:

https://console.aws.amazon.com/s3/home

Next, click “Create Bucket”, give your bucket a name, and click the “Create” button. Select your new S3 bucket in the left-hand pane, then click the “Upload” button, and select the JAR you just built. It should be located here:

<your checkout dir>/dist/lib/HelloWorld.jar

Step 5 – Create an Elastic MapReduce job based on your new JAR

Now that the JAR is uploaded into S3, all we need to do is to point Elastic MapReduce to it, and as it so happens, that’s pretty easy to do too! Visit the following URL:

https://console.aws.amazon.com/elasticmapreduce/home

and click the “Create New Job Flow” button. Give your new flow a name, and tick the “Run your own application” box. Select “Custom JAR” from the “Choose a Job Type” menu and click the “Continue” button.

The next field in the wizard will ask you which JAR to use and what command-line arguments to pass to it. Add the following location:

s3n://<your bucket name>/HelloWorld.jar

then add the following arguments to it:

org.commoncrawl.tutorial.HelloWorld <your aws secret key id> <your aws secret key> 2010/01/07/18/1262876244253_18.arc.gz s3n://<your bucket name>/helloworld-out

CommonCrawl stores its crawl information as GZipped ARC-formatted files (http://www.archive.org/web/researcher/ArcFileFormat.php), and each one is indexed using the following strategy:

/YYYY/MM/DD/the hour that the crawler ran in 24-hour format/*.arc.gz

Thus, by passing these arguments to the JAR we uploaded, we’re telling Hadoop to:

1. Run the main() method in our HelloWorld class (located at org.commoncrawl.tutorial.HelloWorld)

2. Log into Amazon S3 with your AWS access codes

3. Count all the words taken from a chunk of what the web crawler downloaded at 6:00PM on January 7th, 2010

4. Output the results as a series of CSV files into your Amazon S3 bucket (in a directory called helloworld-out)

Edit 12/21/11: Updated to use directory prefix notation instead of glob notation (thanks Petar!)

If you prefer to run against a larger subset of the crawl, you can use directory prefix notation to specify a more inclusive set of data. For instance:

2010/01/07/18 – All files from this particular crawler run (6PM, January 7th 2010)

2010/ – All crawl files from 2010

Don’t worry about the continue fields for now, just accept the default values. If you’re offered the opportunity to use debugging, I recommend enabling it to be able to see your job in action. Once you’ve clicked through them all, click the “Create Job Flow” button and your Hadoop job will be sent to the Amazon cloud.

Step 6 – Watch the show

Now just wait and watch as your job runs through the Hadoop flow; you can look for errors by using the Debug button. Within about 10 minutes, your job will be complete. You can view results in the S3 Browser panel, located here. If you download these files and load them into a text editor, you can see what came out of the job. You can take this sort of data and add it into a database, or create a new Hadoop OutputFormat to export into XML which you can render into HTML with an XSLT, the possibilities are pretty much endless.

Step 7 – Start playing!

If you find something cool in your adventures and want to share it with us, we’ll feature it on our site if we think it’s cool too. To submit a remix, push your codebase to GitHub or Gitorious and send a message to our user group about it: we promise we’ll look at it.