This is a guest blog post by Stephen Merity, a Computational Science and Engineering master's candidate at Harvard University. His graduate work centers around machine learning and data analysis on large data sets. Prior to Harvard, Stephen worked as a software engineer for Freelancer.com and as a software engineer for online education start-up Grok Learning. Stephen has a Bachelor of Information Technology (Honours First Class with University Medal) from the University of Sydney in Australia.
Wait, what's WAT, WET and WARC?
Recently CommonCrawl has switched to the Web ARChive (WARC) format. The WARC format allows for more efficient storage and processing of CommonCrawl's free multi-billion page web archives, which can be hundreds of terabytes in size.
This document aims to give you an introduction to working with the new format, specifically the difference between:
- WARC files which store the raw crawl data
- WAT files which store computed metadata for the data stored in the WARC
- WET files which store extracted plaintext from the data stored in the WARC
If you want all the nitty gritty details, the best source is the ISO standard, for which the final draft is available.
If you're more interested in diving into code, we've provided three introductory examples in Java that use the Hadoop framework to process WAT, WET and WARC.
WARC Format
The WARC format is the raw data from the crawl, providing a direct mapping to the crawl process. Not only does the format store the HTTP response from the websites it contacts (WARC-Type: response), it also stores information about how that information was requested (WARC-Type: request) and metadata on the crawl process itself (WARC-Type: metadata).
For the HTTP responses themselves, the raw response is stored. This includes not only the response body, which is what you would get if you downloaded the file, but also the HTTP header information, which can be used to glean a number of interesting insights.
In the example below, we can see the crawler contacted http://102jamzorlando.cbslocal.com/tag/nba/page/2/ and received an HTML page in response. We can also see the page was served from the nginx web server and that a special header has been added, X-hacker, purely for the purposes of advertising to a very specific audience of programmers who might look at the HTTP headers!
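To make the layout of a response record concrete, here is a minimal Python sketch that splits one record into its WARC headers, its HTTP headers and the page body. The sample record is illustrative only: the URL, the nginx server and the X-hacker header name come from the description above, while every other value is a placeholder. The function names are hypothetical, and the sketch assumes a single uncompressed record; real files are gzip-compressed and a robust reader should use the record's Content-Length rather than splitting on blank lines.

```python
def parse_header_block(block):
    """Turn 'Name: value' lines into a dict, skipping the first line
    (the WARC version line or the HTTP status line)."""
    headers = {}
    for line in block.split("\r\n")[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers


def split_warc_response(record):
    """Split one WARC response record into WARC headers, HTTP headers and body."""
    warc_block, _, rest = record.partition("\r\n\r\n")
    http_block, _, body = rest.partition("\r\n\r\n")
    return parse_header_block(warc_block), parse_header_block(http_block), body


# Illustrative record: only the URL, the server name and the X-hacker header
# name are taken from the text; all other values are placeholders.
SAMPLE_RECORD = (
    "WARC/1.0\r\n"
    "WARC-Type: response\r\n"
    "WARC-Target-URI: http://102jamzorlando.cbslocal.com/tag/nba/page/2/\r\n"
    "Content-Type: application/http; msgtype=response\r\n"
    "\r\n"
    "HTTP/1.0 200 OK\r\n"
    "Server: nginx\r\n"
    "Content-Type: text/html; charset=UTF-8\r\n"
    "X-hacker: (placeholder message aimed at programmers)\r\n"
    "\r\n"
    "<!DOCTYPE html><html><body>placeholder page</body></html>"
)

warc_headers, http_headers, body = split_warc_response(SAMPLE_RECORD)
print(warc_headers["WARC-Type"], warc_headers["WARC-Target-URI"])
print(http_headers["Server"])
```

The first blank line ends the WARC headers and the second ends the HTTP headers, which is why two successive splits are enough for this simple case.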
WAT Response Format
WAT files contain important metadata about the records stored in the WARC format above. This metadata is computed for each of the three types of records (metadata, request, and response). If the information crawled is HTML, the computed metadata includes the HTTP headers returned and the links (including the type of link) listed on the page.
This information is stored as JSON. To keep the file sizes as small as possible, the JSON is stored with all unnecessary whitespace stripped, resulting in a relatively unreadable format for humans. If you want to inspect the JSON file yourself, use one of the many JSON pretty print tools available.
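If no dedicated tool is at hand, Python's standard json module can do the pretty printing. The snippet below is a small sketch; the compact string is a made-up stand-in for one WAT record, and its key names are placeholders rather than a statement of the real schema.

```python
import json

# Placeholder for one whitespace-stripped WAT record.
compact = '{"Envelope":{"Format":"WARC","Payload-Metadata":{"Actual-Content-Type":"application/http; msgtype=response"}}}'

# Re-serialize with indentation so the nesting is visible.
pretty = json.dumps(json.loads(compact), indent=2, sort_keys=True)
print(pretty)
```

The same effect is available from a shell with `python -m json.tool`, which reads JSON on standard input and writes an indented copy.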
The HTTP response metadata is most likely to be of interest to CommonCrawl users. The skeleton of the JSON format is outlined below.
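As a rough illustration of how that nested metadata might be read, the Python sketch below pulls the target URL, the Server header and the links out of one record. The key names used here (Envelope, WARC-Header-Metadata, Payload-Metadata, HTTP-Response-Metadata, Headers, HTML-Metadata, Links, and the path and url fields of each link) are assumptions about the layout and should be checked against a real WAT file; the sample record and the function name are invented for the example.

```python
import json


def response_summary(wat_json):
    """Extract a few fields from one WAT record (key names are assumed)."""
    envelope = json.loads(wat_json).get("Envelope", {})
    response = envelope.get("Payload-Metadata", {}).get("HTTP-Response-Metadata", {})
    headers = response.get("Headers", {})
    links = response.get("HTML-Metadata", {}).get("Links", [])
    return {
        "url": envelope.get("WARC-Header-Metadata", {}).get("WARC-Target-URI"),
        "server": headers.get("Server"),
        # Each link is reduced to (type of link, destination).
        "links": [(link.get("path"), link.get("url")) for link in links],
    }


# Invented record that follows the assumed layout.
SAMPLE_WAT = json.dumps({
    "Envelope": {
        "WARC-Header-Metadata": {"WARC-Target-URI": "http://example.com/"},
        "Payload-Metadata": {
            "HTTP-Response-Metadata": {
                "Headers": {"Server": "nginx", "Content-Type": "text/html"},
                "HTML-Metadata": {
                    "Links": [{"path": "A@/href", "url": "http://example.com/about"}]
                },
            }
        },
    }
}, separators=(",", ":"))

summary = response_summary(SAMPLE_WAT)
print(summary)
```

Using `.get` with empty defaults means records that are not HTTP responses, such as request or metadata records, yield `None` and an empty link list instead of raising an error.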
WET Response Format
As many tasks only require textual information, the CommonCrawl dataset provides WET files that only contain extracted plaintext. The way in which this textual data is stored in the WET format is quite simple. The WARC metadata contains various details, including the URL and the length of the plaintext data, with the plaintext data following immediately afterwards.
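The header-then-plaintext layout can be sketched in a few lines of Python. This example assumes one uncompressed record held in memory as bytes, with the URL in a WARC-Target-URI header and the plaintext length, in bytes, in Content-Length. The sample record and the function name are invented for illustration.

```python
def read_wet_record(raw):
    """Return (url, plaintext) from one uncompressed WET record given as bytes."""
    header_block, _, rest = raw.partition(b"\r\n\r\n")
    headers = {}
    # Skip the first line, which is the WARC version line.
    for line in header_block.decode("utf-8").split("\r\n")[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    # Content-Length counts bytes, so slice before decoding.
    length = int(headers["Content-Length"])
    return headers["WARC-Target-URI"], rest[:length].decode("utf-8")


# Invented record; the length is computed so the header matches the payload.
plaintext = "Example page\nSome extracted text."
payload = plaintext.encode("utf-8")
SAMPLE_WET = (
    "WARC/1.0\r\n"
    "WARC-Type: conversion\r\n"
    "WARC-Target-URI: http://example.com/\r\n"
    "Content-Type: text/plain\r\n"
    f"Content-Length: {len(payload)}\r\n"
    "\r\n"
).encode("utf-8") + payload + b"\r\n\r\n"

url, text = read_wet_record(SAMPLE_WET)
print(url)
print(text)
```

Slicing by the declared length, rather than searching for the next blank line, keeps plaintext that itself contains blank lines intact.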
Processing the file format
The three introductory Java examples mentioned earlier cover:
- Counting the number of times various tags are used across HTML on the internet using the WARC files
- Counting the number of different server types found in the HTTP headers using the WAT files
- Word count over the extracted plaintext found in the WET files
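To give a feel for the first of these tasks without a Hadoop cluster, here is a single-machine Python sketch of tag counting. It is not the Java example itself: it uses the standard library's html.parser on two made-up pages, and the class and function names are hypothetical. In practice the HTML would come from the body of each WARC response record.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts every opening tag seen while parsing one page."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def count_tags(pages):
    """Sum per-page tag counts, much as a reducer would sum mapper output."""
    total = Counter()
    for html in pages:
        parser = TagCounter()
        parser.feed(html)
        parser.close()
        total.update(parser.counts)
    return total


# Made-up pages standing in for crawled HTML.
PAGES = [
    "<html><body><p>one</p><p>two</p></body></html>",
    "<html><body><a href='#'>link</a></body></html>",
]

tag_totals = count_tags(PAGES)
print(tag_totals.most_common())
```

Counting per page and then merging mirrors the map and reduce steps, so the same logic carries over to a distributed job where each worker handles a subset of records.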
If you're using a different language, there are a number of open source libraries that handle processing these WARC files and the content they contain. These include:
- Common Crawl's Example WARC (Java & Clojure)
- WARC-Mapreduce WET/WARC processor (Java & Clojure)
- Kevin Bullaughey’s WARC & WAT tools (Go)
- Hanzo Archive's Warc Tools (Python)
- IIPC’s Web Archive Commons library for processing WARC & WAT (Java)
- Internet Archive’s Hadoop tools for bridging WARC to Pig (Java)
If in doubt, the tools provided as part of the IIPC's Web Archive Commons library are the preferred implementation.