The Archives Unleashed Toolkit

Introduction

Internet Archive servers in San Francisco, photo by Ian Milligan.

The Archives Unleashed Toolkit is an open-source platform for managing web archives built on Hadoop. The platform provides a flexible data model for storing and managing raw content as well as metadata and extracted knowledge. Tight integration with Hadoop provides powerful tools for analytics and data processing via Spark.

Getting Started

Quick Start

If you don’t want to install all the dependencies locally, you can use docker-aut. You can run the bleeding edge version of aut with docker run --rm -it archivesunleashed/docker-aut or a specific version of aut, such as 0.17.0 with docker run --rm -it archivesunleashed/docker-aut:0.17.0. More information on using docker-aut, such as mounting your own data, can be found here.

Dependencies

For Mac OS: You can find information on Java here, or install with homebrew and then:

brew cask install java8

For Linux: You can install Java using apt:

apt install openjdk-8-jdk

Before Spark Shell can launch, JAVA_HOME must be set. If you receive an error that JAVA_HOME is not set, you need to point it to where Java is installed. On Linux, this might be export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 or on Mac OS it might be export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_74.jdk/Contents/Home.
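With JAVA_HOME set, you can launch the Spark shell with the AUT jar and a memory allocation along these lines (the jar path and version are placeholders for whatever you downloaded):

spark-shell --jars /path/to/aut-0.17.0-fatjar.jar --driver-memory 4G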

In the above case, you give Spark 4GB of memory to execute the program.

In some other cases, despite giving AUT sufficient memory, you may still encounter Java Heap Space issues. In those cases, it is worth trying to lower the number of worker threads. When running locally (i.e. on a single laptop, desktop, or server), by default AUT runs a number of threads equivalent to the number of cores in your machine.

On a 16-core machine, for example, you may want to drop down to 12 threads if you are having memory issues. This will increase stability but decrease performance a bit.

You can do so like this (example is using 12 threads on a 16-core machine):
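For example, you might launch the Spark shell with twelve local worker threads (the jar path is again a placeholder for wherever you saved the AUT fatjar):

spark-shell --master local[12] --driver-memory 4G --jars /path/to/aut-0.17.0-fatjar.jar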

If you continue to have errors, you may also want to increase the network timeout value. Once in a while, AUT can get stuck on an odd record and take longer than normal to process it. Passing --conf spark.network.timeout=10000000 ensures that AUT keeps working on that material, although it may take a while to process. This command then works:
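(The jar path below is a placeholder for wherever you saved the AUT fatjar.)

spark-shell --master local[12] --driver-memory 4G --conf spark.network.timeout=10000000 --jars /path/to/aut-0.17.0-fatjar.jar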

In the above example, """....""" declares that we are working with a regular expression (the triple quotes give a raw string, so backslashes do not need escaping), .r turns it into a regular expression, and .findAllIn looks for all matches in the URL. This will only return the first match, but that is generally good for our use cases. Finally, .toList turns it into a list so you can flatMap it.
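As a minimal standalone illustration of that idiom (the pattern and URL below are only examples):

// Triple quotes give a raw string, and .r compiles it into a scala.util.matching.Regex
val pattern = """http://www\.archive\.org/details/.*""".r

// .findAllIn scans the string for matches; .toList makes the result easy to flatMap over
val matches = pattern.findAllIn("http://www.archive.org/details/example").toList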

Plain Text Extraction

All plain text

This script extracts the crawl date, domain, URL, and plain text from HTML files in the sample ARC data (and saves the output to out/).
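A sketch of the script, using RecordLoader, keepValidPages, and RemoveHTML from the toolkit:

import io.archivesunleashed._
import io.archivesunleashed.matchbox._

RecordLoader.loadArchives("src/test/resources/arc/example.arc.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
  .saveAsTextFile("out/")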

If you wanted to use it on your own collection, you would change “src/test/resources/arc/example.arc.gz” to the directory with your own ARC or WARC files, and change “out/” on the last line to where you want to save your output data.

Note that this will create a new directory to store the output, which cannot already exist.

Plain text by domain

The following Spark script generates plain text renderings for all the web pages in a collection with a URL matching a filter string. In the example case, it will go through the collection and find all of the URLs within the “archive.org” domain.
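A sketch using keepDomains (the domain string should match how domains appear in your collection, e.g. archive.org or www.archive.org, and the output directory is a placeholder):

import io.archivesunleashed._
import io.archivesunleashed.matchbox._

RecordLoader.loadArchives("example.arc.gz", sc)
  .keepValidPages()
  .keepDomains(Set("archive.org"))
  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
  .saveAsTextFile("plain-text-domain/")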

Plain text by URL pattern

The following Spark script generates plain text renderings for all the web pages in a collection with a URL matching a regular expression pattern. In the example case, it will go through a WARC file and find all of the URLs beginning with http://archive.org/details/, and save the text of those URLs.
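A sketch using keepUrlPatterns (the pattern mirrors the example in the text; adjust it and the output directory for your own data):

import io.archivesunleashed._
import io.archivesunleashed.matchbox._

RecordLoader.loadArchives("example.warc.gz", sc)
  .keepValidPages()
  .keepUrlPatterns(Set("http://archive.org/details/.*".r))
  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
  .saveAsTextFile("details-text/")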

Plain text minus boilerplate

The following Spark script generates plain text renderings for all the web pages in a collection, minus “boilerplate” content: advertisements, navigational elements, and elements of the website template. For more on the boilerplate removal library we are using, please see this website and paper.
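A sketch, swapping RemoveHTML for the toolkit's ExtractBoilerpipeText wrapper around the boilerplate-removal library:

import io.archivesunleashed._
import io.archivesunleashed.matchbox._

RecordLoader.loadArchives("example.arc.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, ExtractBoilerpipeText(r.getContentString)))
  .saveAsTextFile("plain-text-no-boilerplate/")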

Plain text filtered by date

AUT permits you to filter records by a list of full or partial date strings. It conceives
of the date string as a DateComponent. Use keepDate to specify the year (YYYY), month (MM),
day (DD), year and month (YYYYMM), or a particular year-month-day (YYYYMMDD).

The following Spark script extracts plain text for a given collection by date (in this case, April 2008).
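A sketch, assuming keepDate takes a list of date strings plus a DateComponent as described above:

import io.archivesunleashed._
import io.archivesunleashed.matchbox._

RecordLoader.loadArchives("example.arc.gz", sc)
  .keepValidPages()
  .keepDate(List("200804"), ExtractDate.DateComponent.YYYYMM)
  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
  .saveAsTextFile("plain-text-200804/")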

Note: if you created just a dump of plain text using another one of the earlier commands, you do not need to go back and run this. You can instead use bash to extract a sample of text. For example, running this command on a dump of all plain text stored in alberta_education_curriculum.txt:
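For example, assuming each line of the dump begins with a tuple whose first field is the crawl date (as in the extraction scripts above), something along these lines would pull out the April 2008 records:

grep "^(200804" alberta_education_curriculum.txt > alberta_education_curriculum-200804.txt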

There is also discardContent, which does the opposite of keepContent: use it to drop pages containing a frequent keyword you are not interested in.
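For example, a sketch of dropping pages that contain a keyword (the regular expression and output directory here are just illustrations):

import io.archivesunleashed._
import io.archivesunleashed.matchbox._

RecordLoader.loadArchives("example.arc.gz", sc)
  .keepValidPages()
  .discardContent(Set("advertisement".r))
  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
  .saveAsTextFile("plain-text-no-advertisement/")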

Raw HTML Extraction

In most cases, users will be interested in working with plain text. In some cases, however, you may want to work with the actual HTML of the pages themselves (for example, looking for specific tags or HTML content).

The following script will produce the raw HTML of a WARC file. You can use the filters from above to filter it down accordingly by domain, language, etc.
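A sketch, identical to the plain-text scripts above except that the content string is kept as-is rather than being run through RemoveHTML:

import io.archivesunleashed._

RecordLoader.loadArchives("example.warc.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, r.getContentString))
  .saveAsTextFile("html-raw/")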

Named Entity Recognition

NER is Extremely Resource Intensive and Time Consuming

Named Entity Recognition is extremely resource intensive, and will take a very long time. Our recommendation is to begin testing NER on one or two WARC files, before trying it on a larger body of information. Depending on the speed of your system, it can take a day or two to process information that you are used to working with in under an hour.
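A sketch of what such a script might look like, assuming you have downloaded a Stanford NER classifier (english.all.3class.distsim.crf.ser.gz is used here as an example) and that your AUT release exposes an ExtractEntities helper in its app package; check the exact names against your version:

import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._

// Ship the classifier to every worker node
sc.addFile("/path/to/english.all.3class.distsim.crf.ser.gz")

// Run NER over the records and write (date, URL, entities) output to output-ner/
ExtractEntities.extractFromRecords("/path/to/english.all.3class.distsim.crf.ser.gz",
  "example.arc.gz", "output-ner/", sc)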

Note the call to addFile(). This is necessary if you are running this script on a cluster; it puts a copy of the classifier on each worker node. The classifier and input file paths may be local or on the cluster (e.g., hdfs:///user/joe/collection/).

The output of this script and the one below will consist of lines that look like this:

Analysis of Site Link Structure

Site link structures can be very useful, allowing you to learn such things as:

what websites were the most linked to;

what websites had the most outbound links;

what paths could be taken through the network to connect pages;

and what communities existed within the link structure.

Most of the following examples show domain-to-domain links. For example, you discover how many times liberal.ca linked to twitter.com, rather than learning that http://liberal.ca/contact linked to http://twitter.com/liberal_party. We do this because, when working with data at scale, the sheer number of raw URLs can quickly become overwhelming.

We do provide one example below that provides raw data, however.

Extraction of Simple Site Link Structure

If your web archive does not have a temporal component, the following Spark script will generate the site-level link structure.
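A sketch along these lines, using ExtractLinks and ExtractDomain from the matchbox package (the count threshold of 5 and the output directory are arbitrary choices):

import io.archivesunleashed._
import io.archivesunleashed.matchbox._

val links = RecordLoader.loadArchives("example.arc.gz", sc)
  .keepValidPages()
  .flatMap(r => ExtractLinks(r.getUrl, r.getContentString))
  .map(link => (ExtractDomain(link._1).replaceAll("^\\s*www\\.", ""),
                ExtractDomain(link._2).replaceAll("^\\s*www\\.", "")))
  .filter(link => link._1 != "" && link._2 != "")
  .countItems()
  .filter(linkCount => linkCount._2 > 5)

links.saveAsTextFile("links-all/")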

Note also that ExtractLinks takes an optional third parameter of a base URL. If you set this – typically to the source URL – ExtractLinks will resolve a relative path to its absolute location. For example, if val url = "http://mysite.com/a/b/c/index.html" and val html = "... <a href='../contact/'>Contact</a> ...", and we call ExtractLinks(url, html, url), the list it returns will include the item (http://mysite.com/a/b/c/index.html, http://mysite.com/a/b/contact/, Contact). It may be useful to have this absolute URL if you intend to call ExtractDomain on the link and wish it to be counted.

Exporting as TSV

Archive records are represented in Spark as tuples,
and this is the standard format of results produced by most of the scripts presented here
(e.g., see above). It may be useful, however, to have this data in TSV (tab-separated value)
format, for further processing outside AUT. The following script uses tabDelimit (from
TupleFormatter) to transform tuples to tab-delimited strings; it also flattens any
nested tuples. (This is the same script as at the top of the page, with the addition of the
third and the second-last lines.)
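A sketch, mirroring the all-plain-text script above, with the TupleFormatter import added as the third line and the tabDelimit call as the second-last:

import io.archivesunleashed._
import io.archivesunleashed.matchbox._
import io.archivesunleashed.matchbox.TupleFormatter._

RecordLoader.loadArchives("src/test/resources/arc/example.arc.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
  .map(tabDelimit(_))
  .saveAsTextFile("plain-text-tsv/")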

Twitter Analysis

AUT also supports parsing and analysis of large volumes of Twitter JSON, allowing you to work with social media and web archives together on one platform. This support is under active development; if you have any suggestions or want more features, feel free to pitch in at our AUT repository.

Gathering Twitter JSON Data

To gather Twitter JSON, you will need to use the Twitter API to gather information. We recommend twarc, a “command line tool (and Python library) for archiving Twitter JSON.” Nick Ruest and Ian Milligan wrote an open-access article on using twarc to archive an ongoing event, which you can read here.

For example, with twarc, you could begin using the search API (which reaches back somewhere between six and nine days) on the #elxn42 hashtag with:

twarc.py --search "#elxn42" > elxn42-search.json

Or you could use the streaming API with:

twarc.py --stream "#elxn42" > elxn42-stream.json

Functionality is similar to other parts of AUT, but note that you use loadTweets rather than loadArchives.
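A minimal sketch, assuming loadTweets takes a path to the line-oriented tweet JSON produced by twarc plus the SparkContext:

import io.archivesunleashed._

val tweets = RecordLoader.loadTweets("elxn42-search.json", sc)

// How many tweets did we collect?
tweets.count()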

Basic Twitter Analysis

With the ensuing JSON file (or directory of JSON files), you can use the following scripts. Here we’re using the “top ten”, but you can always save all of the results to a text file if you desire.
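Some sketches of the kind of "top ten" counts you can produce, assuming the TweetUtils field helpers (lang, username, and so on) available in your AUT version:

import io.archivesunleashed._
import io.archivesunleashed.util.TweetUtils._

val tweets = RecordLoader.loadTweets("elxn42-search.json", sc)

// Top ten languages tweeted in
tweets.map(tweet => tweet.lang).countItems().take(10)

// Top ten most active users
tweets.map(tweet => tweet.username).countItems().take(10)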

Parsing JSON

What if you want to do more and access more data inside tweets? Tweets are just JSON objects; see examples here and here. Twitter has detailed API documentation that tells you what all the fields mean.

The Archives Unleashed Toolkit internally uses
json4s to access fields in
JSON. You can manipulate fields directly to access any part of tweets.
Here are some examples:
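A couple of sketches, assuming loadTweets yields json4s JValue objects, so the standard \ selector, render, and compact work on each tweet:

import io.archivesunleashed._
import org.json4s._
import org.json4s.jackson.JsonMethods._

val tweets = RecordLoader.loadTweets("elxn42-search.json", sc)

// Pull a top-level field out of each tweet
tweets.map(json => compact(render(json \ "created_at"))).take(5)

// Reach into a nested object, e.g. the user's screen name
tweets.map(json => compact(render(json \ "user" \ "screen_name"))).take(5)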

DataFrames

We are currently developing support for DataFrames. As this work is still in progress, syntax may change. We have an open thread in our GitHub repository if you would like to add any suggestions, thoughts, or requests for this functionality.

Note that we do not yet support everything in DataFrames: plain text extraction, named entity recognition, and Twitter analysis are not yet available.

Here we provide some documentation on how to use DataFrames in AUT.

List of Domains

As with the RDD implementation, the first stop is often to work with the frequency of domains appearing within a web archive. You can see the schema that you can use when working with domains by running the following script:
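A sketch, assuming the DataFrame helpers in the io.archivesunleashed.df package:

import io.archivesunleashed._
import io.archivesunleashed.df._

val df = RecordLoader.loadArchives("example.arc.gz", sc)
  .extractValidPagesDF()

df.printSchema()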

Saving Images to Disk

You may want to save the images to work with them on your own file system. The following command will save the images from an ARC or WARC. Note that the trailing / is important for the saveToDisk command below. Without it, files will be saved with the prefix provided after the last / in the string.
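A sketch of what this might look like, assuming an image-details DataFrame and a saveToDisk helper along the lines of recent AUT releases; the method name, its signature, and the column names should be checked against your version:

import io.archivesunleashed._
import io.archivesunleashed.df._

val df = RecordLoader.loadArchives("example.arc.gz", sc)
  .extractImageDetailsDF()

// Writes files such as prefix-<hash>.jpg into /path/to/export/directory/
df.select($"bytes")
  .saveToDisk("bytes", "/path/to/export/directory/prefix")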

For example, this would generate files such as prefix-c7ee6d7c17045495e.jpg and prefix-a820ac93e2a000c9d.gif in the /path/to/export/directory/ directory.