Investigator(s)

Dataset

Platform

Purpose of this experiment

The purpose of this experiment is to test the performance of two different approaches to implementing a large-scale ARC to WARC migration workflow.

The first approach (ARC2WARC-HDP) is a native Map/Reduce application that uses a Hadoop InputFormat for reading ARC files and a Hadoop OutputFormat for writing WARC files. The second approach (ARC2WARC-TOMAR) is a Java-based command line executable which transforms ARC files directly to WARC files and uses the SCAPE tool ToMaR to make this process scalable.

The main question this experiment should help to answer is whether a native Map/Reduce implementation has a significant performance advantage over using ToMaR with an underlying command line tool execution.

The Hadoop version has an important limitation: a transformation based on a native Map/Reduce implementation requires a Hadoop representation of a web archive record. This is the intermediate representation that sits between reading the records from the ARC files and writing the records to WARC files. Because it stores the record payload content in a byte array field, and Java byte arrays are limited to a length near Integer.MAX_VALUE, there is a theoretical payload size limit of around 2 GB. In practice, the payload size limit will be much lower, depending on the hardware setup and configuration of the cluster.
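The byte-array limit can be made concrete with a minimal sketch (the class name is illustrative, not taken from the hawarp code): Java arrays are indexed by int, so a byte[] payload field can never hold more than Integer.MAX_VALUE bytes, i.e. just under 2 GiB.

```java
// Illustrative in-memory web archive record, as described above:
// the payload is held in a byte array field.
class InMemoryRecord {
    // Java arrays are indexed by int, so a byte[] can hold at most
    // Integer.MAX_VALUE (2^31 - 1) bytes -- just under 2 GiB.
    static final long MAX_PAYLOAD_BYTES = Integer.MAX_VALUE;

    private final byte[] payload;

    InMemoryRecord(byte[] payload) {
        this.payload = payload;
    }

    int size() {
        return payload.length;
    }

    public static void main(String[] args) {
        System.out.println("Theoretical payload limit: " + MAX_PAYLOAD_BYTES
                + " bytes (~" + (MAX_PAYLOAD_BYTES / (1024L * 1024 * 1024)) + " GiB)");
    }
}
```

Any record whose payload exceeds this limit cannot be represented at all, which is why an alternative path for large records would be needed.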

The performance advantage needs to be "significant" because a native Map/Reduce implementation cannot process container files that have large record payload content, so an alternative solution would be needed for those cases. Such a separation between "small" and "large" containers would bring further difficulties, especially when contextual information must be included in the migration process, such as crawl information that relates to a set of container files and must be available during the migration.

The implementations used to do the migration are proof-of-concept tools; they are not intended to run a production migration at this stage. This entails the following limitations:

As already mentioned, there is a file size limit on the in-memory representation of a web archive record. The largest ARC file in the data sets used in these experiments is around 300 MB, so record payload content can easily be stored in byte array fields.

Exceptions are caught and logged, but processing errors and other analytic results are not gathered. As the focus here is on performance evaluation, details of the record processing are not taken into consideration.

The current implementations do not include any quality assurance, such as comparing digest information of the payload content, or running rendering tests and taking snapshots that can be compared.

Contextual information is not taken into consideration. From a long-term preservation perspective, a real benefit of the ARC to WARC transformation would be to include contextual information, such as information from the crawl log files, so that the WARC files resulting from the migration would be the only files that need to be preserved.

Reading web archive content is a significant effort in terms of the amount of data that must be read, and possibly first transferred into a cluster or cloud environment. It therefore makes sense to take the opportunity to run other processes at the same time. For that reason, the proof-of-concept implementations include Apache Tika as an example of such a process, performing payload content identification as an optional feature. All Hadoop job executions are tested with and without payload content identification enabled.
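The idea of identifying each record's payload while it is already in memory can be sketched as follows. The real workflow calls the Apache Tika detector; here the JDK's built-in content sniffer stands in so the sketch is self-contained, and the class and method names are illustrative only.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URLConnection;

// Sketch of per-record payload identification. The actual
// implementation uses the Apache Tika detector call instead of the
// JDK's URLConnection-based sniffer used here for self-containment.
class PayloadIdentifier {

    static String detect(byte[] payload) {
        try {
            // guessContentTypeFromStream needs mark/reset support,
            // which ByteArrayInputStream provides.
            String mime = URLConnection
                    .guessContentTypeFromStream(new ByteArrayInputStream(payload));
            return mime != null ? mime : "application/octet-stream";
        } catch (IOException e) {
            throw new UncheckedIOException(e); // cannot occur for in-memory data
        }
    }

    public static void main(String[] args) {
        // PNG magic bytes
        byte[] png = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
        System.out.println(detect(png));
    }
}
```

Because the payload is already buffered for the migration itself, identification adds only CPU work, not extra I/O, which is exactly the motivation stated above.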

The proof-of-concept implementations use the Java Web Archive Toolkit (JWAT) for reading web archive ARC container files and for iterating over the records.

Workflows

The baseline evaluation was done by executing the hawarp/arc2warc-migration-cli Java application from the command line on one worker node of the cluster; it serves as the point of reference for the distributed processing (column "Metric baseline" in the evaluation tables).

Without Apache Tika payload identification the command used was:

And including Apache Tika payload identification the command used was (flag -p):

ARC2WARC-HDP Workflow

The Hadoop job evaluation was done using the hawarp/arc2warc-migration-hdp executable jar:

To run the workflow with Apache Tika payload content identification, the "-p" flag is used:

The ARC2WARC-HDP workflow is based on a Java implementation using a Hadoop InputFormat for reading ARC files.

To iterate over all items inside the web archive ARC container files, the native Java Map/Reduce program uses a custom RecordReader based on the Hadoop 0.20 API. The custom RecordReader enables the program to read the records natively and iterate over the archive file record by record. One ARC file is processed per map task, and one WARC file is produced as output of the map phase. The implementation does not use a reducer; this way, the WARC output is not aggregated and one WARC file is created per ARC input file.
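The record-by-record iteration performed by the custom RecordReader can be sketched in plain Java. To keep the sketch runnable without Hadoop or JWAT on the classpath, the container format is simplified here to length-prefixed records; the real reader parses actual ARC record headers via JWAT, and all names below are illustrative.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Sketch of iterating over a container file record by record, the way
// the custom RecordReader feeds one record at a time to the mapper.
class RecordIteration {

    // Reads the next record payload, or returns null at end of container.
    static byte[] nextRecord(DataInputStream in) throws IOException {
        int length;
        try {
            length = in.readInt();
        } catch (EOFException eof) {
            return null; // no more records
        }
        byte[] payload = new byte[length];
        in.readFully(payload);
        return payload;
    }

    // Iterates over one container, as a single map task does with one ARC file.
    static List<byte[]> readAll(byte[] container) {
        List<byte[]> records = new ArrayList<>();
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(container));
            byte[] rec;
            while ((rec = nextRecord(in)) != null) {
                records.add(rec); // in the real job: one map() call per record
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    // Builds a toy length-prefixed container from the given payloads.
    static byte[] container(byte[]... payloads) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            for (byte[] p : payloads) {
                out.writeInt(p.length);
                out.write(p);
            }
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because each map task consumes exactly one container and there is no reduce phase, the record stream is never shuffled or aggregated, matching the one-WARC-per-ARC behaviour described above.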

The custom record reader of the Hadoop input format uses the Java Web Archive Toolkit (JWAT) for reading the web archive ARC container files. The ARC records are converted to a Hadoop record (FlatListRecord), the internal representation implementing the Writable or WritableComparable interface in order to provide a serializable key-value object.

Using a Hadoop output format, the records are then written out to HDFS. The workflow is implemented as a native Java Map/Reduce application and can make use of the Apache Tika™ 1.0 API (detector call) to detect the MIME type of the payload content of the records.