Download Your Archived Web Data with WASAPI

WASAPI stands for Web Archiving Systems API (application programming interface). It was originally developed as part of the IMLS-funded initiative "Systems Interoperability and Collaborative Development for Web Archiving." Archive-It implemented the WASAPI specification to create an API and related tooling that facilitate the programmatic transfer of archived web data between Archive-It and local systems. Partners can use WASAPI to query Archive-It collections for a list of preservation WARC files and their associated metadata, and to request the creation of derivative datasets intended to enable research and computational analysis of web archives.

Using WASAPI generally requires some prior knowledge of the command line. Some WASAPI functions can be performed in a browser (checking a running job's status and downloading the requested files), but far more functionality is available by using the command line or scripts to interact with the API.

Partners frequently use WASAPI to request WARCs (and related datasets) that fit specific criteria, such as file name, file type, collection, crawl, or time period. For example, you could request a list of all WARC files from collection 123 crawled between May 1, 2017 and May 1, 2018. The response to this request would list the relevant files, their metadata, and their download locations.
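As a sketch, such a request could be issued with curl. The endpoint path below matches the Archive-It WASAPI implementation, but the crawl-time-after/crawl-time-before parameter names are assumptions drawn from the WASAPI specification, so verify them against your account's documentation. USERNAME, PASSWORD, and collection 123 are placeholders:

```shell
# Hypothetical sketch: query WASAPI for WARCs from collection 123 crawled
# between May 1, 2017 and May 1, 2018.
BASE="https://partner.archive-it.org/wasapi/v1/webdata"
QUERY="collection=123&crawl-time-after=2017-05-01&crawl-time-before=2018-05-01"

# The live request (commented out here, since it needs valid credentials):
#   curl -s -u USERNAME:PASSWORD "$BASE?$QUERY"

# Print the assembled request URL for inspection:
echo "$BASE?$QUERY"
```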

Download WARCs via the command line

In general, a software developer or engineer will be able to work with the WASAPI API to create a script or utility for programmatically downloading WARC files; sample utilities from Stanford and UNT are in the GitHub project. This section covers downloading WARCs via basic command-line scripts only. Because WASAPI provides information in JSON, downloading WARCs is a two-step process: first filter the response information to find the specific file locations, then run a command to download that list via the wget tool. The filtering step uses jq, a JSON processor, which can be downloaded from any package manager or installer, such as Homebrew.

The first command below uses jq to extract the WARC file locations, and the second command downloads them to a directory of your choosing.

This example could be used to download all WARCs from a collection:

To filter the WARCs, enter the following command via the command line, making sure to amend the username, password, and collection to your own information:
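A sketch of this step is below. The response shape assumed here (a "files" array whose entries carry a "locations" list) follows the WASAPI specification, and the sample JSON is illustrative only; against the live API you would pipe curl output straight into jq, as shown in the commented-out line:

```shell
# Illustrative sample of a WASAPI response (not real data):
cat > sample.json <<'EOF'
{"files": [
  {"filename": "EXAMPLE-123-1.warc.gz",
   "locations": ["https://warcs.archive-it.org/webdatafile/EXAMPLE-123-1.warc.gz"]},
  {"filename": "EXAMPLE-123-2.warc.gz",
   "locations": ["https://warcs.archive-it.org/webdatafile/EXAMPLE-123-2.warc.gz"]}
]}
EOF

# Live equivalent (USERNAME, PASSWORD, and collection 123 are placeholders):
#   curl -s -u USERNAME:PASSWORD \
#     "https://partner.archive-it.org/wasapi/v1/webdata?collection=123" \
#     | jq -r '.files[].locations[0]' > url.list

# Extract the first download location of each file, one URL per line:
jq -r '.files[].locations[0]' sample.json > url.list
```

After this step, url.list holds one download URL per line, ready to hand to wget.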

Then, to download the WARCs, run the below command, making sure to amend the username and password to your own information:

wget --user=USERNAME --password=PASSWORD --accept txt,gz -i url.list

Request derivative dataset files

This is a basic example of submitting a request for the creation of derivative dataset files from a set of WARC files. This example requests WAT files from all collections in your web archive for the period between May 10, 2016 and May 12, 2017. Be sure to amend the username and password to your own information, and adjust the crawl dates to your specific use case:
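As a rough sketch, such a request POSTs to a jobs endpoint with a function naming the derivative type and a query restricting the WARCs. Everything below (the jobs path, function=build-wat, and the crawl-time-after/crawl-time-before query fields) is an assumption drawn from the WASAPI specification draft, to be verified against current Archive-It documentation before use:

```shell
# Hypothetical sketch of a derivative-dataset job request (WAT files from
# WARCs crawled between 2016-05-10 and 2017-05-12).
ENDPOINT="https://partner.archive-it.org/wasapi/v1/jobs"
DATA="function=build-wat&query=crawl-time-after=2016-05-10,crawl-time-before=2017-05-12"

# Live request (commented out here; it needs valid credentials):
#   curl -s -u USERNAME:PASSWORD -d "$DATA" "$ENDPOINT"

# Print the request for inspection:
echo "POST $ENDPOINT $DATA"
```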

This command submits a job for the datasets to be created. Once a job is running, you can check its status via the browser or the command line. The state will be queued, running, complete, or failed. Once the job is complete, you will receive an email letting you know that the files are available. Once available, they will be listed in the API and can be downloaded using commands similar to those for downloading WARC files.
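Checked from the command line, a status query might look like the following sketch. The job id (42) and the response document are illustrative only; the /jobs/<id> path and the "state" field are assumptions based on the WASAPI specification:

```shell
# Illustrative job-status document (not real data):
cat > job.json <<'EOF'
{"id": 42, "function": "build-wat", "state": "complete"}
EOF

# Live equivalent (commented out; 42 is a hypothetical job id):
#   curl -s -u USERNAME:PASSWORD \
#     "https://partner.archive-it.org/wasapi/v1/jobs/42" | jq -r '.state'

# Pull out just the job state:
jq -r '.state' job.json
```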