Capturing & Archiving Web Pages: UT-Austin Library’s Web Clipper

The following was co-authored with Kevin Wood at the University of Texas Libraries at Austin. The post describes a promising experimental archiving strategy that the UT Libraries is developing for harvesting and preserving primary resources from the Web. Special thanks to Kevin for contributing his expertise and time by co-authoring this post.

–Sarah

University of Texas Libraries-Austin’s Web Clipper Project for Human Rights

Developer: Kevin Wood

Example of a Web page clipped from the web for archiving as a primary resource. Image: Kevin Wood, University of Texas Libraries-Austin

Background

In July 2008, the University of Texas Libraries received a grant from the Bridgeway Foundation to support efforts to collect and preserve fragile records (records that are at risk of destruction either from environmental conditions or human activity) of human rights conflicts and genocide. These funds are helping the library to develop new means for collecting and cataloguing “fragile or transient Web sites of human rights advocacy and genocide watch” — sites that matter because the internet has become a primary means both for distributing information and misinformation about human rights abuses and for documenting human rights events. These fragile Web sites thus become valuable primary resources for survivors, scholars, and activists as they pursue their work in human rights (see the library’s grant announcement for details).

Harvesting Web Sites for Archiving

In their first attempt to establish a reliable means of harvesting Web sites for preservation, archivists at the University of Texas Libraries used Zotero, a free Firefox extension that allows users to collect, manage, and cite online resources for research. The program allows users to capture copies of webpages and catalog them in a bibliographic program that functions much like EndNote or Bookends. Archivists at the University of Texas planned to use the program to pull specific documentation of human rights events off of the internet and then submit the collected pages to their institutional repository for cataloging and preservation. However, Zotero did not meet their needs. Zotero is geared toward individual work from a desktop: when it harvests a page, it rewrites links to be relative to the individual’s desktop rather than saving the original links as they are built into the webpage of interest. For archiving and preservation this is problematic, because it calls into question the authenticity of the captured pages. Zotero can be made to keep the original links, but it was not designed to do so, which makes this a cumbersome process; and as Zotero continues to evolve toward the needs of individual users, that work-around becomes ever more difficult to maintain.
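To make the link-rewriting problem concrete, here is a minimal hypothetical sketch. The URLs and the rewriting rule are illustrative assumptions, not Zotero's actual behavior; the point is that an archival clipper must keep the captured markup byte-for-byte intact, while a desktop-oriented tool may rewrite links for local browsing.

```python
# Illustrative sketch only: URLs and the rewriting rule are invented.
from html.parser import HTMLParser

original = '<a href="http://example.org/report.html">Report</a>'

class LinkExtractor(HTMLParser):
    """Collect the href values of anchor tags in a snippet of HTML."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

# A desktop-oriented tool might rewrite the link to a local copy,
# altering the page and undermining its value as evidence:
rewritten = original.replace("http://example.org/report.html",
                             "files/report.html")

# An archival clipper stores the page unmodified, original links intact:
archived = original

parser = LinkExtractor()
parser.feed(archived)
# The archived copy still points at the original source.
```

Because the archived copy is identical to what was published, anyone examining it later can verify where each link actually pointed at capture time.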

The solution to this problem is the in-house creation of a custom web clipper that harvests pages without modifying them. It functions as a Firefox plug-in and was built from the ground up, borrowing heavily from open source programs that already provide some of the functionality the libraries’ human rights archiving requires. The designer wants to keep the coding footprint of the web clipper as small as possible to minimize the deployment and maintenance burden. Therefore, the main logic of the clipper will be hosted on a server and accessed from individual machines or terminals through web services. Eventually, this will allow patrons to use the clipper as a harvesting tool from anywhere in the library system. The goal is to centralize the clipping process as much as possible without needing to customize individual machines, thus streamlining collection, cataloging, and preservation.
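The thin-client design described above can be sketched roughly as follows. The request fields and structure here are assumptions for illustration, not the clipper's actual API: the plug-in's only job is to gather the unmodified page and the user's metadata and hand everything to the central service, so nothing on the individual machine needs customization.

```python
# Hypothetical sketch of the clipper's client/server split.
# Field names and payload shape are illustrative assumptions.
import json

def build_clip_request(page_url, page_html, metadata):
    """Package a clipped page for submission to the central clipper service.

    The client does no processing of its own: the markup is passed
    through verbatim, links intact, and all real logic lives server-side.
    """
    return {
        "url": page_url,        # original location, preserved verbatim
        "content": page_html,   # unmodified markup, links intact
        "metadata": metadata,   # descriptive fields entered by the user
    }

payload = json.dumps(build_clip_request(
    "http://example.org/advocacy/statement.html",
    "<html>...</html>",
    {"title": "Advocacy statement", "tags": ["human rights"]},
))
```

Keeping the client this thin is what lets the clipper be deployed to any machine in the library system without per-machine setup.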

The prototype clipper is currently housed on two computers at the library in Austin, and graduate research assistants are actively clipping web pages for archiving. As they clip a page (see the image above for an example), users enter metadata in predetermined fields and then assign descriptive terms as tags for subject and content cataloging. Users can either select from a thesaurus of human rights terms (they are beginning with the thesaurus from WITNESS and extending it with terms as appropriate) or assign arbitrary keywords. Though users have complete control over clipping, documenting, and tagging a Web page, a moderator or manager determines whether new terms should be added to the thesaurus.
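The tagging workflow above could be modeled along these lines. The term names and function are hypothetical illustrations: tags already in the controlled thesaurus are accepted directly, while any new keyword is queued for the moderator's review rather than entering the vocabulary automatically.

```python
# Hypothetical sketch of controlled-vocabulary tagging with moderation.
# The thesaurus terms here are invented examples.
thesaurus = {"genocide", "forced displacement", "advocacy"}

def classify_tags(tags, thesaurus):
    """Split a user's tags into accepted terms and terms needing review.

    Users may assign any keyword, but only a moderator can promote a
    new keyword into the shared thesaurus.
    """
    accepted = [t for t in tags if t in thesaurus]
    pending_review = [t for t in tags if t not in thesaurus]
    return accepted, pending_review

accepted, pending = classify_tags(["advocacy", "truth commission"],
                                  thesaurus)
# "advocacy" is in the thesaurus; "truth commission" awaits moderation.
```

Note that, as the post describes, the clipped page itself is archived either way; moderation affects only whether the new term joins the shared vocabulary.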

Regardless of whether a new term makes it into the thesaurus, the pages clipped by users are stored in the archive. Once items are clipped and tagged with descriptive terms, they are ingested into the UT Libraries’ institutional repository, which is based on DSpace. Metadata are stored in the repository with a link to a local instance of Internet Archive’s Wayback Machine. The archived copies appear exactly as the pages did when the material was first clipped and submitted for preservation, thus maintaining their value as primary resources.
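A repository record in this arrangement might look roughly like the sketch below. The field names, base URL, and timestamp format are assumptions for illustration, not the actual DSpace configuration: descriptive metadata lives in the repository, while the archived page itself is served by the local Wayback Machine instance.

```python
# Hypothetical sketch of a repository record linking metadata to an
# archived copy.  Field names and the URL pattern are assumptions.
def make_repository_record(original_url, clip_timestamp, metadata,
                           wayback_base="http://wayback.lib.example.edu"):
    """Build a Dublin Core-style record with a pointer into the local
    Wayback instance, which replays the page exactly as captured."""
    return {
        "dc.title": metadata["title"],
        "dc.subject": metadata["tags"],
        "dc.identifier.uri": original_url,
        # Resolves to the page as it appeared at clip_timestamp:
        "archived_copy": f"{wayback_base}/{clip_timestamp}/{original_url}",
    }

record = make_repository_record(
    "http://example.org/advocacy/statement.html",
    "20090101120000",
    {"title": "Advocacy statement", "tags": ["human rights"]},
)
```

Separating the catalog record from the replayed copy in this way means the repository can describe and cite the resource while the Wayback instance guarantees it renders as it did on the day it was clipped.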