Using Wayback Machine for Research

The following is a guest post by Nicholas Taylor, Information Technology Specialist for the Repository Development Group at the Library of Congress.

Prompted by questions from Library of Congress staff on how to more effectively use web archives to answer research questions, I recently gave a presentation on “Using Wayback Machine for Research” (PDF). I thought that readers of The Signal might be interested in this topic as well. This post covers the outline of the presentation.

While the Internet Archive has been primarily responsible for the development of Wayback Machine, it is an open source project. The Internet Archive also devised the name “Wayback Machine”; it is a reference to The Rocky & Bullwinkle Show’s homophonous “WABAC” Machine, a time machine itself named in the convention of mid-century mainframe computers (e.g., ENIAC, UNIVAC, MANIAC). The contemporary Wayback Machine thus appropriately evokes both the idea of traveling back in time and the powerful computing technology that web archiving requires.

The Internet Archive’s Wayback Machine is just one installation among many, however; over half of the web archiving initiatives listed on Wikipedia provide access via Wayback Machine. It is the most common software used to “replay” the contents of ISO-standard Web ARChive (WARC) file containers.

Wayback Machine performs this feat by dynamically rewriting the links it encounters on archived webpages so that they point to other resources in the archive. It does an admirable job of this, but given how much websites vary, it may have trouble replaying particular webpages or webpage elements. JavaScript-driven features, for example, are especially problematic.
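To make the link-rewriting idea concrete, here is a minimal sketch in Python. It is not Wayback Machine's actual implementation; the function name, the regular expression, and the replay prefix (modeled on the Internet Archive's public instance) are assumptions chosen for illustration.

```python
import re
from urllib.parse import urljoin

# Assumed replay prefix, modeled on the Internet Archive's public instance.
ARCHIVE_PREFIX = "https://web.archive.org/web"


def rewrite_links(html: str, page_url: str, timestamp: str) -> str:
    """Rewrite href/src attributes so they point into the archive.

    Illustrative only: a real replay tool handles many more cases
    (CSS, srcset, URLs assembled by scripts, and so on).
    """
    def _rewrite(match: re.Match) -> str:
        attr, quote, target = match.group(1), match.group(2), match.group(3)
        # Resolve relative links against the URL of the archived page.
        absolute = urljoin(page_url, target)
        return f"{attr}={quote}{ARCHIVE_PREFIX}/{timestamp}/{absolute}{quote}"

    # Match quoted href="..." and src="..." attribute values.
    return re.sub(r'\b(href|src)=(["\'])(.*?)\2', _rewrite, html)


page = '<a href="/about.html">About</a><img src="http://example.com/logo.png">'
rewritten = rewrite_links(page, "http://example.com/index.html", "20120101000000")
```

A sketch like this only sees links written directly into the HTML. A URL that a script builds at run time never passes through the rewriter, which is one reason JavaScript-driven features tend to replay poorly.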

Understanding the basic mechanics of Wayback Machine makes it easier to navigate within a web archive. For example, the URL can be modified to request particular resources, show the time coverage for particular resources in the archive, or show all archived resources from a particular domain. Since Wayback Machine can only replay specifically requested URLs, it is difficult to access past versions of a webpage if that webpage changed URLs at some point and there was no redirect in place.
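The three kinds of URL modification described above can be sketched as small helper functions. The patterns follow the Internet Archive's public Wayback Machine; other installations may use a different base path, and the function names are hypothetical.

```python
from datetime import datetime

# Base path of the Internet Archive's public instance (other archives differ).
WAYBACK = "https://web.archive.org/web"


def snapshot_url(url: str, when: datetime) -> str:
    """Request a capture of `url` near `when` (14-digit YYYYMMDDhhmmss)."""
    return f"{WAYBACK}/{when.strftime('%Y%m%d%H%M%S')}/{url}"


def coverage_url(url: str) -> str:
    """Show the time coverage (all capture dates) for one URL."""
    return f"{WAYBACK}/*/{url}"


def domain_listing_url(domain: str) -> str:
    """List archived resources whose URLs begin with `domain`."""
    return f"{WAYBACK}/*/{domain}/*"


print(snapshot_url("http://www.loc.gov/", datetime(2005, 1, 1)))
print(coverage_url("http://www.loc.gov/"))
print(domain_listing_url("loc.gov"))
```

The timestamp does not have to match a capture exactly; the archive typically redirects to the closest capture it holds. Note that all three helpers start from a known URL or domain, which is why a page that moved without a redirect is hard to trace.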

The presentation offers a couple of examples of how these basic techniques could be used to find specific information in a web archive. The first example explores a strategy for finding a webpage whose historical URL is unknown by navigating to another webpage in the archive that is likely to link to it. The second example demonstrates that the conceptual organization of websites persists longer than their precise URL structure. This trend can be used to access content that was previously publicly available but has since been moved to a private section of a website.

Of course, it may not even be necessary to consult web archives in the first place. Recent research (PDF) suggests that ostensibly missing resources on the live web have more often been moved than removed. The Synchronicity Firefox add-on, based on technology from the NDIIPP-funded Memento project, leverages web archives to help find the resource’s new location. If that fails, the MementoFox Firefox add-on can help to find the web archive with the best coverage for the desired resource and time range.
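For readers curious about what a Memento-style lookup involves, the sketch below prepares (but does not send) a request that asks for a resource as it existed at a given time, using the Accept-Datetime header defined by the Memento protocol. This is not how the add-ons themselves are implemented; the function name is hypothetical and the TimeGate address is a placeholder, not a real service.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request


def timegate_request(timegate_base: str, url: str, when: datetime) -> Request:
    """Build a HEAD request asking a Memento TimeGate for `url` as of `when`.

    `timegate_base` is supplied by the caller; the value used below is a
    placeholder. Pass a timezone-aware datetime, since a naive one is
    interpreted as local time by astimezone().
    """
    # Accept-Datetime uses the HTTP date format, always expressed in GMT.
    stamp = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return Request(
        timegate_base + url,
        headers={"Accept-Datetime": stamp},
        method="HEAD",
    )


req = timegate_request(
    "https://timegate.example.org/timegate/",  # hypothetical TimeGate
    "http://www.loc.gov/",
    datetime(2005, 1, 1, tzinfo=timezone.utc),
)
print(req.full_url)
print(req.get_header("Accept-datetime"))
```

A TimeGate that supports the protocol would normally answer with a redirect to the capture closest to the requested time, which is how a client can compare coverage across archives without knowing each archive's URL scheme.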
