Saving the Smithsonian’s Web

The Smithsonian Institution has had a presence on the Internet for more than sixteen years. It’s come a long way since then. Documenting the Smithsonian’s various websites falls under the purview of the Smithsonian Institution Archives...but how do we do it?

As a web preservation intern at the Archives this summer, I’ve helped to develop the workflow for preserving Smithsonian-affiliated web content. Our goal is to take an annual “snapshot” of all Smithsonian public websites to be kept in the Archives.

While each unit or office within the Smithsonian maintains and backs up the web content they create, the best way for the Archives to get a comprehensive snapshot of all the websites as they appear online is to use a web crawler. Crawlers, or spiders, are programs that browse the Internet by following trails of links, typically to index or save the content they encounter.
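The heart of any crawler is link discovery: parse a page, collect the URLs it points to, and queue them for fetching. Heritrix's actual frontier logic is far more sophisticated, but a toy sketch in Python of that core step, using only the standard library (the page markup and URLs here are made up for illustration), might look like this:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the href targets a crawler would queue from one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))

page = '<a href="/about">About</a> <a href="http://example.org/x">Ext</a>'
parser = LinkExtractor("http://si.edu/blog/")
parser.feed(page)
print(parser.links)
```

A real crawler repeats this for every fetched page, feeding newly discovered links back into its queue until the trail runs out.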

We use Heritrix, open-source crawling software developed by the Internet Archive, to conduct focused captures of individual websites according to our specifications and schedule. Heritrix bundles all the web content it crawls into WARC files, an archival container format for web content (standardized as ISO 28500).
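The WARC format itself is simple enough to illustrate: each record is a block of named headers (the record type, the captured URL, the payload length, and so on) followed by the captured content. Here is a simplified sketch of reading one record; real WARC files carry additional required headers such as WARC-Record-ID and WARC-Date, and concatenate many records per file:

```python
# A hand-built sample record; real WARC records have more headers.
SAMPLE = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://www.si.edu/\r\n"
    b"Content-Length: 11\r\n"
    b"\r\n"
    b"hello world"
)

def parse_record(raw):
    """Split one WARC record into version, header fields, and payload."""
    head, _, payload = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]  # e.g. "WARC/1.0"
    fields = dict(line.split(": ", 1) for line in lines[1:])
    length = int(fields["Content-Length"])
    return version, fields, payload[:length]

version, fields, payload = parse_record(SAMPLE)
print(version, fields["WARC-Target-URI"], payload)
```

In practice nobody parses WARC files by hand; purpose-built tools read and index them, which is exactly why special software is needed to view the captures.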

We need special software to view the content of the WARC files and perform a quality control check to make sure everything looks right and nothing is missing. We’re using the Wayback application, also developed by the Internet Archive. The local application looks and acts just like the Wayback Machine online. Once we’re satisfied with the captured website, we accession the WARC files and they’re officially part of the Archives’ holdings.

Future researchers will also have to use Wayback or other WARC-reading software to view preserved web collections. They might be interested in the content of web-published news releases, the structure of the Smithsonian’s extensive online image collections, or what was deemed worthy of a blog post (!).

Issues encountered

The road to web preservation is not without a few bumps. A few issues we’ve encountered are:

Estimating the size of a site. Seemingly small, innocuous websites can actually contain many thousands of documents. One of the largest single crawls so far was the website of the National Museum of Natural History’s Botany department, which took 49 hours and 57 minutes to capture 78,922 files. To budget our time, we need to estimate how big a website is, and we use software tools like link validation programs to do that.
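Conceptually, a link validator estimates size by walking the site's link graph and counting the unique pages it reaches. A minimal sketch of that idea, using a hypothetical in-memory site map in place of live fetches (a real tool would request each page over the network):

```python
from collections import deque

# Hypothetical site map: page URL -> outgoing links (illustrative only).
SITE = {
    "/": ["/about", "/botany"],
    "/about": ["/"],
    "/botany": ["/botany/specimens", "/about"],
    "/botany/specimens": ["/botany"],
}

def estimate_size(start):
    """Breadth-first walk counting unique pages reachable from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for link in SITE.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return len(seen)

print(estimate_size("/"))
```

The `seen` set is what keeps the count honest: without it, circular links (home links to Botany, Botany links back) would be counted over and over.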

Deciding what external content to capture. How do you tell a web crawler that you want it to follow a link in a blog post to a useful article elsewhere on the Smithsonian website, but not to follow a link to a spam site in the comments? For blogs, we configure Heritrix to accept embedded off-domain content, like photos from Flickr, but not to scrape linked off-domain sites. For non-blog Smithsonian sites, we don’t capture any off-domain content at all. In both instances, we can also specify any URL patterns that are acceptable.
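In spirit, that scoping decision boils down to a small set of rules applied to every discovered URL. The sketch below is illustrative only (the host patterns and function are invented for this example, not Heritrix's real configuration syntax):

```python
import re
from urllib.parse import urlparse

# Illustrative patterns, not actual crawl configuration.
IN_DOMAIN = re.compile(r"(^|\.)si\.edu$")
EMBED_OK = re.compile(r"(^|\.)(staticflickr|flickr)\.com$")

def in_scope(url, is_blog_crawl, is_embed):
    """Decide whether a discovered URL should be fetched."""
    host = urlparse(url).hostname or ""
    if IN_DOMAIN.search(host):
        return True  # Smithsonian content: always capture
    if is_blog_crawl and is_embed and EMBED_OK.search(host):
        return True  # embedded off-domain media on blogs (e.g. Flickr photos)
    return False     # linked off-domain sites: skip

print(in_scope("http://blog.si.edu/post/1", True, False))
print(in_scope("http://farm1.staticflickr.com/img.jpg", True, True))
print(in_scope("http://spam.example.com/", True, False))
```

The key distinction is between content a page *embeds* (images that are part of how the page looks) and content it merely *links to*; the first is needed for a faithful snapshot, the second is where spam and scope creep live.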

We’re still learning how best to use these tools to fit the needs of the Archives, and in the past two months, we’ve made a lot of progress: