Trevor: Have you come across anything particularly exciting or surprising in processing born digital materials?

Donald: Honestly, it’s all pretty exciting to me. It’s exciting to see applications running again, documents readable again, games playable again. Digital collections do pose challenges of scale, though, that have no real analog in traditional paper-based archives. So we’re constantly re-thinking our ‘processing’ strategies to make things accessible as efficiently as possible.

When you’re operating under the notion that essentially everything is going to be a challenge, nothing is really surprising. For example, take the screenshot below; it’s a Clarion database from 1989 from the Gay Men’s Health Crisis records. A complex object like this challenges all aspects of the traditional work of an archivist: preservation, appraisal, arrangement and description, and access.

Clarion database from 1989 from the Gay Men’s Health Crisis records

Trevor: Could you tell us a bit about your setup? What kind of equipment (hardware and software) are you working with?

Donald: We have two computers at the NYPL dedicated to working with born-digital archival collections. We have a laptop that functions as a floppy disk capture workstation, with both KryoFlux and FC5025 floppy controllers for imaging floppy disks. It will double as a capture machine out in the field, but that opportunity has yet to arrive.

Secondly, we have a Digital Intelligence FRED (Forensic Recovery of Evidence Device) workstation. The FRED is used to image hard drives and run the application Forensic Toolkit (FTK). We also have a set of Tableau write blockers, which are extremely handy to have around.

We use a third machine, a Mac, to manage the imaging workflow, which is outlined below, and to store PREMIS data for objects within our systems.
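PREMIS metadata of the kind mentioned above is typically serialized as XML. A minimal sketch of building a PREMIS 2 event record for an imaging action with Python’s standard library follows; the identifier and date values are invented for illustration, and this is not necessarily how NYPL’s systems store the data:

```python
import xml.etree.ElementTree as ET

PREMIS_NS = "info:lc/xmlns/premis-v2"  # PREMIS version 2 namespace
ET.register_namespace("premis", PREMIS_NS)

def premis_event(event_type, detail, outcome):
    """Build a minimal PREMIS event element; all values are illustrative."""
    event = ET.Element("{%s}event" % PREMIS_NS)
    id_el = ET.SubElement(event, "{%s}eventIdentifier" % PREMIS_NS)
    ET.SubElement(id_el, "{%s}eventIdentifierType" % PREMIS_NS).text = "local"
    ET.SubElement(id_el, "{%s}eventIdentifierValue" % PREMIS_NS).text = "event-0001"
    ET.SubElement(event, "{%s}eventType" % PREMIS_NS).text = event_type
    ET.SubElement(event, "{%s}eventDateTime" % PREMIS_NS).text = "2013-03-04T10:00:00"
    ET.SubElement(event, "{%s}eventDetail" % PREMIS_NS).text = detail
    outcome_el = ET.SubElement(event, "{%s}eventOutcomeInformation" % PREMIS_NS)
    ET.SubElement(outcome_el, "{%s}eventOutcome" % PREMIS_NS).text = outcome
    return event

# Hypothetical event: one floppy disk imaged on the capture workstation.
evt = premis_event("capture", "Disk imaged with KryoFlux", "success")
print(ET.tostring(evt, encoding="unicode"))
```

A record like this can be attached to each disk image so that later staff can see what was done to the object, with what, and whether it succeeded.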

Trevor: Could you tell us a bit about what you have been able to accomplish so far? In particular, what collections you are working with and what kinds of work you have been doing with them?

Donald: The main project we are actively working on is the Timothy Leary papers. It comprises at least 250 floppy disks (the archivists keep finding more in the collection), containing Leary’s writings, work related to software development, games, etc. We’ve been imaging the collection since September and plan to finish up this summer.

Other collections with digital materials we’ve worked on are the September 11th Fund records, the Gay Men’s Health Crisis records, the Morris Dickstein papers, and the Vito Russo papers. Of these collections, Morris Dickstein’s papers have had the most work done on them. The entire collection consists of WordPerfect 4 and WordPerfect 5 files, which made it possible to build some automated processes around them. In addition to imaging the collection’s media, the electronic components of the collection have been arranged and described by an archivist, and a finding aid should be published shortly. I’ve recently built a full-text index of the papers that should allow researchers to interact with the collection in ways that are not necessarily possible with paper materials.

Query “vietnam” through collection materials
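A query like this can be served by a very simple inverted index built over plain-text derivatives of the documents. A minimal pure-Python sketch, with invented sample documents standing in for migrated WordPerfect files:

```python
import re
from collections import defaultdict

def build_index(docs):
    """Map each lower-cased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(doc_id)
    return index

# Hypothetical plain-text derivatives of files in the collection.
docs = {
    "essay-01.txt": "An essay touching on Vietnam and the sixties.",
    "review-02.txt": "A book review with no relevant terms.",
}
index = build_index(docs)
print(sorted(index["vietnam"]))  # documents matching the query "vietnam"
```

A production index would add stemming, phrase search, and result snippets, but the core mapping of term to documents is the same.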

Trevor: Could you tell us a bit about the processes and procedures you have put in place? What is the workflow for born digital materials in the archives? Further, what kinds of tools are you using?

Donald: Our workflow currently has three phases to it. In the ‘imaging’ phase, the digital archives staff (myself and an assistant) work on imaging the collection materials and getting them through a set of workflow steps: imaging, metadata extraction, photography, backup and preservation packaging. After the imaging workflow step we load the image into an FTK case where an archivist can determine if the image is collection material or not.
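The backup and preservation-packaging steps above depend on fixity information for each image. A sketch of generating an md5sum-style checksum manifest for a directory of disk images, using only the standard library (the directory layout and `.img` extension are assumptions for illustration):

```python
import hashlib
from pathlib import Path

def checksum(path, algorithm="md5", chunk_size=1 << 20):
    """Hash a file in chunks so large disk images don't exhaust memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_manifest(image_dir, manifest_path):
    """Write 'checksum  filename' lines, one per image, md5sum-style."""
    lines = [
        "%s  %s" % (checksum(p), p.name)
        for p in sorted(Path(image_dir).glob("*.img"))
    ]
    Path(manifest_path).write_text("\n".join(lines) + "\n")
    return lines
```

Re-running the same function later and comparing manifests is a cheap way to verify that nothing has changed between backup copies.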

We attempt to extract filesystem metadata using FIWalk, which is now part of the Sleuthkit distribution. Failing that (the Sleuthkit somewhat notoriously neglects to support Apple HFS file systems) we generate a CSV of the image’s contents using FTK Imager. We are currently in the process of implementing Archivematica to package disk images and restored files for eventual preservation storage. I’ve also been working on producing accession reports that document the state of the digital collection upon transfer into the repository.
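fiwalk’s output is DFXML, which lists one fileobject element per file recovered from an image. A simplified sketch of pulling names and checksums out of that output with the standard library follows; the embedded sample is abbreviated and drops the real schema’s namespace declarations, so treat it as a stand-in rather than actual fiwalk output:

```python
import xml.etree.ElementTree as ET

# Abbreviated, namespace-free stand-in for real fiwalk DFXML output.
SAMPLE_DFXML = """
<dfxml>
  <volume>
    <fileobject>
      <filename>LETTER.WP</filename>
      <filesize>4096</filesize>
      <hashdigest type="md5">0f343b0931126a20f133d67c2b018a3b</hashdigest>
    </fileobject>
  </volume>
</dfxml>
"""

def list_files(dfxml_text):
    """Yield (filename, md5) pairs for every fileobject in the document."""
    root = ET.fromstring(dfxml_text)
    for fobj in root.iter("fileobject"):
        name = fobj.findtext("filename")
        md5 = None
        for digest in fobj.findall("hashdigest"):
            if digest.get("type") == "md5":
                md5 = digest.text
        yield name, md5

print(list(list_files(SAMPLE_DFXML)))
```

The same traversal pattern works for the other per-file fields DFXML carries, which is what makes it useful as a common target for accession reporting across tools.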

In the ‘analysis’ phase, the archivist processing a collection works primarily in FTK, where they carry out all of the traditional work that they would do in a paper collection: arrangement, description, separation, etc. The goal of their work is to create a set of bookmarks in FTK that will become components in a finding aid. Additionally, the archivists will look for restricted material using FTK’s search functionality and remove unnecessary duplication through filtering.

When the arrangement is done, the digital archives staff works on a ‘restoration’ phase, in which the files from a given collection are restored from its disk images. If possible, we create modern access copies with migration tools and, as outlined above, we will attempt to index the textual documents.

Another desirable outcome will be to get all files into a proper preservation repository. We do not have a tried and true methodology in place for this phase as of yet. Again we find ourselves experimenting with how Archivematica can help with this part of the workflow. I’m very excited about the potential of the format registry that will be implemented with the next iteration of the system as it could realistically cut down on what individuals currently have to do when coming up with a normalization scheme for a ‘new’ format.
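Until a shared format registry arrives, a local normalization scheme can be as simple as a lookup table mapping an identified source format to an access or preservation target. A sketch with entirely invented mappings, not an actual NYPL policy:

```python
# Hypothetical local registry: source format -> (access target, notes).
NORMALIZATION_REGISTRY = {
    "WordPerfect 4": ("PDF/A", "migrate with a WordPerfect-aware converter"),
    "WordPerfect 5": ("PDF/A", "migrate with a WordPerfect-aware converter"),
    "Clarion database": (None, "no agreed target yet; retain original only"),
}

def normalization_plan(source_format):
    """Return the registered plan, or flag the format for manual review."""
    if source_format in NORMALIZATION_REGISTRY:
        target, notes = NORMALIZATION_REGISTRY[source_format]
        return {"format": source_format, "target": target, "notes": notes}
    return {"format": source_format, "target": None,
            "notes": "new format: needs review"}

print(normalization_plan("WordPerfect 5"))
```

A shared, community-maintained version of this table is exactly what a format registry would provide, which is why it would cut down on per-repository decision-making for each ‘new’ format.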

Trevor: While it seems like the archival community’s approaches to preserving born-digital material have come a long way, there are still not many archives that have put in place robust modes of access to born-digital archival collections. Is any of the material you have processed available to researchers? If so, please describe your approach to access. If not, could you tell us a bit about how you are envisioning access to these materials in the future?

Donald: We’re just now getting ready to make our first collection available to researchers. Research access to the files will be limited to using a file viewing application, Quickview Pro, in the reading room. Researchers will be able to use the finding aid and the index application outlined above to assist their work.

I think the possibility of using Virtual Machines with the prerequisite software in the course of public service could be a very real possibility and a very useful tool. I imagine the researcher being able to log into a secured VM, where they could access digital records and have the correct tools, e.g. emulators and period-specific software, to carry out their research.

As we prepare for the eventual opening of the Leary papers, emulators will probably become a necessary part of our access program. We have numerous Amiga disks that Timothy Leary and Keith Haring exchanged for an ill-fated project that will likely require proper emulation and software to access.

Trevor: For anyone interested in getting involved in working as a digital archivist what do you think the most important skills are to develop? Further, where did you pick up these skills?

Donald: Knowing the basics of computer forensics can get you a long way. You can do an awful lot with a machine with just the Sleuthkit installed on it. If you’re going to be imaging media, a good understanding of digital preservation practices is essential. Also, some facility with programming or scripting will be useful. A number of the archivists at the NYPL have been taking the Digital Archives Specialist curriculum through the Society of American Archivists. Cal Lee’s ‘Digital Forensics for Archivists’ workshops from that curriculum are a great place to start.

My first interaction with digital archives was probably quite common among other practitioners. I was handed a stack of floppy disks one day and asked if I could get the data off of them. After tinkering for some time, I was able to take a course on computer forensics. Through this class I became familiar with forensic techniques and applications: imaging, file carving, etc. I was lucky enough to work at one of the partner institutions while the AIMS project was going on and was exposed to what the other partner institutions were doing in regards to managing digital archives.

Trevor: What do you think are the biggest things that could make you and your colleagues working with born digital materials even more productive? Are there particular kinds of tools that would help? Are there different standards or practices that you could imagine helping?

Donald: The tools clearly aren’t ‘there’ yet, at least not tools specific to our community of practice. While computer forensics tools are quite mature, they could all use tweaks to work better for digital archives. I constantly find myself devising workarounds because the tools were designed for legal or criminal investigations, not for working with archival collections. One of the glaring problems with these tools is the lack of good documentation, particularly among the open source ones.

I think the community of practice could be most helped by having a good resource for sharing our documentation. Information documenting how to get increasingly obsolete hardware safely connected to modern systems for preservation transfer is particularly scarce. For instance, we have a 44-megabyte SyQuest cartridge from the Leary archives dated from around the time of his death. I’ve already spent several hours investigating and purchasing the hardware, and I foresee it taking several dozen more to get everything working for transfer. I’d hope that the work I have to put into this will eventually help someone else who encounters this format in a collection.

The BitCurator project looks very promising in terms of delivering the necessary applications and tools for working with digital collections forensically within a library or archives. The project has already documented a tag library for DFXML, a not-yet-standardized schema used by different forensic tools (most notably fiwalk). The project is also poised to deliver reporting tools that could standardize some, or many, of the steps in the workflows of repositories currently using forensic tools in archival settings.

