A blog looking at the world from a somewhat scientific and technological perspective.

Wednesday, May 21, 2014

Sustainable Long-Term Digital Archives

How do we build long-term digital archives that are economically sustainable and technologically scalable? We could start by building five essential components: selection, submission, preservation, retrieval, and decoding.

Selection may be the least amenable to automation and the least scalable, because the decision whether or not to archive something is a tentative judgment call. Yet, it is a judgment driven by economic factors. When archiving is expensive, content must be carefully vetted. When archiving is cheap, the time and effort spent on selection may cost more than archiving rejected content. The falling price of digital storage creates an expectation of cheap archives, but storage is just one component of preservation, which itself is only one component of archiving. To increase the scalability of selection, we must drive down the cost of all other archive services.

Digital preservation is the best understood service. Archive content must be transferred periodically from old to new storage media. It must be mirrored at other locations around world to safeguard against natural and man-made disasters. Any data center performs processes like these every day.

The submission service enters bitstreams into the archive and enables future retrieval of identical copies. The decoding service extracts information from retrieved bitstreams, which may have been produced by lost or forgotten software.

We could try to eliminate the decoding service by regularly re-encoding bitstreams for current technology. While convenient for users, this approach has a weakness. If a refresh cycle should introduce an error, subsequent cycles may propagate and amplify the error, making recovery difficult. Fortunately, it is now feasible to preserve old technology using virtualization, which lets us emulate almost any system on almost any hardware. Anyone worried about the long term should consider the Chrome emulator of Amiga 500 (1987) or the Android emulator of the HP 45 calculator (1973). The hobbyists who developed these emulators are forerunners of a potential new profession. A comprehensive archive of virtual old systems is an essential enabling technology for all other digital archives.

The submission and retrieval services are interdependent. To enable retrieval, the submission service analyzes bitstreams and builds an index for the archive. When bitstreams contain descriptive metadata constructed specifically for this purpose, the process of submission is straightforward. However, archives must be able to accept any bitstream, regardless of the presence of such metadata. For bitstreams that contain a substantial amount of text, full-text indexing is appropriate. Current technology still struggles with non-text bitstreams, like images, graphics, video, or pure data.

To simplify and automate the submission service, we need the participation of software developers. Most bitstreams are produced by mass-market software such as word processors, database or spreadsheet software, video editors, or image processors. Even data produced by esoteric experiments are eventually processed by applications that still serve hundreds of thousands of specialists. Within one discipline, the number of applications rarely exceeds a few hundred. To appeal to this relatively small number of developers, who are primarily interested in solving their customers' problems, we need a better argument than “making archiving easy.”

Too few application developers are aware of their potential role in research data management. Consider, for example, an application that converts data into graphs. Although most of the graphs are discarded after a quick glance, each is one small step in a research project. With little effort, that graphing software could provide transparent support for research data management. It could reformat raw input data into a re-usable and archivable format. It could give all files it produces unique identifiers and time stamps. It could store these files in a personal repository. It could log activity in a digital lab notebook. When a file is deleted, the personal repository could generate an audit trail that conforms to discipline-specific customs. When research is published, researchers could move packages of published and supporting material from personal to institutional repositories and/or to long-term archives.

Ad-hoc data management harms the longer-term interests of individual researchers and the scholarly community. Intermediate results may be discarded before it is realized they were, after all, important. The scholarly record may not contain sufficient data for reproducibility. Research-misconduct investigations may be more complicated and less reliable.

For archivists, the paper era is far from over. During the long transition, archivists may prepare for the digital future in incremental steps. Provide personal repositories. Work with a few application developers to extend key applications to support data management. After proof of concept, gradually add more applications.

Digital archives will succeed only if they are scalable and sustainable. To accomplish this, digital archivists must simplify and automate their services by getting involved well before information is produced. Within each discipline, archives must work with researchers, application providers, scholarly societies, universities, and funding agencies to develop appropriate policies for data management and the technology infrastructure to support those policies.

About Me

Eric is a technology consultant specializing in the strategic application of new technologies in academic computing and library services.

Prior to this, Eric was the Director of Library Information Technology, a Senior Research Associate and Lecturer in Applied Mathematics at the California Institute of Technology. Eric holds a computer science degree from the K. U. Leuven (Belgium) and a Ph. D. in Mathematics from the Courant Institute, New York University, NY. He is the author of papers in scientific computing, library technology, and the graduate textbook "Concurrent Scientific Computing", published by Springer-Verlag.

He chaired the OpenURL standardization committee, which developed the OpenURL ANSI standard. He is on the Board of Directors of the Networked Digital Library of Theses and Dissertations (NDLTD).