Digital Longevity

Digital Preservation

This page provides a brief overview of the wide-ranging issues encountered in preserving digital records.

Keeping digital data for more than five to ten years presents severe problems, and this has only recently been recognised. Considerable theoretical advances have been made in the last few years—but practical experience of putting this knowledge to work remains rare.

The central problem is the continual change in information technologies leading to their successive obsolescence. This technical obsolescence problem is compounded by the need to provide continuing care for digital data—its curation.

Put a book on an archive shelf and as long as it is not interfered with, the book will remain accessible and readable. A digital file on some medium—a floppy disk say—put on the same shelf will very quickly become inaccessible.

Technical obsolescence

This affects:

Storage media (and the hardware to read media). Few media will survive for more than a decade or so and remain readable; longevity is increased by adopting good practice in use and in storage. Even if the medium itself remains viable, there is doubt about whether the hardware to read it will remain available and maintained.

Applications software used to create, process and display (or “render”) data is replaced every few years. All computer users are aware of the problems of reading old files as versions of software change and whole systems are replaced.

Systems software and middleware required by applications to run in specific environments change too.

Hardware architectures are what these software systems require to run. After a relatively short while the machines on which the software is designed to run are no longer available, supported or maintained. The programs can no longer be run, and the data bit streams that need these programs to make sense of them are orphaned—they are effectively out of reach.

The first of these problems is tackled relatively easily—the bit streams of which data is comprised can be copied without loss to new media, as long as this is done in good time and good data copying practices are employed.
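The good practice referred to above can be sketched in code. The following is a minimal, illustrative example (the function names and file layout are assumptions, not part of any standard): a bit stream is copied to new media and verified bit-for-bit against a checksum before the original medium is relied upon no longer.

```python
import hashlib
import shutil
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Compute the SHA-256 digest of a file's bit stream."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def refresh_copy(source: Path, destination: Path) -> str:
    """Copy a file to new media and verify the copy is bit-identical.

    Returns the digest so it can be retained as a fixity record
    for future integrity checks.
    """
    original_digest = sha256_of(source)
    shutil.copyfile(source, destination)
    if sha256_of(destination) != original_digest:
        raise IOError(f"Copy of {source} does not match the original")
    return original_digest
```

Because the copy is verified against the digest of the original, a refresh carried out "in good time" loses nothing: the bit stream on the new medium is provably identical to the old one.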

At the moment there are few methods for the long-term preservation of digital content (information) and systems behaviours (applications) over time as successive hardware and software technologies to read and interpret bit-streams become obsolete. There are variants within each option, but they may be summarised as follows:

Migration: This requires transforming data from one format to another successively as technologies change. This is a well understood process, but generally it loses information and can be expensive and time-consuming. Costs are recurring and errors are cumulative. It is sometimes referred to as conversion.

Emulation: This entails keeping the original data and application software and creating programs, as and when needed, which emulate the behaviours of successive computer systems, thus enabling the original application and data to be processed – emulated – on contemporary architectures. This may prove more cost-effective than migration, and promises more faithful preservation of both content and behaviours.

Formal descriptions: The use of formal descriptions, such as a Universal Virtual Computer (UVC), has been proposed. The behaviours of the original application are encoded at the originating time using a formal language which can be understood by the UVC in the future; the abstract UVC is designed so that a real, functioning instance of it will be easy to create in the future, and will be able to emulate the original application on contemporary architectures. This method, and variants of it, are still in development.

Digital archaeology: Analogous to the recovery of physical artefacts, it involves recovery in the future on an as-needed or exploratory basis. It transfers cost to the future at the high risk of loss or future misinterpretation.

Computer museums: This strategy proposes to archive whole systems, including hardware and systems software, so that they can be used in the future. Continuing costs, dwindling available expertise and physical decay of hardware will limit this approach. (However, it was, essentially, one of the suggestions from the USA’s Food and Drug Administration guidance to 21 CFR Part 11—guidance which has since been withdrawn for review.)

Increasing difficulty: Preservation is becoming ever more complex, despite the use of open standards, with ever more heterogeneous data types, multimedia, linked structures, and dynamic and distributed data.
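The migration option above can be sketched with a deliberately small example. This is an illustration only (the file names and formats are assumptions): tabular data is carried forward from CSV to JSON, and while the rows survive, format-specific details such as the delimiter and quoting conventions do not travel with them—a small instance of how each successive migration can shed information.

```python
import csv
import json
from pathlib import Path

def migrate_csv_to_json(src: Path, dst: Path) -> None:
    """Migrate tabular data from CSV to JSON.

    The row content is preserved, but CSV-specific details
    (delimiter choice, quoting style) are not represented in the
    target format: each conversion step can lose information.
    """
    with src.open(newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    dst.write_text(json.dumps(rows, indent=2), encoding="utf-8")
```

In practice each such transformation must be repeated as the target format itself ages, which is why the costs of migration recur and why any errors it introduces accumulate across generations.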

Curation—continuing care

Digital data needs continuing care; we call this its curation. Not only does it require the interventions to preserve its content and behaviours described above, it requires continual management of:

Media—to make sure it is still viable, and to make copies as media technologies change;

Systems to provide access to the information;

Monitoring for signs of decay in media and software systems;

Information which is needed to help run old systems or to interpret encoded data structures and meanings.
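The monitoring task in the list above is often done by keeping a manifest of checksums recorded at ingest and periodically auditing the holdings against it. The sketch below assumes a simple JSON manifest mapping file names to SHA-256 digests (this layout is an assumption for illustration, not a standard):

```python
import hashlib
import json
from pathlib import Path

def file_digest(path: Path) -> str:
    """Compute the SHA-256 digest of a file's bit stream."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def audit(manifest_path: Path) -> list[str]:
    """Compare each file's current digest with the one recorded at ingest.

    The manifest is assumed to be a JSON map of file name to SHA-256
    digest, stored alongside the files it describes. A mismatch or a
    missing file is an early sign of media decay needing intervention.
    """
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    problems = []
    for name, expected in manifest.items():
        target = manifest_path.parent / name
        if not target.exists():
            problems.append(f"missing: {name}")
        elif file_digest(target) != expected:
            problems.append(f"corrupt: {name}")
    return problems
```

Run on a schedule, such an audit turns silent decay into an actionable report, so that copying to fresh media can happen while an intact copy still exists.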

These needs in turn raise issues of institutional longevity to provide for this continuing care, as well as related questions about continuing financial provision and possible returns from the kept data. Thus:

Appraisal—why data is to be kept and for how long;

Disposition—when data is to be destroyed and how thoroughly the record of its existence should be expunged.

Lastly, we need to consider issues about the status of the information: to what extent maintaining the integrity of content and behaviours is important, and the need to maintain security, confidentiality, authenticity, access controls and audit trails of use and change.