“Land of the lost”: a discussion of what can be preserved through digital preservation

Author:

Nick del Pozo

Andrew Stawowczyk

David Pearson

Publication date:

Monday, 1 February, 2010

Abstract:

This article brings together and clarifies a number of key digital preservation theories. It proposes the concept of preservation intent: a clear articulation of a commitment to preserve an object, the specific elements of that object that should be preserved, and a clear time line for the duration of preservation. It investigates these concepts through simple and practical examples.

Introduction

One of our colleagues once remarked that to outward appearances, the field of digital preservation is like a monastic order, whose chanting is gibberish to those outside of its walls. While this might be true, it might also be fair to say that even for those of us who are ‘monks’, the chanting is still, at times, inconsistent and fragmentary.

This paper does not attempt to explicate all aspects of digital preservation. It does try to illuminate and bring together some of the disparate concepts that the authors believe are core to dealing effectively with digital objects, particularly in the context of deciding how to allocate the most appropriate preservation actions for digital objects.

This paper is written from the perspective of collecting institutions that have either a mandate or a desire to preserve access to digital objects over time. It discusses what the authors believe is possible to preserve, and how our interaction with digital objects, and what we expect to get out of those interactions, may influence the actions that we take in order to preserve the objects.

It has been argued previously, for example in Strodl et al. (2007), that planning is a vital part of any realistic preservation strategy. The authors agree with this view, and it is advocated here that it is necessary to have a clear and realistic articulation of what is expected from any long‐term preservation strategy before the individual preservation actions that make up that strategy can be decided. This articulation is referred to in this paper as the ‘preservation intent’ for an individual object, or collection of related objects.

It is hoped that this paper is of use for collecting institutions who have a need or desire to engage with their digital objects at a more meaningful level, and to more effectively provide appropriate access to their digital objects.

The degradation of physical materials

There are a number of concepts which are true of any object that we wish to preserve. We know that all physical materials are affected by external factors. Irrespective of the mitigating actions we take, given enough time, all physical information carriers will eventually degrade until any information they carry is lost. In order to counteract the eventual degradation of physical materials, there will generally be some point at which the information contained in these carriers will have to be duplicated or reconstructed on another instance of the medium, or a new medium altogether. For example, eventually the pages in a book will turn to dust, and unless the information they carry is transcribed to some other location, it will be lost.

For those objects that carry information, the external factors which affect the object are not only environmental, but in some instances also intellectual. This is the case, for example, when the language in which the information carried by an artefact is expressed is no longer commonly spoken. In these cases, even though duplicating the object may be enough to prevent the degradation of the information it carries, it may not be enough to preserve the meaning of that information. It may be necessary to preserve the meaning or ‘gist’ of the information via interpretation or translation. For example, an institution may come into possession of a clay tablet inscribed using a very old writing system, such as Cuneiform. Although it may be possible to preserve the original form of the writing on the tablet, there will eventually come a time when, unless the writing is translated into another language, accessing the meaning of the text will become increasingly difficult or impossible.

Irrespective of how we choose to preserve these objects, whether by duplicating them or reinterpreting the information they carry into another language or dialect, there is always a degree to which some elements of the original will be lost. Even if the information on a physical carrier were copied to an equivalent medium, such as a clay tablet transcribed to another clay tablet, it would be very hard to create an exact duplicate that included, for example, the original tool marks. There is always a degree to which one aspect of the information conveyed by the original artefact, be it a clay tablet or a book, must be replaced or changed, in order to facilitate the survival of another of its aspects.

Digital objects

In the context of preserving digital objects, these ideas generally hold true. Like the information scored into a clay tablet, all digital information is stored in a physical medium, such as a hard disk, and therefore is vulnerable to the same external factors as any other information carrier. For example, in the same way a book will eventually decompose into dust, a floppy disk will eventually degrade into its base elements. But, unlike the pages in a book, in which the content might be intelligible up until the pages are too brittle to touch, the information stored on a floppy disk is more likely to become corrupted and inaccessible long before the disk itself has decomposed.

In the case of digital objects, however, losing information due to the degradation of physical material is not so great a problem as it is with other forms of information storage. We tend to have a much more abstract concept of what constitutes the ‘original form’ of a digital object, which allows us to duplicate this kind of information more effectively than we can a book or a clay tablet. Although the physical and logical mechanisms for reading and writing information can vary greatly between mediums, we are open to the idea that a digital object can be ‘moved’ from one carrier to another without changing or losing any aspect of its original form, even if this is technically not the case. For example, although we can copy a digital object from an optical disc to a magnetic tape without incurring any ‘change’ at the bitstream level (the sequence of zeroes and ones), the ways in which optical discs and magnetic tape store data are so different that, if examined at a physical (microscopic) level, the two versions of the same digital object would bear no similarity.
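
The idea that a copy is ‘unchanged’ at the bitstream level can be checked in practice by comparing checksums. The following is a minimal sketch, not a description of any institution's workflow: the two carriers are simulated as temporary directories, and the function names `bitstream_digest` and `copies_match` are assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def bitstream_digest(path: Path) -> str:
    """Return the SHA-256 digest of the bitstream read from `path`."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so that large objects need not fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(original: Path, copy: Path) -> bool:
    """True when both carriers yield the same bitstream."""
    return bitstream_digest(original) == bitstream_digest(copy)


# Two hypothetical carriers, simulated here as two temporary directories.
with tempfile.TemporaryDirectory() as disc, tempfile.TemporaryDirectory() as tape:
    source = Path(disc) / "image.bin"
    target = Path(tape) / "image.bin"
    source.write_bytes(bytes([0, 255, 128, 64]))
    target.write_bytes(source.read_bytes())  # the 'move' between carriers
    same_bitstream = copies_match(source, target)
```

The comparison says nothing about the stored form: two carriers that differ completely at the physical level still produce matching digests as long as the bitstream read from each is the same.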

Digital objects are clearly open to the same risks as non‐digital objects and, in some instances, these risks are realised in digital objects long before they are in non‐digital objects. Digital objects also have a large number of external dependencies, such as hardware and software. Losing access to these dependencies might prevent us from deriving meaning from a digital object long before we lose the ability to read the language in which the information stored in that digital object is expressed.

For example, if a library keeps a Coptic Bible printed on vellum, so long as the reader is capable of translating from Coptic, it should still be possible to derive meaning from the information stored on its pages for the usable lifespan of the material. On the other hand, if the library owns a digital copy of the same book, then not only will the reader have to be capable of translating from Coptic, but access will also depend on being able to open the file in a meaningful way. If the library loses access to the software used to access the file, then it might become impossible to extract any information from the object.

When we speak about the preservation of a digital object, we are in some ways dealing with something far more complex than the preservation of a non‐digital object. Not only is the physical form of and the information conveyed by a digital object open to the same dangers as non‐digital objects, but in order to access the information a digital object contains, there are usually a greater number of external dependencies to account for.

Additionally, in the same way that a book is composed of many pages, a digital object may be composed of many smaller parts. For example, a web site can be thought of as a single digital object, but each of the individual HTML files and image files can also be thought of as individual digital objects. Not only must the dependencies for each individual part be accounted for, but the relationships between each component part must also be maintained, in order to preserve the context of the original. In the PREMIS 2.0 vocabulary, these hierarchical types of object are referred to variously as either ‘Compound Objects’, or ‘Representations’ (PREMIS Editorial Committee 2008). For the sake of simplicity, this document simply assumes that the term digital object potentially implies a logical encapsulation.

To illustrate the process required for access to digital objects in their normal day to day use, consider a simple digital image comprised of a single file, stored on a CD‐ROM. Before any action can take place on the image, there must be an appropriate mechanism for reading the stored form of the digital object. So, a computer with a CD‐ROM drive will be required. Using the CD‐ROM drive, it will be possible for a computer to interpret the stored form of the file into an abstract ‘bitstream’—an idealised transformation of the stored form, generally represented as a sequence of zeros and ones. Technically speaking, the information is still in a physical form, but it is now stored in a computer’s Random Access Memory, which makes it available to other programs on that computer.

Once the file has been abstracted as a bitstream, specific software is required to interpret further information from the binary sequence. For example, an application such as Photoshop could decode the bitstream of the image file into a series of colour values that represent the individual pixels that make up the image.

Finally, the information that has been derived from the bitstream can be interpreted into a presentation that can be relayed to the user. For example, the colour values could be displayed on the screen or sent to a printer, resulting in an image that the user can recognise. The information is not necessarily restricted in the number of ways it is presented. Another potential presentation might be to just show the brightness or saturation of each point, but not the colour. This could also be relayed to the user in a variety of ways.
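
The sequence described above, from bitstream to information to presentation, can be sketched in a few lines. This is an illustration only: the image format is a toy layout invented for the example (it is not TIFF, JPEG or any real format), and every name in the code is hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]

# Hypothetical toy format: byte 0 = width, byte 1 = height, then one
# red/green/blue byte triple per pixel, row by row.
STORED_BYTES = bytes([2, 1, 255, 0, 0, 0, 0, 255])


def to_information(bitstream: bytes) -> Tuple[int, int, List[Pixel]]:
    """Decode the bitstream into width, height and pixel colour values."""
    width, height = bitstream[0], bitstream[1]
    body = bitstream[2:]
    if len(body) != width * height * 3:
        raise ValueError("bitstream length does not match the declared size")
    pixels = [tuple(body[i:i + 3]) for i in range(0, len(body), 3)]
    return width, height, pixels


def present_colour(pixels: List[Pixel]) -> List[str]:
    """One presentation: each pixel as a hexadecimal colour string."""
    return ["#{:02x}{:02x}{:02x}".format(*p) for p in pixels]


def present_brightness(pixels: List[Pixel]) -> List[int]:
    """Another presentation: brightness only, as the mean of the channels."""
    return [sum(p) // 3 for p in pixels]


width, height, pixels = to_information(STORED_BYTES)
colours = present_colour(pixels)          # ['#ff0000', '#0000ff']
brightness = present_brightness(pixels)   # [85, 85]
```

In this toy case a red pixel and a blue pixel share the same brightness value, so the brightness presentation cannot be used to recover the colour presentation, even though both are derived from the same information.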

Aspects of a digital object

Because it is such a complex entity, it is useful to consider a digital object as having a number of different facets, which can be examined and dealt with individually. There has been much work done to articulate how digital objects can be divided into descriptive elements, usually in the context of defining the significant ‘properties’ or ‘characteristics’ of the file or collection in question. The use of these terms has generally come to suggest the information‐level properties of a file which should be preserved between preservation actions. Andrew Wilson (2007) provides a good overview of these ideas and their development in his ‘Significant Properties Report’. However, the concept of ‘significant properties’ can also be interpreted in a much broader sense, and can be understood to signify any particular property of a digital object that is significant to the preserver, at any level of the digital object. This is an observation that was made in Dappert and Farquhar (2009), who remind us that: “[an] idea, concept, act, or thing is not inherently significant. A stakeholder attributes significance to something, typically in a context relevant to some purpose or goal”. This is the interpretation of ‘significant properties’ used in this paper.

It is perfectly reasonable to maintain a subjective view of what constitutes the ‘significant’ details in the case of individual or classes of digital objects. However, depending on the context for a digital object’s preservation, it is also useful to consider the concept of significant properties in the context of the various ‘forms’ that a digital object undergoes during its lifecycle.

Irrespective of whether a digital object is regarded at various points throughout its lifecycle as being a singular entity or as being a collection of components, it is possible to view a digital object as having different forms, depending on how it is currently being interacted with. These are:

Stored: All digital objects are stored in a physical medium, such as the pits on an optical disc or the magnetic charges on a tape. There is no requirement that all the components of a digital object be stored in the same way. For example, before the advent of the Internet, which negated the medium, some early computer Bulletin Board Systems (BBS)—public forums accessed mostly via Telnet or a dial‐up modem—stored a large part of their content on hard disk but also made certain files stored on a local CD‐ROM drive available to users. If understood as being a single digital object, it could be said that the BBS was stored across two different mediums.

Binary: Assuming that the appropriate hardware is present, the stored form of a digital object can be translated into a bitstream. Kevin Bradley describes a bitstream as ‘a state of being and a state of not being, of on and off, of plus and minus, of falling below or climbing above a defined or given threshold’ (2007). Technically, a bitstream still has a physical form, as the information is held as electrical charges in Random Access Memory. One advantage of digital objects is that identical bitstreams can be derived from two completely different stored forms.

Information: The binary form of a digital object is a means of encoding information, such as specific strings of characters or numerical values, which might represent anything from the name of a photographer to the colour value of a single pixel. Using the appropriate software, this information can be decoded from the binary form; but once the software is lost, it is generally difficult to reconstruct. Just as identical bitstreams can be generated from two different stored forms, equivalent information can easily be stored in many different binary arrangements.

Presentation: There are many ways to present the information contained in a digital object. For example, an image could be presented simply as the colour values that make up the picture, but also as just the brightness or saturation of each point. At the same time, there may be non‐visual information, such as the metadata tags in a TIFF file, which could be presented as plain text. As there are innumerable ways of presenting the information in some digital objects, it is probable that many users may only ever perceive a small fragment of the potential presentations for a digital object.
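
The distinction between the binary and information forms can be illustrated with a single piece of text. In this minimal sketch the metadata value is hypothetical, and two standard text encodings stand in for two different file formats: the bytes differ, the decoded information is the same, and reading the bytes with the wrong decoder yields something else entirely.

```python
name = "Zoë Ng"  # hypothetical metadata value, e.g. a photographer's name

# Two different binary arrangements of the same information.
utf8_form = name.encode("utf-8")
utf16_form = name.encode("utf-16-le")

different_binary = utf8_form != utf16_form
same_information = utf8_form.decode("utf-8") == utf16_form.decode("utf-16-le")

# Decoding with the wrong 'software' still produces text, but not the
# information that was originally encoded.
wrong_decoder_text = utf16_form.decode("latin-1")
```

This is the sense in which losing the appropriate software puts the information form at risk even when the binary form survives intact.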

Depending on the perceived context for preservation, the significant properties of a digital object may vary depending on which form of the object is being considered. For example, what is significant about the binary form of a digital object will be different to what is significant about the information form. The authors suggest that considering a digital object as having the above potential forms will, in some cases, assist in identifying what properties of an object should be preserved.

Interacting with digital objects

Because there are various aspects to a digital object, there are also various ways in which we can interact with a digital object. While it is possible for a human to interact with the stored form of a digital object, this is not usually the case. For instance, it is possible to see the stored form of data on a punch card with the naked eye, but to see this on an optical disc would require a microscope. Therefore, when we want to interact with the information that a digital object contains, a technological intermediary is required to derive any real meaning. As such, when we interact with the information carried by a digital object, it is usually in the context of a representation of that digital object, rather than with the digital object itself.

The term ‘representation’ is used in the literature to mean a variety of things. For example, in the PREMIS 2.0 vocabulary, ‘representation’ is used to specify a file, or collection of files which provide the necessary components for a rendition of an intellectual entity, such as a journal or newspaper article (PREMIS Editorial Committee 2008). In the context of this paper, the term is used to indicate a single instance of a realised presentation. For example, to render a colour image of a file to a monitor will provide a representation of a particular presentation of the image, such as the colour values or the brightness values. Each time the image is rendered, this is considered a new representation.

The requirement for a technological intermediary between a user and any kind of digitally stored information has already been described at a fairly high level by the Performance Model (Heslop et al. 2002). Essentially, the performance model suggests that a user can only interact with data via a ‘process’. A process is defined by the performance model as a combination of hardware and software that can interact with the data in order to produce a meaningful and perceivable ‘performance’, which, for the purpose of this document, is mappable to a single representation of the digital object. For example, the performance model indicates that to view a TIFF or JPEG file, a user would require image viewer software and the hardware to run that software.

Therefore, to retain appropriate access to a digital object, we must maintain a mechanism to derive meaning from that object.

However, this may not be as straightforward as it first appears. What constitutes ‘appropriate’ access could vary widely, depending on the intended use of a digital object. For instance, although it may be appropriate to simply display a digital photograph saved as a TIFF via a viewing application, this may not be the case for a TIFF which is a part of a texture map for a 3D object, even though it will still successfully present one potential presentation of that data. Alternatively, even for the same digital object, there may be many perceived ‘correct’ ways of accessing the same file, depending on the viewer. For a recording of a piece of music, some users may be satisfied hearing the melody through small desktop speakers, but an audio technician interested in hearing ranges that cannot be reproduced by those speakers would find this an inappropriate representation of the information.

Another complication is that in many instances the user might not be the passive consumer represented in the performance model but might actively play a role generating the representation. This was one of the issues encountered by Winget (2005) when investigating possible methodologies for capturing digital media art performances, in particular a piece entitled ‘Loops’, in which the movements and positions of members of the audience changed the nature of the images and video being displayed.

Even for conventional digital objects, maintaining the exact same configuration of computer and software may not be enough to consistently reproduce a representation. Due to the gradual degradation of the hardware, the representation could be slightly different each time it is produced. Moreover, as has already been mentioned, and as is pointed out by Heslop et al. (2002), it is unrealistic to expect the original machines and software used to access one particular presentation of the digital object to last in perpetuity.

This is by no means presented as an insurmountable problem in the performance model. Just as the same hardware and software can potentially yield different representations over time, different hardware and software can yield an equivalent representation. The performance model states:

… neither the source nor the process need be retained in their original state for a future performance to be considered authentic. As long as the essential parts of the performance can be replicated over time, the source and process can be replaced. (Heslop et al. 2002)

What this amounts to is that even if some part of the original access environment is replaced, as long as introducing the new element does not alter the final representation, there will be no loss in our ability to access that data. Likewise, it is possible that a digital object can be changed without significantly changing its representation.

However, what is ‘authentic’ and what is the ‘essential’ part of a representation is subjective, even if the methodology for establishing what these might be is not. Although the performance model operates from a perspective in which ‘archivists are not interested in the ‘original’ record but in capturing and recreating the fleeting and temporary performance of that record’ (Heslop et al. 2002), this is not the only context for preserving digital objects. Although it is true, and has been noted above, that the concept of an ‘original’ digital object is questionable, it is still valid to suggest that even if researchers are not concerned with the ‘original copy’ of a digital object in the same way they are interested in the original copy of a paper record, they may still highly value the ‘authenticity’ of the original form of a digital object. In that case, although it may be possible to change the ‘process’ through which a digital object can be accessed, changing the ‘source’ would not be considered an acceptable mechanism for preserving access to that digital object.

Additionally, there is also a question of what would constitute an adequate transformation for the digital object. For example, it is certainly possible to ‘migrate’ a digital object from one file format to another, and to maintain the information required for one particular presentation of that data. However, the particular presentation that is preserved can only ever be subjectively considered important. Migrating a digital object may make it impossible to reproduce a different presentation of the data, which might be as or more important to a different party.

Articulating preservation intent

What constitutes the ‘appropriate’ preservation of a digital object is subjective, and is the subject of much writing and debate in the field (for example, Granger 2000, Gladney 2008 and Long and Pearson 2009). Depending on what aspects of the data are given more importance, or what presentation of the data is considered more appropriate, different mechanisms for preserving the data over time will be more or less suitable. Also, depending on how much change to the data is considered acceptable, the preservation actions required will vary in complexity.

If a preserving institution does not understand which parts of a digital object it wishes to preserve, it will be much more difficult to plan appropriate and effective strategies for preservation over any time period. It will also be very difficult to audit the effectiveness of any preservation actions. It makes more sense, therefore, for an institution to establish its expectations for preservation before any strategies are considered.

For example, if it is important to maintain certain parts of the original digital object, then preservation strategies which change the object they are preserving, such as migration, may not be as appropriate as emulation, which can leave the object unchanged. Alternatively, depending on what aspects of a digital object an institution is interested in accessing, different migration paths may be more or less appropriate. For instance, one institution may decide to normalise a spreadsheet into a PDF, which would retain the cell values but destroy any formulas used to calculate them. This would favour a particular presentation of that digital object over the information form of the original. On the other hand, another institution might decide to normalise spreadsheets into ODF, which may lose some formatting but retain the formulas used to derive each cell. In this case, being able to retain the information form of the file would be seen as more important.
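
The trade‐off between the two migration paths can be modelled with a toy spreadsheet. The sketch below does not read or write real PDF or ODF files; the `Cell` class and the two migration functions are hypothetical stand‐ins that only reproduce the losses described above.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Cell:
    value: float
    formula: Optional[str] = None  # e.g. "=A1+A2"; None for literal cells
    style: Optional[str] = None    # e.g. "bold"


Sheet = Dict[str, Cell]


def migrate_values_only(sheet: Sheet) -> Sheet:
    """Stand-in for a print-like target: keeps values and styling, drops formulas."""
    return {ref: Cell(c.value, None, c.style) for ref, c in sheet.items()}


def migrate_keep_formulas(sheet: Sheet) -> Sheet:
    """Stand-in for a spreadsheet-like target: keeps formulas, drops styling."""
    return {ref: Cell(c.value, c.formula, None) for ref, c in sheet.items()}


original = {
    "A1": Cell(2.0),
    "A2": Cell(3.0),
    "A3": Cell(5.0, formula="=A1+A2", style="bold"),
}
as_values = migrate_values_only(original)
as_formulas = migrate_keep_formulas(original)
```

After the first migration, cell A3 still shows 5.0 in bold but no longer records how that value was derived; after the second, the formula survives and the styling does not. Neither result is wrong in itself, and which one is acceptable depends on what the institution set out to preserve.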

Before any preservation actions can be carried out, it is therefore important that the ‘preservation intent’ be clarified for the object or collection of objects being preserved. ‘Preservation intent’, also used by some of the authors in Pearson and Long (2009) and Long (2009), encapsulates the requirement to preserve a digital object, the context and goals for its preservation, and an understanding of the length of time for which the digital object is to be preserved.

The following steps of assessment should take place before making any decisions about preservation strategies:

The institution should identify that there is a requirement or intention to preserve the object or objects in question.

The specific characteristics and properties of the object that the institution wishes to preserve should be identified. This may require describing the object at various levels and, presumably, will depend on both the degree to which the institution wishes to engage with the object and the degree of contextual and technical knowledge that the institution has about the object.

A time frame needs to be established that indicates the length of time those aspects should be preserved. The period might be quite fleeting, just a few months or days. On the other hand, it might be intended that an object remain preserved ‘forever’.
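
The three steps above could be recorded in a simple structure so that an incomplete statement of intent is easy to spot. This is a minimal sketch under stated assumptions: the class name, field names and example values are all hypothetical and are not taken from any existing system or standard.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class PreservationIntent:
    object_id: str
    commitment: bool = False  # first step: intention to preserve
    significant_properties: List[str] = field(default_factory=list)  # second step
    retain_until: Optional[date] = None  # third step: end of the time frame
    retain_forever: bool = False         # third step: open-ended retention

    def missing_steps(self) -> List[str]:
        """List the assessment steps that have not yet been recorded."""
        missing = []
        if not self.commitment:
            missing.append("commitment")
        if not self.significant_properties:
            missing.append("significant_properties")
        if self.retain_until is None and not self.retain_forever:
            missing.append("time_frame")
        return missing


draft = PreservationIntent(object_id="image-0001", commitment=True)
complete = PreservationIntent(
    object_id="image-0001",
    commitment=True,
    significant_properties=["image size", "colour space", "photographer name"],
    retain_until=date(2035, 1, 1),
)
```

Here `draft.missing_steps()` reports that the significant properties and the time frame are still undefined, while `complete.missing_steps()` returns an empty list. Preservation strategies would only be considered once nothing is missing.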

Clearly defining the precise terms for preservation will make it much easier to identify which preservation actions may change the object in ways that are not acceptable. Without this articulation, even though preservation actions can still be carried out on digital objects, it is likely that they will not be as effective.

To provide an extended example, reconsider the case of an image file stored on a CD‐ROM. For one institution, what makes the digital object important could be the image displayed on the screen. In this case, the preservation intent for the digital object might be that the presentation of the object remains easy to access and changes as little as possible, for the length of time the institution is responsible for the image. This means that one presentation of the digital object is given more importance than the stored, binary or information aspects of the file. This institution might decide, therefore, to copy the file from the CD‐ROM to managed storage, migrate the file into a TIFF and (for now) create JPEG delivery copies as required. Although there may be some information that might be lost, such as certain metadata tags, this might be acceptable, given that the preservation intent for the digital object is being fulfilled.

Another institution may want to retain some of the information carried in the original image’s metadata, such as the photographer’s name or the type of camera used to take the image. In this case, the institution might still decide to convert the image to TIFF and, in addition, decide to plan its preservation actions in such a way that the significant properties of the image, such as colour space or image size, or those identified in the work of Hedstrom and Lee (2002), are maintained across preservation actions. This might, for example, mean that the institution must choose a migration format which supports those particular significant properties.

Finally, in some cases, what may be important about the digital object is not only the presentation, but also the binary form of the file and the CD‐ROM it arrived on. All of these are important artefacts and the preservation intent may be to not only maintain the presentation of the image but to also preserve the original stored, binary and information aspects of the file.

As such, the institution might be more limited in the preservation actions open to it. Although it might still choose to create a TIFF version of the file for long‐term viewing, it may also have to carry out more involved preservation preparations for the original CD‐ROM. It may be necessary to create a technology maintenance plan that would ensure a computer with adequate hardware and software was available to access the optical disc over time. It may also be necessary to ensure that replacement parts could be sourced, as needed, and that the optical disc be housed in conditions that would best prolong its longevity.

To formulate the preservation intent for an object or collection of objects, digital or otherwise, is to define the expectations for what constitutes an interaction with that object or collection of objects that is both meaningful and practically feasible. The context in which an object is seen as ‘useful’ is bound to differ between institutions, as are the resources for preservation that an institution has available to it. It is also very likely, therefore, that the preservation intent for an object will differ between institutions. Indeed, there may be multiple preservation intentions for the same object even within the same institution, if it is being used in different contexts.

It has been the experience of the authors that without clear guidelines as to what should be preserved—or how—conducting any auditing action with an institution‐wide scope can be frustrating and ineffective (see Long 2008). If an institution is able to clearly articulate the terms under which a digital object should be preserved, it should be much easier to meaningfully audit the preservation conditions for those objects. If the range of what constitutes acceptable change in a digital object is articulated, this should also help to ensure that when change to any given aspect of a digital object occurs, it is only ever as a part of a planned and predictable action. This is a vital part of ensuring a sense of confidence in and accountability for large‐scale preservation strategies.
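
Once the acceptable range of change has been articulated, an audit can be reduced to comparing recorded properties against the stated intent. The following sketch assumes that property values have already been extracted before and after a preservation action; the property names, values and the function `audit_change` are hypothetical.

```python
from typing import Dict, List, Set


def audit_change(before: Dict[str, object],
                 after: Dict[str, object],
                 significant: Set[str]) -> List[str]:
    """Return the significant properties that differ between two versions."""
    return sorted(
        name for name in significant
        if before.get(name) != after.get(name)
    )


# Hypothetical property values recorded before and after a migration.
before = {"width": 800, "height": 600, "colour_space": "sRGB",
          "camera_model": "ExampleCam 1"}
after = {"width": 800, "height": 600, "colour_space": "sRGB"}

# Two hypothetical statements of intent for the same object.
display_only = {"width", "height", "colour_space"}
with_metadata = display_only | {"camera_model"}

display_findings = audit_change(before, after, display_only)
metadata_findings = audit_change(before, after, with_metadata)
```

The same migration passes the audit under the first intent and fails it under the second, because the lost camera metadata is significant only to the second. The audit result is therefore only meaningful relative to an articulated preservation intent.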

Conclusion

Over time, unless we take some form of mitigating action, it is likely that our access to digital objects will be compromised. As such, there is strong reason to believe that there will be change, be it in the object itself, its access dependencies or even in our ability to understand the information it carries. We must ensure that when this change occurs, it does so in a way that we have planned, and for which the consequences are predictable.

This paper has presented what the authors believe are some of the essential ideas and thinking about digital preservation. The aim of this paper has been to assist both the National Library of Australia and other institutions to think about digital objects in ways that will help to identify which preservation actions are most appropriate for a particular circumstance. It has examined the basic nature of digital objects and how users interact with those objects.

The paper has argued for the importance of establishing a clear preservation intent before attempting to engage with a preservation strategy for digital objects. Because of the highly subjective nature of our interactions with digital objects, we do not necessarily need to strive towards the most ‘authentic’ reproduction possible. Rather, what constitutes ‘meaningful’ access, or an ‘authentic’ reproduction, depends on the preservation intent for the object being preserved. Because of this, the simplicity or complexity of any preservation solution will be governed by the approach an institution takes to preserving its digital objects, and the depth with which it engages with that approach.

There will always be an unavoidable degree of change in our objects over time. Whether it is the stored form of a file or the presentation of a complex digital object, we will always be forced to make decisions favouring the permanence of certain aspects of an object over others.

This paper has undertaken to bring together and clarify some of the core ideas and theories in digital preservation, in order to better facilitate the minimisation of change in the digital objects stored by the National Library of Australia. It is hoped that this will prove useful in clarifying some of the terminology and concepts to both those who are in or are yet to be initiated into the ‘order’.