A Pragmatic Approach to Preferred File Formats for Acquisition

Submitted by editor on 30 April 2010 - 12:00am

Dave Thompson sets out the pragmatic approach to preferred file formats for long-term preservation used at the Wellcome Library.

This article sets out the Wellcome Library's decision not explicitly to specify preferred file formats for long-term preservation. It discusses a pragmatic approach in which technical appraisal of the material is used to assess the Library's likelihood of preserving one format over another. The Library takes as its starting point work done by the Florida Digital Archive in setting a level of 'confidence' in its preferred formats. The Library's approach provides for nine principles to consider as part of appraisal. These principles balance economically sustainable preservation and intellectual 'value' with the practicalities of working with specific, and especially proprietary, file formats. Scenarios are used to show the application of principles (see Annex below).

This article will take a technical perspective when assessing material for acquisition by the Library. In reality technical factors are only part of the assessment of material for inclusion in the Library's collections. Other factors such as intellectual content, significance of the material, significance of the donor/creator and any relationship to material already in the Library also play a part. On this basis, the article considers 'original' formats accepted for long-term preservation, and does not consider formats appropriate for dissemination.

This reflects the Library's overall approach to working with born digital archival material. Born digital material is treated similarly to other, analogue archival materials. The Library expects archivists to apply their professional skills regardless of the format of any material, to make choices and decisions about material based on a range of factors and not to see the technical issues surrounding born digital archival material as in any way limiting.

Why Worry about Formats?

Institutions looking to preserve born digital material permanently, the Wellcome Library included, may have little control over the formats in which material is transferred or deposited. The ideal intervention point from a preservation perspective is at the point digital material is first created. However this may be unrealistic. Many working within organisations have no choice in the applications they use, cost of applications may be an issue, or there may simply be a limited number of applications available on which to perform specialist tasks. Material donated after an individual retires or dies can prove especially problematic. It may be obsolete, in obscure formats, on obsolete media and without any metadata describing its context, creation or rendering environment.

Computer applications 'save' their data in formats, each application typically having its own file format. The Web site filext [1] lists some 25,000 file extensions in its database.

The long-term preservation of any format depends on the type of format, issues of obsolescence, and availability of hardware and/or software, resources, experience and expertise. Any archive looking to preserve born digital archival material needs to have the means and confidence to move material across the 'gap' that exists between material 'in the wild' and holding it securely in an archive.

This presents a number of problems: first, in the proliferation of file formats; second, in the use of proprietary file formats, and third, in formats becoming obsolete, either by being incompatible with later versions of the applications that created them, or by those applications no longer existing. This assumes that proprietary formats are more problematic to preserve as their structure and composition are not known, which hinders preservation intervention by imposing the necessity for specialist expertise. Moreover, as new software is created, so new file formats proliferate, and consequently exacerbate the problem.

Working with File Formats

The Library has two situations in which file formats are an issue: current formats and obsolete formats. The first is less of an issue in that the Library works with its donor/creator community to receive material from them in current formats and on current media. To date, formats have included Microsoft Office files, JPEG, TIFF and text files, commonly transferred on CD-ROM. Generally this material is easier to work with because of its currency.

The second situation is one in which material is received in an obsolete format, sometimes on obsolete media. Experience suggests that each transfer of obsolete material requires its own unique approach, is more time-consuming and much more difficult to handle. Appraisal is complex in that much work must be done before the relevance of any material actually reveals itself.

Because of the variety of file formats, and the differing circumstances of their transfer, the Library does not intend to provide a 'set list' of file formats it can or cannot accept for accession. Instead it sets out a series of nine broad principles against which any file format can be rated. Acceptability of any format is based upon the Library's belief in its ability to manage and preserve that format successfully into the future, set against the intellectual 'worth' of that material. In the Library the process is known as technical appraisal.

The Library provides three levels of confidence in its ability to preserve material: high, medium and low. These levels of confidence are based on resources available, the availability of tools for managing digital material and experience with the life cycle management of born digital materials. This approach is based on work done by the Florida Digital Archive [2].

The Florida Digital Archive

The Florida Digital Archive has published its own table of preferred data formats [3]. Its purpose is to '...help Florida University administrators develop guidelines for preparing and submitting files to the Florida Digital Archive.' The table lists a series of media: text, audio, video, etc. Using a grid, specific formats are set out under one of three preservation 'confidence' headings: high confidence level, medium confidence level and low confidence level. High confidence is expressed in formats that are 'simple', eg text, or which have open or published specifications, eg TIFF. Low confidence is expressed in formats which are proprietary, closed or protected by rights mechanisms, eg Real audio or Microsoft Office formats.

This approach by the Florida Digital Archive is very useful. However, it remains prescriptive. The Wellcome Library aims to collect a wide range of material from a diverse donor community. To restrict acquisitions only to materials which meet a narrow set of criteria, ie format, is to set a pre-emptive selection policy that may exclude material of value. It also sends an unhelpful message to donor/creators that format is somehow of primary importance, which is not the case. The Library is seeking an approach that offers flexibility but which does so within a defined framework: one which allows it to accept a wide range of formats, but which seeks to understand the implications of accepting any format.

The Wellcome Library's Model

The acceptability principles as used by the Library and discussed in this article are:

Principle: Formats in current, widespread or common use
Definition: Includes formats created by applications currently in widespread or common use, eg Microsoft Word. The concept of 'current' will change over time.

Principle: Formats which are non-proprietary
Definition: Includes formats created or renderable by open source/freeware or other applications designed for non-commercial use, eg OpenOffice formats.

Principle: Formats which are standards-based
Definition: Includes formats which are supported by one or more published international standards defining their technical/logical/structural properties, documentation which is publicly available, eg MPEG.

Principle: Formats for which specifications are publicly available
Definition: Includes 'open' formats which are supported by documentation defining their technical/logical/structural properties, documentation which is publicly and freely available, eg TIFF or OpenOffice.

Principle: Formats which offer platform-independence
Definition: Includes formats which can be rendered by many applications, and have no dependence upon any single application, eg JPEG, XML.

Principle: Formats which are uncompressed
Definition: Includes formats which have not been committed to a type that is based upon data loss, ie 'lossy' compression, eg uncompressed JPEG2000.

The remaining principles, whilst not specifically related to any one format, play an important role in the acquisition and appraisal process. They are considered in relation to discussions regarding the consequences of accepting any format; most especially, with regard to the economic consequences of accepting formats, particularly obsolete ones, and the cost of recovering a viable datastream.

These additional principles, not directly related to format, but which affect appraisal decisions are:

Principle: Considerations of the intellectual content of the file(s) and/or the importance of the creator
Definition: Considerations of the intellectual content of the file(s) are essential and should include a format's ability to be maintained in an 'authentic' condition.

Principle: Considerations of how material complements existing material
Definition: Considerations of how material that is offered to the Library complements, supports and extends existing content are essential.

Principle: Economic/resource implications
Definition: Consideration of the total cost of any data recovery, cost of format migration and cost of long-term management. Factors balance economic cost with human or technical resources required.

Cost and the resources required to work with obsolete media and files will play a role in the appraisal process, although cost alone will not be a factor in the decision to accept material. Subjective judgement will be applied to the provenance of the material and the likelihood that not only can material be recovered but that it will be historically significant and worthy of the effort of preservation.

Appraisal decisions can be challenging when archivists are holding obsolete media to which they have no access. If they have only sketchy or incomplete evidence of the content of the media, or its context, then justifying the cost of data recovery can prove difficult. What if the medium is blank? What if the data that medium holds are also obsolete and so also require potentially expensive and complex data recovery? This is the conundrum of working with obsolete media and formats.

Factors other than format affect the ability of the Library to perform life cycle management upon material. They include the use of compression tools to create aggregations of files, and files that are encrypted or use some form of digital rights management (DRM) controls. Files of this type or which contain these features present additional life cycle management challenges and add complexity to that process. In some cases files in these categories may be dependent upon third-party resources to be viable, eg de-compression software. Where these third party resources cannot be preserved for proprietary reasons, there may be little point in preserving the data.
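As an illustration of how such features might be spotted during technical appraisal, the following is a minimal sketch in Python. It is not a description of the Library's tooling. It covers only one container type, ZIP, and the function name `describe_container` is hypothetical. It relies on the standard `zipfile` module and on the ZIP convention that bit 0 of a member's general purpose flag marks it as encrypted.

```python
import io
import zipfile


def describe_container(data: bytes) -> dict:
    """Report whether the bytes form a ZIP aggregation and how many members are encrypted.

    Hypothetical helper for illustration only; other container and
    rights-management schemes would need their own checks.
    """
    stream = io.BytesIO(data)
    if not zipfile.is_zipfile(stream):
        return {"is_zip": False, "members": 0, "encrypted_members": 0}
    stream.seek(0)
    with zipfile.ZipFile(stream) as archive:
        infos = archive.infolist()
        # Bit 0 of the general purpose flag is set for encrypted members.
        encrypted = sum(1 for info in infos if info.flag_bits & 0x1)
    return {"is_zip": True, "members": len(infos), "encrypted_members": encrypted}


# Build a small, unencrypted sample archive in memory and inspect it.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("letter.txt", "sample text")
report = describe_container(buffer.getvalue())
print(report)
```

A report showing encrypted members would prompt the question raised above: whether the means of opening the files can itself be obtained and preserved.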

The Library's model is based on sound archival practice. It is similar to the one for working with analogue materials which may require a conservation appraisal or conservation intervention prior to accession. For some material the cost and effort of any intervention may not be justifiable.

Confidence Levels for Data Formats for Preservation Purposes

Starting with the table developed by the Florida Digital Archive, the Library has developed a set of 'confidence' principles for digital material. They are used to help staff and donor/creators identify types of format for long-term preservation. The Library's levels of confidence are based on its expertise, experience and access to technical support at the time material is offered. Rather than being hard-and-fast 'rules', they are guidelines on which selection and appraisal decisions can be consistently based, and therefore justified. Yet at the same time they are guidelines that allow for flexibility and for specific decisions to be applied to particular bodies of material.
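To make the idea of rating a format against the format-related principles concrete, here is a minimal sketch in Python. The article does not define any scoring rule, so the class `FormatAppraisal`, the function `suggested_confidence` and the thresholds are all assumptions made for illustration. In practice the level of confidence also depends on the non-format principles and on professional judgement.

```python
from dataclasses import dataclass, fields


@dataclass
class FormatAppraisal:
    """Yes/no answers to the six format-related principles (hypothetical record)."""
    current_widespread_use: bool
    non_proprietary: bool
    standards_based: bool
    public_specification: bool
    platform_independent: bool
    uncompressed: bool

    def principles_met(self) -> int:
        # Count how many of the six principles were answered 'yes'.
        return sum(bool(getattr(self, f.name)) for f in fields(self))


def suggested_confidence(appraisal: FormatAppraisal) -> str:
    """Suggest a starting confidence level for discussion.

    The thresholds are assumed for this sketch; they are not the Library's.
    """
    met = appraisal.principles_met()
    if met >= 4:
        return "HIGH"
    if met >= 2:
        return "MEDIUM"
    return "LOW"


# Hypothetical answers for an unnamed format, for illustration only.
example = FormatAppraisal(
    current_widespread_use=True,
    non_proprietary=False,
    standards_based=False,
    public_specification=True,
    platform_independent=False,
    uncompressed=True,
)
print(example.principles_met(), suggested_confidence(example))
```

Recording the answers alongside the outcome, rather than the outcome alone, is what allows a decision to be justified later and reviewed as circumstances change.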

Practical Considerations

For current material the Library expects to have few problems with access, and work with our donor/creator community helps to make this a reality. The Library accepts transfer on CD-ROM or portable hard drive, and donors are asked to provide information about formats and creating/rendering applications. The Library has access to a range of current hardware and software tools and applications to make this process practicable.
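Where a donor's information about formats is incomplete, a first rough inventory of a transfer can be taken from file extensions. The sketch below, in Python, is an assumption about how this might be done rather than a record of the Library's practice, and the function name `extension_inventory` is hypothetical. An extension is only a hint: it does not prove what format a file is actually in, so an inventory of this kind supports the conversation with the donor rather than replacing proper format identification.

```python
import tempfile
from collections import Counter
from pathlib import Path


def extension_inventory(transfer_root: str) -> Counter:
    """Count the files under a transfer directory by lower-cased extension."""
    counts = Counter()
    for path in Path(transfer_root).rglob("*"):
        if path.is_file():
            # Files without an extension are grouped under '(none)'.
            counts[path.suffix.lower() or "(none)"] += 1
    return counts


# Demonstration on a temporary directory holding three made-up files.
with tempfile.TemporaryDirectory() as root:
    (Path(root) / "minutes.DOC").write_text("a")
    (Path(root) / "scan.tif").write_text("b")
    (Path(root) / "README").write_text("c")
    inventory = extension_inventory(root)

for extension, count in sorted(inventory.items()):
    print(extension, count)
```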

The principles work well in conversations with donor/creators looking to transfer material to the Library. They provide a foundation for discussion in which considerations of practical issues can be set against intellectual appraisal. The implications of accepting one format over another can be compared and options considered. Justification for accepting, or rejecting, material can be set out plainly and clearly, all of which provides a useful educational process for all parties.

The Library expects this approach to evolve. Formats will change, experience with digital material will grow, while more and different resources will become available. The principles are flexible and pragmatic. They can be modified or reviewed at any time, without compromising past decisions. Basing them in professional archival practice provides a sound basis for the development of further professional archival expertise, and from this stems 'proper' management of the material.

Use of flexible principles supports the professional practice of archivists. The principles allow archivists to use their whole professional experience and judgement in appraising material. The principles 'free' them from having to worry about technical considerations yet provide a framework within which the latter can be addressed. This gives archivists confidence to work with born digital archival material, an important consideration as more and more archival material is created only in digital form.

Disadvantages

This approach is not without its disadvantages. It may prove over time to be entirely wrong, for the Wellcome Library at least. Archives that prescribe just a few formats as acceptable for deposit may find that fewer resources are required to manage that material in the future. The long-term cost of accepting a broad range of formats may be economically unsustainable.

It is therefore a pragmatic approach based on current experience and expertise, making use of the resources to which the Library currently has access. Equally it is an approach that helps, in the short term, to support the diversity of material ingested, but which may store up problems for the future. The complexity of 'preserving' many file formats may prove to be beyond the Library's capability. The principles work only for 'original' formats intended for long-term storage and preservation; they are not intended to specify dissemination formats.

The pragmatic approach does, potentially, leave the Library with a multitude of formats to manage in the digital object repository. This is not an ideal situation, but is one that tools such as PLANETS/PLATO might assist in addressing.

At the same time this approach does assist the Library to build closer relationships with its donor/creator community. Since it places few 'limits' on preferred formats, it does run the risk that donor/creators may come to expect that the Library will accept 'any' format. However, it matches a 'traditional' archival model in which transfer of material to an archive is based on a negotiation between archive and creator.

Conclusion

This approach is clearly based on pragmatic considerations and is appropriate for the Wellcome Library at this point in time. It is not without risk. It is not without disadvantages. It is based upon our levels of expertise in dealing with born digital archival material, which are currently somewhat immature. It is also based upon the access we currently have to tools, hardware and software to work with digital material. All of this will change over time.

The approach is designed to strike a balance between turning digital preservation activity into a wholly technical exercise in which bit stream preservation is the sole aim, and a more aesthetic archival approach in which the aim is to provide access to meaningful material that can form coherent archival collections. Other factors such as intellectual content, significance of the material and/or the donor/creator and any relationship to material already in the Library also play a part.

With no hard and fast 'rules' for format selection, archivists are asked to make a professional judgement about material they are offered based on both technical and archival principles. Whilst it may be imperfect, this subjective approach represents the best balance we can strike between practicable and realistic preservation and the aims of broad collection development.

The Library can only accept digital material for long-term preservation if it retains confidence in its ability to provide meaningful access to that material for the long term. This principle applies whether material is in a current or obsolete form. What has been set down in this article is a framework within which a level of 'confidence' can be tested and applied whilst allowing the Library to collect a wide range of material in a range of formats.

Annex: Simple Scenarios

Scenario 1: An organisation whose records already form part of the Library's collections offers to transfer all material to the Library in digital form instead of paper. The organisation uses a current version of Microsoft Office, but publishes its monthly newsletter as a PDF. The organisation proposes to transfer material to the Library via USB hard drive.

Approach:

The Library accepts the material given that content is already held from that organisation, though some appraisal may take place

Transfer on USB hard drive is acceptable as these devices are reliable and compatible with current Library hardware

MS Office is in current, widespread and common use, so formats are acceptable

MS Office files can be properly identified and retained in MS Office formats for long-term preservation

A qualification is placed on the PDF newsletters: they are not in a preservation format and cannot easily be migrated into a more acceptable format. They may be accepted and held until such time as the PDF version becomes obsolete. At that time, if a suitable migration approach cannot be determined, they may be held as a bitstream only

The Library has HIGH confidence in its ability to preserve the Microsoft Office documents, but only MEDIUM or LOW confidence in its ability to preserve the material in PDF format. Material would not be converted to PDF/A as the process for authentic conversion to PDF/A and long-term preservation of this format is currently beyond the capabilities of the Library.

Scenario 2: The child of a scientist offers material to the Library after the parent's death. The offer comprises ten 5¼-inch floppy disks. The Library holds no other material by this individual. The disks are thought to contain text or word-processed files created in the late 1980s and early 1990s. There are no metadata about the content of the disks, the context of any content nor exactly what data may be on the disks.

Approach:

The Library rejects the material based on the lack of evidence of the content of the disks and there being no other material in the Library by this individual

Having no access to hardware that can read the 5¼-inch floppy disks, the Library cannot assume that the disks contain data or that they can be read

Data formats are likely to be obsolete

Data recovery is likely to be complex, time-consuming, expensive, and with little certainty of success

Information suggests that this material does not add to the Library's body of knowledge

The Library has LOW confidence in its ability to preserve the material on offer.