LOCKSS: Format Migration

Obsolescence of Web Formats

The time that would elapse between the introduction of a new format and its obsolescence would likely be short.

The OAIS reference model inherited this analysis, and its implication that expending resources now to prepare for the likely rapid obsolescence of formats was desirable.

The LOCKSS technology was designed from the start specifically to preserve content published on the Web. A Web format becomes obsolete when support for it is removed from browsers. Theoretical analyses of the mechanisms by which this would happen predicted that this would be a rare occurrence, because the incentives for doing so are weak, and the disincentives are strong. Subsequent practical research by Matt Holden of INA into the renderability of Web formats that were predicted to be the most likely to suffer obsolescence, audio-visual formats from the early days of the Web, showed that 15 years later format obsolescence was negligible.

Further, the alternative to format migration is emulation. The argument for format migration has always been that it would be impractical to deliver emulation to end-users. Recent work has demonstrated two viable paths to delivering emulation to readers of the types of web content preserved by LOCKSS and CLOCKSS:

A team from the University of Freiburg presented papers at IDCC2013 and iPRES2013 showing that it was possible to deliver emulation-as-a-cloud service to browsers using only HTML5, with no special plugin. What delivery method could be more convenient than embedding a live emulation in a Web page simply by pasting a link into it?

Building on earlier work by, among others, the University of Oxford, running Javascript emulations of obsolete environments in the reader's browser is now routine:

Thus it is far from clear that, even if Web formats eventually suffer obsolescence, format migration would be necessary. By the time obsolescence might happen, it might well be that delivering a transparent emulation to the reader's browser would be the preferred method.

LOCKSS Strategy for Format Obsolescence

Thus Web archives, such as the CLOCKSS archive, have a different model of when to devote resources to format migration, because:

The probability of a format going obsolete is low.

If a format does go obsolete, it will be a long time after its introduction.

It may well be that emulation, rather than format migration, would be the preferred way to deliver content in an obsolete format, were obsolescence ever to occur.

Those digital preservation systems that perform preemptive bulk format migration do not discard the original, but store both the original and the migrated copy.

Given these observations, the LOCKSS system's strategy for preserving content is:

Store, and maintain the integrity of, the original bits.

Exploit the content negotiation capabilities of the Web (and presumably any successor technology to the Web) to detect when a reader's browser does not support the original format in which the bits are stored.

When the browser does not support the original format, use a format conversion pipeline to generate a temporary access copy of the original in a format suitable for the reader's browser.

Discard this access copy when it is no longer needed.
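The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the LOCKSS implementation: the function names (`parse_accept`, `acceptable`, `serve`), the converter registry and its layout are all hypothetical, and the only Web mechanism relied on is the HTTP `Accept` request header.

```python
from typing import Callable, Dict, Tuple

# Hypothetical converter registry: (stored MIME type, target MIME type) -> function.
Converter = Callable[[bytes], bytes]


def parse_accept(header: str) -> Dict[str, float]:
    """Map each media range in an HTTP Accept header to its q-value."""
    ranges: Dict[str, float] = {}
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_range = fields[0].lower()
        if not media_range:
            continue
        q = 1.0  # a range without an explicit q parameter has weight 1
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        ranges[media_range] = q
    return ranges


def acceptable(mime: str, ranges: Dict[str, float]) -> bool:
    """True if the Accept ranges allow this MIME type (most specific match wins)."""
    if not ranges:
        return True  # no Accept information: assume the browser takes anything
    mime = mime.lower()
    major = mime.split("/")[0]
    for candidate in (mime, major + "/*", "*/*"):
        if candidate in ranges:
            return ranges[candidate] > 0.0
    return False


def serve(original: bytes, stored_mime: str, accept_header: str,
          converters: Dict[Tuple[str, str], Converter]) -> Tuple[bytes, str]:
    """Return (body, MIME type) for one request; the stored original is never modified."""
    ranges = parse_accept(accept_header)
    if acceptable(stored_mime, ranges):
        return original, stored_mime
    for (src, dst), convert in converters.items():
        if src == stored_mime and acceptable(dst, ranges):
            # Temporary access copy: built per request and never written back.
            return convert(original), dst
    return original, stored_mime  # no usable converter: fall back to the original
```

Because the access copy exists only as the return value of `serve`, discarding it requires no extra step. Note that a browser whose `Accept` header contains `*/*` matches every format, so in this sketch conversion is triggered only when the browser lists its formats explicitly or excludes one with `q=0`.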

A framework to support this strategy was implemented in the LOCKSS software and demonstrated in 2005. To avoid wasting resources implementing capabilities which have no realistic prospect of being needed in the foreseeable future, work in this area is on hold. When there is evidence that some format of content under preservation is facing obsolescence, a decision will be taken as to whether a production version of this migration strategy is the appropriate path to take, or whether (for example) an in-browser emulation strategy would be more effective.

This strategy has a number of significant advantages:

It uses the minimum amount of storage.

It does not waste resources migrating content which is unlikely to be accessed and, if ever accessed, is unlikely to have suffered format obsolescence.

It performs any format migration that is actually necessary as late as possible, when the technology for performing it is likely to be better.

It expends resources as late as possible, exploiting the time value of money to the maximum extent.

It does not commit to format migration, which may not be the appropriate strategy at the time the reader requests access.
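The cost argument in the list above can be made concrete with a small calculation. The sketch below assumes a simple compound-discount model; the function names and every figure in it (cost, discount rate, delay, probability that migration is ever needed) are hypothetical, chosen only to show the shape of the comparison.

```python
def present_value(cost: float, annual_rate: float, years: float) -> float:
    """Discount a cost paid `years` from now back to today's money."""
    return cost / (1.0 + annual_rate) ** years


def expected_deferred_cost(cost: float, annual_rate: float, years: float,
                           probability_needed: float) -> float:
    """Expected present cost of migrating only if and when it turns out to be needed."""
    return probability_needed * present_value(cost, annual_rate, years)


if __name__ == "__main__":
    # All figures below are hypothetical, chosen only to illustrate the comparison.
    migrate_now = 100.0
    migrate_later = expected_deferred_cost(100.0, annual_rate=0.03, years=20,
                                           probability_needed=0.1)
    print(f"pre-emptive: {migrate_now:.2f}, on-access (expected): {migrate_later:.2f}")
```

Under this model, deferring the work lowers its expected present cost in two ways: the discount factor shrinks a cost that is paid later, and the probability term removes the cost of migrations that never prove necessary.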

Format Migration in the CLOCKSS Archive

The CLOCKSS archive is a dark archive. No readers (Consumers in the OAIS terminology) ever "interact with [CLOCKSS] services to find preserved information of interest and to access that information in detail". If content is ever triggered from the archive, readers access it from one of a number of re-publishing systems. Dissemination of triggered content is a transaction between the archive and one or more of these re-publishing systems which involves construction of a Dissemination Information Package (DIP) and its transmission to the re-publishing system(s). If a subsequent reader's browser is unable to render the format in which the digital object was represented in the DIP, and is thus stored in the re-publishing system the reader is accessing, the technique described above can be applied.

Thus, if a format in which a digital object is stored in the archive is known at the time of a trigger event to be obsolete (in that the vast majority of browsers in general use are unable to render it) the technique described above can be applied in the process of generating the DIP by emulating a browser that cannot render the format in question. In this case the format in which the digital object is stored in the re-publishing system will be different from that in the archive, the result of a format migration of the original. The original continues to be preserved in its original format in the archive. What is stored in the re-publishing system is the temporary access copy.

Availability of Format Converters

Any strategy for format migration, not just the one taken by the LOCKSS software, depends upon the timely availability of converters capable of transforming the doomed format into a less doomed one. As regards the Web formats preserved by LOCKSS networks and the CLOCKSS archive, the sunk investment in (and thus value of) existing Web content in format A means that format A will not be rendered obsolete by format B (i.e. support for rendering format A will not be removed from, and support for format B added to, browsers in common use) unless and until there is a suitable converter from format A to format B. Thus the risk of a format going obsolete with no suitable converter is low.

Further, the Web content of LOCKSS networks and the CLOCKSS archive can be satisfactorily rendered by a completely open source stack. Thus there are open source renderers for the content, which:

Note that, since in the LOCKSS approach format migration takes place at access time rather than at some earlier pre-emptive migration time, any criticism of the approach on the basis that converters might not be available applies a fortiori to the pre-emptive approach.

Again, to avoid wasting resources implementing capabilities which have no realistic prospect of being needed in the foreseeable future, work in this area, such as integration with a registry of format converters, is on hold.