I'm David Rosenthal, and this is a place to discuss the work I'm doing in Digital Preservation.

Wednesday, November 24, 2010

The Half-Life of Digital Formats

I've argued for some time that there are no longer any plausible scenarios by which a format will ever go obsolete if it has been in wide use since the advent of the Web in 1995. In that time no-one has shown me a convincing counter-example: a format in wide use since 1995 in which content is no longer practically accessible. I accept that many formats from before 1995 need software archeology, and that there are special cases, such as games and other content protected by DRM, which pose primarily legal rather than technical problems. Here are a few updates on the quest for a counter-example:

Never is a very long time. Black vs. white arguments of the kind that pit "never happens" against "the sky is falling" may be interesting but there are also insights to be gained from looking in the middle. Below the fold are some thoughts on what a middle ground argument might tell us.

Suppose, for the purpose of making the discussion easy, that formats have lifetimes that are randomly distributed about a mean lifetime. This mean lifetime would represent the half-life of digital formats; I applied the same concept to the half-life of bits in this post. The half-life of digital formats is an important number for digital preservation, worthy of serious research. In the meantime, can we estimate what this half-life of formats is?

The fact that we have not observed an instance of format obsolescence in the last 15 years allows us to make a rough estimate. Assume that there are 100 widely used formats (probably an underestimate). Then if the half-life were 100 years we should have seen 7 formats go obsolete. We saw none, so if we accept these assumptions we should be fairly confident that the half-life of a widely used digital format is more than 100 years.
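As a rough check on this estimate, here is a minimal sketch of the arithmetic. It is not a definitive model: the function names are hypothetical, the inputs are the assumptions stated above (100 formats, a 100-year half-life, 15 years of observation), and the two readings of "half-life" (a constant loss rate versus radioactive-style exponential decay) are my own assumptions. The figure of 7 appears to correspond to the constant-rate reading; the exponential reading gives a somewhat larger number, and in either case seeing no obsolescence at all would be very surprising.

```python
# Inputs taken from the paragraph above; all are assumptions, not measurements.
N_FORMATS = 100      # widely used formats
HALF_LIFE = 100.0    # years
OBSERVED = 15.0      # years in which no obsolescence was observed


def fraction_obsolete_linear(years, half_life):
    """Assume half of the formats are lost per half-life, at a constant rate."""
    return 0.5 * years / half_life


def fraction_obsolete_exponential(years, half_life):
    """Assume decay-style survival: the surviving fraction is 2 ** (-t / H)."""
    return 1.0 - 2.0 ** (-years / half_life)


def prob_none_obsolete(n, years, half_life):
    """Chance that all n independent formats survive, under exponential decay."""
    return (2.0 ** (-years / half_life)) ** n


expected_linear = N_FORMATS * fraction_obsolete_linear(OBSERVED, HALF_LIFE)
expected_exponential = N_FORMATS * fraction_obsolete_exponential(OBSERVED, HALF_LIFE)
p_none = prob_none_obsolete(N_FORMATS, OBSERVED, HALF_LIFE)

print(f"expected obsolete (linear):      {expected_linear:.1f}")
print(f"expected obsolete (exponential): {expected_exponential:.1f}")
print(f"probability of seeing none:      {p_none:.1e}")
```

Under the exponential reading, the probability of seeing no obsolescence among 100 independent formats works out to 2 to the power of -15, which is why zero observations argue for a half-life well above 100 years.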

In facing the prospect that a format will eventually go obsolete, we have a choice between spending money now preparing for it, or waiting until the format goes obsolete and spending money recovering. The longer the half-life, the better waiting looks. Let's look at the extreme positions first:

By spending $X now, we ensure that the cost when the format goes obsolete is $0.

By spending $0 now, we ensure that the cost when the format goes obsolete is $Y.

We could invest the $X now at an interest rate of I in order to fund the eventual obsolescence; if the principal plus H years (the half-life of formats) of compound interest is more than $Y we should take the second option.

If the half-life is 100 years, and the long-term real interest rate is 3%, and the eventual cost (in current dollars) of dealing with obsolescence is $1, this means that the most we can afford to spend right now preparing for it is about $0.05. And we can only afford to spend that much now if we are sure that by doing so we guarantee that the eventual cost of obsolescence is $0.
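The comparison between the two options can be sketched in a few lines. This is only an illustration under the assumptions in the text (a 3% real interest rate compounded annually, a 100-year half-life, and a $1 eventual cost); the function names `future_value`, `present_value` and `should_wait` are hypothetical.

```python
def future_value(amount, rate, years):
    """Value of `amount` after `years` of annual compound interest at `rate`."""
    return amount * (1.0 + rate) ** years


def present_value(amount, rate, years):
    """Amount that must be invested now to have `amount` after `years`."""
    return amount / (1.0 + rate) ** years


def should_wait(prepare_cost_now, recovery_cost_later, rate, years):
    """True if investing the preparation money would more than pay for recovery."""
    return future_value(prepare_cost_now, rate, years) > recovery_cost_later


RATE = 0.03        # assumed long-term real interest rate
HALF_LIFE = 100    # assumed half-life of formats, in years

# The most we can afford to spend now to avoid a $1 cost at obsolescence.
max_spend_now = present_value(1.00, RATE, HALF_LIFE)
print(f"maximum justified spend now: ${max_spend_now:.3f}")

# Spending 10 cents now is too much; spending 1 cent now is not.
print(should_wait(0.10, 1.00, RATE, HALF_LIFE))
print(should_wait(0.01, 1.00, RATE, HALF_LIFE))
```

The break-even point is simply the $1 eventual cost discounted over the half-life, which is where the figure of about $0.05 comes from.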

We have seen no obsolescence in the past 15 years. Assume that half of all formats will go obsolete in the next 15 years; even the proponents of format migration would admit that this is extraordinarily unlikely. Even in this extreme scenario, it is better to spend $1.56 when the format goes obsolete than to spend $1 now. And, if you do spend $1 now, for every cent you end up spending when obsolescence happens, the $1.56 is increased by one cent.

Another way of looking at this is to realize that we can never be sure that whatever we do now to prepare for format obsolescence will work. Suppose there is a 20% chance that spending $1 now doesn't help, and we end up spending the $1.56 as well. We would be better off spending nothing now and spending $1.87 when obsolescence happens.
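The two figures for the 15-year scenario can be reproduced with a short calculation. This is a sketch under the stated assumptions (3% real interest compounded annually, obsolescence after 15 years, and a 20% chance that the preparation fails); the variable names are hypothetical, and I am assuming that when preparation fails the full $1.56 is spent again.

```python
RATE = 0.03          # assumed long-term real interest rate
YEARS = 15           # assumed time until obsolescence
FAILURE_PROB = 0.20  # assumed chance that preparing now does not help

# What $1 invested now would be worth when obsolescence happens.
growth = (1.0 + RATE) ** YEARS

# If preparation is certain to work, $1 now is equivalent to this much later.
break_even_certain = 1.00 * growth

# If preparation can fail, add the expected cost of paying for recovery as well.
break_even_risky = break_even_certain + FAILURE_PROB * break_even_certain

print(f"break-even if preparation always works: ${break_even_certain:.2f}")
print(f"break-even with a 20% failure chance:   ${break_even_risky:.2f}")
```

The second number is just the first multiplied by 1.2: the less certain the preparation is to work, the more the deferred option can cost and still come out ahead.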

What this analysis shows is that, even in exceptionally pessimistic scenarios, to justify spending $1 now on preparing for format obsolescence we have to be sure that doing so is more effective than spending about double that when obsolescence happens. In scenarios that conform more closely to what we observe in the real world, we would have to be sure that spending $1 now is more effective than spending about $20 when it is needed.

10 comments:

David, whilst I agree with your financial analysis, and the consequences for preservation planning, I don't think it's accurate to say that no formats in wide use since 1995 have suffered problems.

My counter-examples are MS Word 95 and its Powerpoint relative. Although many documents are accessible in later versions of these applications without problems, some features cause significant layout problems. Frames were implemented with completely different semantics, so far as I can tell, in Word 97, and documents which used them in earlier versions could be rendered unreadable on screen or paper as a result, with frame content on top of other text. My team at the time had a large collection of technical documents based on a single template that used this feature, and it required a lot of manual editing to deal with.

The format is not obsolete in that many later applications claim to support it; but problems can arise in the migration that are difficult to predict and currently require manual intervention. (I'm sure an automated fix must be possible but I've not encountered it.) The CEDARS project encountered comparable problems with a widely-used Macintosh drawing programme as I recall - notional between-version compatibility that wasn't perfect, and resulted in complete garbage after 3 migration cycles.

That doesn't undermine your basic argument; I agree we tend to get fair warning of obsolescence of widely-used formats and we can deal with the situation as it arises. But to claim we've not had to deal with the situation yet is minimising the problem. We have had to deal with it, and it usually hasn't been as expensive as it might be - as long as action was taken sufficiently early.

All that is true only if you have no storage medium for really long-term archiving. I, however, have one: optical discs made of glass. The digital information is etched directly into the glass. The French Académie des Sciences and Académie des Technologies certified it as the only digital storage medium with unlimited longevity. Our discs have been on Mars with the NASA Phoenix expedition since 2007, and since May 2010 they have been on the way to Venus with the Japanese solar photon sail IKAROS.

http://sdg-master.com/

If you visit this site, you can see how many problems of long-term archiving will be solved, along with some nice photos and videos about our technology, which has been officially submitted to the Vice President of the European Commission responsible for the Digital Agenda for Europe.

On the site you can find contact addresses and numbers; don't hesitate to call us and ask questions.

Word and PowerPoint 95 are commonly suggested as counter-examples. I have some questions about them.

First, I note that Kevin does not actually suggest these are a counter-example. I didn't say "have suffered problems"; I said "no longer practically accessible". Is the standard for digital preservation a level of compatibility that the original environment (Microsoft's Office suite) was incapable of providing? Migration to this standard would be economically infeasible; only emulation of the earlier software could meet it.

Second, I note that Kevin's team had to hand-edit, presumably in Word 97, to fix the problems. Is anyone suggesting hand-editing individual documents as a migration technology? It isn't going to be economically viable for "widely used formats".

Third, if some time in the past you dealt with these issues by automatically migrating the format, what tools did you use to do the migration? Why are those tools no longer available?

Johann, as regards the use of magically reliable media for long-term preservation, I suggest you read my CACM paper. It explains that (a) media reliability is only a small part of the problem of long-term preservation, so that even using extraordinarily reliable media does not solve the problem, and (b) that the reliability needed from the system as a whole is so high that you, as a vendor of a solution, have no feasible way to prove that your system is reliable enough to meet it.

Kevin puts forward two cases. The Windows one required hand-editing individual documents so cannot be said to be an affordable migration. The Mac migration didn't work.

So presumably the cases in which:

"We have had to deal with [format migration], and it usually hasn't been as expensive as it might be - as long as action was taken sufficiently early."

are not these cases but some other formats widely used since 1995. So which formats became "no longer practically accessible" and were cheaply and efficiently migrated to a less doomed format? These would be counter-examples, and I would be interested in the details.

While I was in library school in the early oughts, there was a changeover in Quark XPress versions, such that the files given me for a project would not open in the current version of Quark. I had to tap a friend of mine who had an older version to save the files out in something I could open.

The Penta typesetting system went belly-up in the early oughts as well. Penta was a high-end system, often used for scholarly and scientific typesetting because of the beauty of its math typesetting. Penta isn't necessarily too horrible, because I believe its native format was text-based, but I'm pretty sure that post-'95 versions of Quark present problems, and I wouldn't be at all surprised about Adobe (PageMaker and its successors) as well.

I generally agree with this but was wondering where you draw the line of obsolescence: there are many single-vendor formats which are increasingly unsupported by modern toolchains (e.g. the Real and other legacy streaming formats, classic Windows media, older legacy video codecs like Indeo). A preservationist with time and budget can probably handle all of them, but they use lossy compression. Is transcoding loss considered acceptable? Is transcoding to a [large] lossless file, which could then be compressed with a standard format for easier playback, acceptable? The latter case has technical drawbacks but definitely seems to address most concerns, albeit with a high likelihood of requiring a convoluted toolchain the longer conversion is delayed.

The other question I'd have involves patents & licensing: does the definition of obsolete include open-source software which handles a format but is patent-encumbered in your country?

I think I failed to communicate my argument to Dorothea and theno23, so let me restate it.

Formats that are widely used in the web context become published, and stop changing in incompatible ways because they are in effect network protocols. Network protocols do not change in incompatible ways because doing so breaks the network. "No flag day on the Internet" is the slogan that represents this truth.

The raw camera formats and internal formats of typesetting systems offered as examples are not widely used in this sense. They are like pre-1995 word processor formats, the private property of specific applications. At any given time there will be "not widely used" formats of this kind, typically in fields that have yet to feel the impact of the Web.

A current example, raised by Mackenzie Smith in questions at my CNI Plenary, is CAD systems. Eventually, market forces drive use of published formats; in the meantime this is a market opportunity for companies such as Right Hemisphere.

I am not saying, therefore, that there are no formats with migration problems, but that content in these "not widely used" formats represents a tiny fraction of the content that needs to be preserved. However, as we see in comments here and on my other posts in this area, this tiny fraction consumes a totally disproportionate amount of attention and resources. This is all the more true in that we have a viable, if less convenient, alternative to format migration in these cases, namely emulation.

Treating the vast bulk of preserved content as if it were "not widely used" raises the cost per byte of preservation to a level that makes it impossible to preserve large amounts of important published content. The loss to the future of this huge amount of easily preserved content is vastly greater than the loss of the small proportion of the content in difficult formats that even the enthusiasts for format migration can manage to rescue.

Chris Adams raises the issue of formats that are encumbered by patents. This has often been raised in the context of preservation, but I only now realized that the argument of this post shows that patents are unlikely to be a significant concern.

Unlike copyright, which is now effectively eternal, patents expire. In most cases now patents last for 20 years from filing. The argument above suggests that the lifetime of widely used formats is typically at least this long. Thus, although being encumbered by patents may be a disadvantage for a format in many ways, it isn't likely to be a problem for preservation because by the time the format becomes obsolescent the patent will have expired.