I'm David Rosenthal, and this is a place to discuss the work I'm doing in Digital Preservation.

Sunday, January 4, 2009

Are format specifications important for preservation?

On the Digital Curation Centre Associates mail list, Steven Rankin pointed to the release of Microsoft's specifications for the Office formats under their Open Specification Promise. This sparked a discussion in which two topics were confused: the suitability of Microsoft Office formats for preservation, and the value of the specifications for preservation. As regards the first, I believe that "it became necessary to change the content in order to preserve it" is a very bad idea; we should preserve what's out there without adding cost and losing information by preemptively migrating to a format we believe (normally without evidence) is less doomed. I'm a skeptic about the second; I don't think preserving the specifications contributes anything to practical digital preservation, as I explain below the fold.

Nearly a quarter-century ago, James Gosling and I and a small team at Sun cloned Adobe's PostScript language for the NeWS system. Adobe had just published the PostScript language specification in the "Red Book". We started from this book, but we also had an Apple LaserWriter running Adobe's implementation of the language. When we found something obscure or missing in the book, we could run experiments on the LaserWriter to figure out what our implementation was supposed to do. This is close to what Silicon Valley refers to as a "clean-room" implementation, ensuring that the implementors have access only to public information. Since then, others have repeated the process with even greater fidelity.

So I'm someone with actual experience of implementing a renderer for a format from its specification. Based on this, I'm sure that no matter how careful or voluminous the specification is, there will always be things that are missing or obscure. There is no possibility of specifying formats as complex as Microsoft Office's so comprehensively that a clean-room implementation will be perfect. Indeed, there are always minor incompatibilities (sometimes called enhancements, and sometimes called bugs) between different versions of the same product, for example between Office on the PC and Office on the Mac.

Those who argue that depositing format specifications in format registries is essential to, or even useful for, digital preservation seem to have in mind a scenario like this. Some time after a format is obsolete and no renderer for it is any longer available, some poor sucker is assigned to retrieve the specification from the format registry and use it to create a brand-new renderer. How likely is this to happen?

The pre-condition for the preserved format specification to be useful is that there is no renderer for the format. That necessarily implies that there is no open source renderer for the format. Logically, there are six possible explanations for this absence. They are quite revealing:

1. None was ever written because no-one in the Open Source community thought the format worth writing a renderer for. That's likely to mean that the content in the format isn't worth the effort of writing a new renderer from scratch on the basis of the preserved specifications.

2. None was ever written because the owner of the format never released adequate specifications, or used DRM techniques to prevent third-party renderers being written. The preserved specifications are not going to change this.

3. An open source renderer was written, but didn't work well enough because the released specifications weren't adequate, or because DRM techniques could not be sufficiently evaded or broken. The preserved specifications are not going to change that.

4. An open source renderer was written but didn't work well enough because the open source community lacked programmers good enough to do the job given the specifications and access to working renderers. It is possible that the (much smaller) digital preservation community would be able to recruit programmers who were better enough to handle the task without access to a working renderer, but it isn't likely.

5. An open source renderer was written but in the interim was lost. I argue below that open source is far better preserved than the content we are talking about. If open source code is being lost we're unlikely to be able to preserve the content, or even the format specifications.

6. An adequate open source renderer was written, but in the interim stopped working. I have argued elsewhere that the structure of open source makes this unlikely, and history supports this. But even if it did, the cure is not to throw away a once-working renderer and create a new one afresh from the format specification; it would be a far easier task to fix the reason the renderer stopped working. The preserved format specifications are useless for this purpose. What is needed is information about the changes to the operating system that stopped the renderer working. For an open source operating system, this is available from the source code control system, which is also incidentally capable of reconstructing the operating system as it was in the days when the renderer worked.

This analysis doesn't look encouraging for the proponents of preserving the specifications. But let's blithely ignore these problems and press on with the assumption that somehow a poor sucker has to create a renderer from the specification. How realistic is this task?

First, we actually know how much work it is to do a clean-room implementation of Microsoft Office's formats. Several open source products have done a credible job of doing so, including Open Office. In the nature of open source development, successive products are able to build on the work done by others, so the total amount of work is greater than any individual effort committed, although less than the total of all efforts. The history of Open Office reveals a very large investment; it was originally developed as a commercial product, and its development continues to be subsidized by Sun Microsystems as a basis for a commercial product. To achieve its current functionality has taken a significant, salaried team more than a decade. It is not credible to expect that this level of effort could be justified by digital preservation activities alone.

Second, the task envisaged is actually far more difficult than a simple clean-room implementation of the format. The whole justification for the task is that there is no functional renderer for the format available. Thus there is no way for the poor sucker to test his interpretation of the specifications against the original. The effort needed to achieve a fidelity of rendering equivalent to Open Office's would therefore be much greater than was required by the Open Office team, who could test their interpretations against Microsoft's code.

Third, the digital preservation world often complains that even Open Office's level of fidelity is inadequate. Many of these criticisms are beside the point; they refer to inaccuracies in Open Office's rendering of the latest Microsoft Office formats. But from the perspective of digital preservation, the relevant criterion is Open Office's rendering of old, in fact obsolete, formats. After all, the precondition for the task of creating a clean-room renderer is that the formats are so obsolete that no functional renderer is available. In my, admittedly limited, experience Open Office often does better than the current Microsoft Office at rendering really old documents. And note that the most recent case of Microsoft Office format obsolescence was caused by Microsoft's deliberate decision to remove support for old formats. This was so unpopular that it was rapidly rescinded. No-one is arguing for Open Office to remove support for old formats, and it appears that even Microsoft's ability to do so has expired.

Many of the criticisms of Open Office's fidelity in rendering Microsoft Office documents relate to layout changes between the two renderings. These are beside the point for another reason. The changes are typically caused by small differences between the fonts available in Microsoft Office and in Open Office. They exist not because Open Office incorrectly interprets the Office document format, nor because the Open Office developers were incompetent. They would plague the poor sucker's renderer just as much. Fonts, and in particular the font spacing tables that drive the layout process, are protected by copyright. If the Open Office developers had copied the font spacing tables so exactly that there were no layout changes they may well have been breaking the law.

Just because a document format has gone obsolete does not mean that the fonts used by documents encoded in that format have gone out of copyright. The poor sucker is likely to face even worse intellectual property hurdles than the Open Office developers did. He will probably be faced with the orphan font problem; wanting to get permission to use a copyright font but being unable to find the copyright owner to ask for it. The need to preserve the fonts used by a document as well as the text motivates the ability of PDF to embed the fonts it uses into the document itself.

Fourth, there is behind this discussion an unrealistically black-and-white view of the world. Renderers are software. They all have flaws. Some are better than others, but none is perfect. If we plot the quality achieved by a newly created renderer for a format against the cost of creating it we will get an S curve. A certain amount of money is needed to get to a barely functional renderer. Beyond that, quality increases rapidly at first but after a while the law of diminishing returns sets in. Getting from 99% to 99.9% is very expensive; the cost of getting to 100% is infinite. Emulation of the entire original hardware and software environment is the only way to guarantee 100% fidelity. Anything else means that preserved content will be rendered with flaws. The only real question is how much to spend to get to how close a rendering.
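To make the shape of that curve concrete, here is a minimal sketch that models fidelity as a logistic function of cost. The logistic form, the function names `quality` and `cost_for`, and the `midpoint` and `steepness` parameters are all assumptions chosen purely for illustration; nothing here is measured from any real renderer project.

```python
import math

def quality(cost, midpoint=5.0, steepness=1.0):
    """Hypothetical logistic model: fraction of fidelity reached for a given cost."""
    return 1.0 / (1.0 + math.exp(-steepness * (cost - midpoint)))

def cost_for(target, midpoint=5.0, steepness=1.0):
    """Invert the logistic: cost needed to reach a target fidelity in (0, 1)."""
    if not 0.0 < target < 1.0:
        # In this model 100% fidelity is an asymptote, so no finite cost reaches it.
        raise ValueError("target must be strictly between 0 and 1")
    return midpoint + math.log(target / (1.0 - target)) / steepness

# Cost units are arbitrary; only the trend matters.
for target in (0.5, 0.9, 0.99, 0.999):
    print(f"{target:.1%} fidelity costs about {cost_for(target):.2f} units")
```

In this toy model every additional "nine" of fidelity costs roughly as much again as the previous one, and the required cost grows without bound as the target approaches 100%. A real project's curve could be steeper or flatter; the point is only the qualitative shape.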

Fifth, we have a way to greatly reduce the cost of getting to a given level of fidelity. As we see with Open Office, creating an open source renderer for a format before it goes obsolete is much less costly than doing so afterwards. This is especially true because the open source community will almost always do this on their own initiative, without needing resources from the digital preservation community. They want to access documents in the format here and now; a much more powerful motivator.

Even better, they will then preserve the resulting renderer far better than most digital preservation systems preserve the content entrusted to them. Open source code is in ASCII, so there is no risk of format obsolescence. Just as Creative Commons licenses do for copyright content, open source licenses permit all the activities needed to preserve the code, without negotiation with the copyright owner. Open source code is already preserved in large, well-funded, independently managed repositories such as SourceForge. Further, open source teams maintain many copies of their work, both in the form of nightly backups of their part of the repository, and in the form of working copies of the code. Finally, just like internet protocols (90K PPT), open source development is so decentralized that flag days or changes that break applications are very difficult and time-consuming, and thus very unlikely.

It seems clear that preserving the specification for a format is unlikely to have any practical impact on the preservation of documents in that format. If, during the currency of the format, it acquires an open source renderer there is no significant risk of ever ending up without a functional renderer. The need for a new one to be created from the specification is extremely unlikely ever to arise. If that unlikely event ever happened, it is hard to believe that resources on the scale needed to do the job would be available. And in the unlikely event that they were, it is unreasonable to believe that the combination of the preserved specification and the available resources would be enough to create a renderer that would satisfy those who reject Open Office because of minor rendering flaws.

Don't let the perfect be the enemy of the good.

Clearly, formats with open source renderers are, for all practical purposes, immune from format obsolescence. Equally, preserving the specifications for formats which lack an open source renderer is likely to be ineffective in assuring future access to content in those formats. Effort should be devoted instead to using the specifications, and the access to a working renderer, to create an open source renderer now. In addition, national libraries should consider collecting and preserving open source repositories such as SourceForge. They are essential to the library's efforts to preserve other important content, such as Web crawls. There are no legal or technical barriers to preservation. And who is to say that the corpus of open source is a less important cultural and historical artifact than, say, romance novels?

8 comments:

Chris Rusbridge discusses this post on the Digital Curation blog. He's reluctant to believe that preserving specifications is pointless, pointing out that since they're very cheap to preserve they don't have to have a big impact to pay back the investment. He makes a useful suggestion:

"So I would like someone to instigate a legacy documents project in Open Office, and implement as many as possible of the important legacy office document file formats. I think that would be a major contribution to long term preservation."

I agree, but I believe that in most cases the specifications for "important legacy office document file formats", such as the example he uses of ancient Mac PowerPoint version 4.0, are not available to be preserved or to contribute to this process. Microsoft's specification release, which started this discussion, applies only to current formats. Volunteers for an effort of this kind would need to use preserved operating system and application binaries, together with reverse-engineering techniques, to create working support.

The reason Chris' 1990s Mac PowerPoint files can't be opened in Open Office is that they date from well before the project had achieved critical mass. Although the project has roots in the 1980s it wasn't open source until 2000.

There will always be considerable differences between the practice of digital archaeology, rescuing data from the pre-history of computing, and digital preservation, preparing current data for its trip into the future and caring for it along the way. Archaeology is a lot more expensive; the data needs to be a lot more valuable to justify the cost.

Thanks for the thought provoking post. I particularly liked the call to libraries and archives to consider archiving opensource software found in code repositories like SourceForge. How could we make that happen?

Just as an aside: while I was reading your post I was reminded of the notion of code-as-documentation. The idea being that the best documentation of what a piece of software is doing is the source code itself ... rather than some ambiguous, natural language description of it. I think that because opensource is, well, open for everyone to see, there is an incentive to make it easy to understand and navigate the source code, so that other people can quickly and easily contribute to the project.

While I agree that having an "open source renderer" for a given format is useful - it does NOT actually help the preservation process UNLESS said renderer is a "complete and fully functional implementation" of the standard (aka a "reference implementation").

For example, there are some EXCELLENT open source PDF renderers (Xpdf, Poppler and Ghostscript being the most popular) - YET NONE OF THEM implements even the complete PDF 1.2 specification, let alone the current ISO 32000 (PDF 1.7) standard. Sure, they have implemented "bits and pieces" of various newer versions of the spec - but not the full specification.

So how does that help ensure that PDF documents preserved today can be properly rendered in the future? It doesn't! But having PDF as an ISO standard, which is available for reference by a future programmer faced with the task is...

The scenario "open source tool exists but doesn't implement a part of the specification that a set of files needs" seems to me to be sufficiently likely as to justify preservation of the specifications.

In response to DrPizza, "open source tool exists but doesn't implement a part of the specification that a set of files needs" is presumably covered by either case 3 or case 4 in the original post.

In response to leonardr, the set of PDF capabilities that is necessary for digital preservation is defined not by the specification, but by the subset of the specified capabilities actually used by the documents to be preserved. The full PDF specification defines many capabilities that are extremely problematic for preservation; that is why a subset (PDF/A) has been defined for preservable PDF. So we see that a "complete and fully functional implementation" of the full specification is not needed for effective digital preservation.

Of course, the necessary set varies through time. We recently saw the first use in an electronic journal of PDF's 3D capabilities. Open source renderers don't currently support these capabilities. There may be some difficulty in doing so since I believe the technology is proprietary and comes from Right Hemisphere, an interesting New Zealand company. Only time will tell whether these capabilities become widely enough used to justify the work of adding support to the open source renderers. The account of the considerable efforts needed to create the 3D figures isn't encouraging in this respect.

The issue isn't whether to use the specification, rather when to use it. It is much easier and more likely to result in future legibility of the documents to use it now.

Preserving a specification which describes a standard format to which the documents to be preserved don't entirely conform in the hope that someone in the future will use it to create from scratch a renderer may be very little effort, and it may give rise to warm and fuzzy feelings, but it isn't nearly as likely to be as effective as buckling down to the work of creating an adequate open source renderer.

As ISO Project Leader & Editor for PDF/A (ISO 19005), I am quite familiar with it...and you are right, it does define a subset of the complete standard that is suitable for long term archiving/preservation of "electronic paper". Yet even there, not a single open source renderer implements the complete PDF/A-1 standard (based on PDF 1.4). And the committee is in the process of completing work on PDF/A-2 which adds support for newer versions (1.7/ISO 32000-1) of PDF.

The 3D support in ISO 32000-1 is based on an open standard called U3D (ECMA 363) - so nothing proprietary there either.

Don't get me wrong - I am a HUGE supporter of open source and have actively contributed (incl. source code!) to EVERY open source PDF project over the last 10 years. I still maintain active membership on the mailing lists of many of the projects. I am simply expressing the fact that since most open source projects are done by volunteers who choose the features that THEY want - that is what you get.

I'd LOVE to see a fully funded/supported open-source PDF/A-1 compliant renderer developed...but it still has NO BEARING on the fact that having the standard (which will most certainly outlive the source code) is more important. And I suspect on that point, we will continue to disagree.

First of all I agree with some of your comments. But my original discussion was simply to ask whether a particular set of specifications for Word was a good set of Structure Representation Information (not knowing a vast amount about the internals of a Word document myself).

Structure Representation Information (Structure RepInfo) is clearly defined in OAIS. Basically it is the information about the bits in a data file that allows you to extract data values (numbers, character strings, etc.) and nothing more. OAIS does define two classifications of RepInfo relating to software. They are Representation Rendering Software and Access Software. The reason it highlights these two types of software is for the exact reasons you are highlighting in your post. The main one is that Structure RepInfo may be, in some cases, difficult to get access to or to formally define, and hence the only way of reliably accessing data values from these file types is via the Access Software. The other is that once you have access to the data values then what do you do with them? For data contained in files like Word, PDF, or Open Document format then typically (but not always) you want to render that information for human consumption, i.e. you need the Representation Rendering Software. As far as I am aware there are no formal mechanisms for describing the rendering of a document but I think there is some research to abstract such information. Perhaps using a common file format such as Open Document does substitute for an abstraction of rendering information for all types of document formats? But can we force everyone to use Open Document format? I think not.

The problem with software is that no one knows how to preserve it. Research into emulation is ongoing, but at the moment it is limited to "let's just write another emulator", which itself is just another piece of software. People are still not thinking about what is the adequate RepInfo required to preserve software. For an emulator I would say that the RepInfo is the knowledge required to rewrite an emulator in the future.

You may then say, "but we are back to some poor sucker in the future writing an implementation of the emulator from the RepInfo". Yes we are, but at least that emulator could then be applied to many different sets of Access and Representation Rendering Software that then can be applied to many different data sets. So from a cost point of view rewriting an emulator is a very cost effective solution because it could potentially apply to so many different data sets. The problem is defining the adequate set of RepInfo that would "guarantee" that someone in the future could implement the emulator. I do not know what the adequate set of RepInfo is for an emulator, but to me it is a very important preservation research topic that should be addressed by someone.

The other type of software preservation solution is simply to keep the source code if available, or to use only formats etc. that have an open source implementation. But keeping is not preserving. In a software source set you have potentially many different file types, C source, Java etc. Each one of these files needs RepInfo to be reused in the future. Some of the reuse cases I see for source code are: recompiling for a new computer platform (an as yet unknown platform); fixing the software to use new or updated libraries; and possibly reusing the code as an information source to migrate the algorithms to new programming languages. If you just keep the files then all you have is some files with a sequence of bits in them which are meaningless on their own. Will everyone always understand C/C++? Probably not. Will everyone be able to understand what the intended functionality of the software was in any detail?

The main conclusion from this is that Structure RepInfo is very easy to define; it is in fact the simplest form of RepInfo. There are now a few good formal languages for defining it in detail; one good example is EAST (ISO 15889). The other types of RepInfo such as software are very hard to preserve - research is still ongoing and it may turn out that it is impossible to preserve. So for the moment I would not rely on just keeping the software.

The other type of RepInfo I have not mentioned is Semantics. Semantics is ignored by most, but is in fact the most important form of RepInfo, and problematically, the most complex to define. For example, if you render a MS spreadsheet, are you guaranteed to know what all the data values mean in the spreadsheet? Do the data values have units, a human readable description of what they represent? Does the spreadsheet contain a table of data? If so, what are the relationships between the columns? Is there more than one table? If so are the tables related, and if so how? Most people only talk about structure and rendering and do not realize the complexity of the semantic relationships that can exist between data values or data objects. I mainly deal with scientific data, and not documents. In scientific data the semantic relationships are usually clear but poorly defined, yet they are absolutely essential for the reuse of the data. I have seen very little study of the semantics of the content of document-type data, but I have always assumed that, like scientific data, it will be just as rich and semantically complex.

Thanks to everyone for the stimulating and useful discussion on this post and its successor. In particular, Steven Rankin's comment deserves a post to itself, which unfortunately will have to wait while I prepare a talk that will draw from these discussions, and catch up with work.