That Which Survives

Recently in Tulsa, Oklahoma, a buried time capsule from 1957 was unearthed. Its contents largely consisted of a car, a 1957 Plymouth Belvedere. The creators of the capsule thoughtfully provided a separate can of petrol, assuming that the nuclear powered flying cars of 2007 would long since have moved beyond the need to burn fossil fuels to get around. In literature buried with the car it was billed as becoming “a priceless antique in 2,007!”

Increasingly all our records are moving into electronic format. The rapid demise of photographic film is a good illustration of how quickly this can happen. Almost no one buys simple film cameras any more; nearly all cameras sold are digital. These records are now stored on memory sticks, or downloaded onto personal computers and stored on extremely fragile hard drives, usually without any form of backup. Even as a computer professional, I'm as lazy about backups as everyone else. One unfortunate accident and most of my precious memories would be irretrievably gone. You might hope that the professionals would do a better job. You'd be wrong.

Recently NASA discovered that the original “slow-scan” tapes containing the images from the very first Apollo 11 Moonwalk were missing. The footage that everyone has seen is actually a copy of these tapes, created by pointing a 1969 television camera at the higher resolution video. In June 2007, as I write this, the original tapes are still missing. According to NASA there are 2612 boxes containing tapes that may contain the original data but there are 13,000 additional tapes that are missing. I'm reminded of the government warehouse shown at the end of “Raiders of the Lost Ark”. I'm sure that NASA has “top men” assigned to the job of finding them.

Even if the tapes are found, I wonder how many 1969-era tape playback machines will be in working order so the data can actually be read? How brittle will the magnetic tape have become? This was only forty or so years ago. What will the problem be like in one hundred years, or five hundred, or a thousand? The tapes of the first moon landing are a historical treasure, more important than any records of human travel to anywhere on the Earth. They are a record of the first steps our species made to leave our initial birthplace. If we can't look after these, what hope do we have for less important records? With written records at least we can still read them after many thousands of years, so long as we still understand the language.

I'm not a Luddite, believing that we need to stop migrating our data and records online. The benefits of doing this far outweigh any possible disadvantages. Digitizing all books and museum records, for example, will make records available to scholars who have an Internet connection and who might never have had the chance to visit the physical objects. Being able to search through them all is pretty nifty too. Unexpected but important new discoveries like new meteorite impact craters have been made now that global satellite imagery is freely available to anyone on the Internet. Who knows what people will find as more and more human knowledge moves into the public space from its current state of needing physical access? I don't want to give this up. But it would be good to have some thought given to preserving the ability to access the important historical records of our time.

I think proprietary record formats will present a problem for historians. Perhaps not in the short term, but certainly in the medium to long term (and remember I'm talking about hundreds if not thousands of years now). Imagine that some historian in 500 years' time discovers Vice President Cheney's “undisclosed location” and finds his secret laptop computer. “Finally,” the historian thinks, “we will know who advised this administration about energy policy!” as he swims back to the surface of the ocean above the Washington Monument. Unfortunately it turns out the data was written in the “Word-mangler for Windows 2002” format, for which no specifications were ever published, and which was deliberately designed to be difficult for the competition to read.

Joking aside, proprietary record formats will increase the difficulty of preserving our culture, on top of the problems with obsolete hardware interfaces and the decay of storage media we think of as permanent. File data formats that are not published standards are just asking for trouble for long term data storage. Much though I prefer the OpenOffice “Open Document Format” (ODF) data format for documents, the Microsoft “Office Open XML” (OO-XML) format is also a documented format (although without any other implementations as yet) so it shouldn't cause problems for long term storage. However, most of the world's documents in both governments and corporations are still in undocumented proprietary formats, and it sometimes turns out that the documents that people don't think are worth preserving are the ones historians are most excited to find.

I'm not too worried about the “doomsday scenarios” of a post-industrial, non-electrical civilization having lost all scientific knowledge by being unable to read our word-processor documents. I somehow suspect that a society in that state has more to worry about than whether they can access old corporate financial records. No, the thing that worries me is someone suddenly wanting access to an old video recording of the fall of the Berlin Wall and finding that there are only copies of copies of copies, the original “truth” of the incident having been lost centuries ago. Or even worse, being unable to determine which version of the video shows the real event.

In a book by my favorite cartoonist, “2024” by Ted Rall, a modern re-telling of Orwell's 1984, the hero idly edits the records of the Nobel Peace Prize winners and adds an obscure punk rock singer to the list. By the time the media picks up and re-broadcasts his changes he's forgotten he even did it and blindly accepts the new history along with everyone else.

Once an event has gone from living memory, and the only records are electronic and mutable, how can we ever know the truth of our past? Maybe a fragile DVD or other physical record that can be dated to an accurate time near a historical event will become more valuable than any electronic data, or might serve as the ultimate arbiter of the alternate digital versions already available. Maybe the curse we will leave to future generations is so many versions of the same data it will be impossible to tell which one, if any, was the original truth. Write back and let me know what you think.

Finally, I can't help apologizing for criticizing NASA above by pointing out that they are responsible for the only really long term data storage human beings have ever created. On the Pioneer and Voyager space probes are plaques containing data designed to be read by extra-terrestrial intelligences, showing the position of our solar system. The Voyager spacecraft also carry a disk containing audio and video from the planet Earth. A billion years from now, well outside the solar system, it is hoped these messages will still be readable by any intelligence able to retrieve them. Now that's a time capsule worth celebrating!

Comments

The key to long-term data preservation is to copy all your data onto the latest medium every few years. I have entire message forums from my 1985 BBS (originally on a TRS-80) that are now preserved as readable text files on modern storage disks.

The other key is to have multiple copies in different locations on different media. A portable hard drive in your closet, a DVD stashed in your desk drawer at work, etc.
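
One way to check that copies kept on different media still match is to compare checksums. The following is only a minimal sketch: the function names `sha256_of` and `copies_agree` are made up for illustration, and the choice of SHA-256 is an assumption, not something the comment above prescribes.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_agree(paths):
    """Return True when every listed copy has the same digest."""
    digests = {sha256_of(path) for path in paths}
    return len(digests) == 1
```

Running `copies_agree` over the closet drive, the desk-drawer DVD and the working copy would flag a silently corrupted copy, although it cannot say which copy is the good one; with three or more copies, the digest held by the majority is a reasonable guess.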

Technology is on our side - each generation of media holds so much more than the past. A shoebox full of old floppy disks becomes one CD. A case of CDs becomes a few DVDs. A pile of DVDs fits on a portable hard drive smaller than a deck of cards, etc.

Proprietary formats will be an issue, but I'm willing to bet that reverse-engineering an old Microsoft Word document will be easier for future computer scientists than it sounds to us now.

Re: "Copy it" comments. The National Archives is already doing the copying.
They have had a copy-to-new-format program for some years. The problem still
is the proprietary format. There are also groups that keep alive old systems just
for the task of reading this old stuff. Fortunately, the users of these old floppies
still used filing cabinets.

Remember Wang VS? It was the "standard" w/p for the State Dept (among
others who had a need to preserve documents). Wang's customers were
in deep trouble when they folded. And Word has not made things better.
Jeremy is right. The real need for non-proprietary formats is for long term
preservation. That "nice to have" became mandatory once we stopped
printing the documents out and stuffing them in filing cabinets.

Re: reverse engineering those documents. It's not that easy. The ones who
do that work have the benefit of still functioning proprietary apps to see what
the document is supposed to look like. If our future historian can't boot
Cheney's laptop, it will be very hard work indeed. And no, M$'s 'XML'
standard is nowhere near what is needed. Just putting meta-tags around
binary hairballs does not make it open. ODF may not be the best design
on the planet but it is completely defined, continues to be refined in a
public process, and has multiple implementations that interoperate. It
is the only 'standard' in this space that can say that. Especially for any
legally binding document or government business document, this
standards process (open spec + multiple interoperating implementations)
should be mandatory. If it is not readable by _anybody_ 60 years from
now, it does not exist and is not binding.

Hmm. If the Patriot Act and its supporting docs are in Word 2002, does
that mean it is no longer a law once Word 20xx no longer supports the
format? Hmmmm. Government reform by archival amnesia. The
possibilities abound.

I read a while ago about a technology for digital data preservation that might be of interest. It's called LOCKSS, which stands for "Lots Of Copies Keeps Stuff Safe" :-) (lockss.org)
I'm not sure if they address the proprietary format question, but as you can tell by the name, the idea is at least to have distributed copies of the information, Freenet-style. This way the historical information at least does not disappear for political reasons or because of hardware failure. I like the idea :-)
cheers,
webmat

I suspect that the society of the future will be led to believe that a certain rendition of the preserved record is the truth... exactly the same as you and I do now. Certainly there will be scholars who dig up quirky deviations from the accepted canon, but they will be shunned by 'serious' scholars and not have a significant audience.

As for proprietary formats, they are not much of a problem. No, they won't be easy for just anyone to read by glancing at them. But with advanced entropy analysis techniques, key 'rosetta stone' copies that correspond to the same material in different formats, study of cultural norms (even things like typefaces, when boldface is used, etc), and probably a hundred aspects of document analysis and philology that I don't know and a thousand that haven't been invented yet, recovering the data from proprietary digital formats will be very possible.
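
The comment does not say what "entropy analysis" would involve, so the following is only one guess at a first step: measuring the Shannon entropy of a block of bytes. Regions near 8 bits per byte tend to be compressed or encrypted, while plain text sits much lower. The function name `shannon_entropy` is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of data in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    # Sum -p * log2(p) over the byte values that actually occur.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

Sliding this over an unknown file in fixed-size windows would give a rough map of where readable text ends and opaque binary structure begins, which is a starting point and nothing more.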

A while back I listened to an audio lecture about archaeology, and it put a lot of things into perspective with regard to this kind of issue. In a million years, say, it is very unlikely that any trace whatsoever of our species will exist to be found. We don't live in the right places, our goods and structures are not durable, etc. Consider how few fossils we have and how many billions and billions of dinosaurs there once were. Only those that happened to lie in places where the conditions remained very static for millions of years were preserved. Dinosaurs were flung all over the planet whereas human beings live, comparatively, on a tiny patch of the earth's surface. And the places where we do live do not share the nearly-never-changing conditions necessary for preservation. If we truly want to preserve things for the ultra-long-term, a fortified bunker under the surface of the moon would be a good bet. Or a case buried deep in a peat bog perhaps, though anything on the planet's surface is going to be subject to all sorts of natural destruction (tectonic subduction, glacial movement, etc).

This same problem exists with written records if you go back far enough, because paper, papyrus and parchment have limited lifetimes. True, there are still some original Roman writings which are readable, but these are carved in stone. The multi-generational paper/parchment copies can be validated when new copies turn up (e.g. parts of the Old Testament were present within the Dead Sea Scrolls). Also, archaeological finds of old buildings etc. occasionally turn up which tend to correspond with the information present in copies considered most authentic by the historians.

In the case of the New Testament, so many copies were made and distributed geographically to so many places early enough after the originals were created that this gives more credibility to the authenticity of this than we could otherwise have.

Historians interested in separating fact from fiction will still have to consider the motives of those who kept copies of things, how many copies were kept and how long after the original events different copies with histories of their own were kept. In a situation e.g. in connection with the writings of Plato, where the oldest manuscript in existence was produced more than a thousand years after a work was written and all current copies derive from this one copy, additional validation may arise through references to the same work within surviving documents believed to be older than the oldest known copy.

Even if something is carved in stone someone else can still plane the lettering off the surface and carve a different inscription underneath, (e.g. as was done during the de-Nazification of Germany in and after 1945), so information about the history of one or more copies of a document is nearly as important as the content of the document itself.

Let's face it: proprietary format means free competition at the price of lost memories. Standardization means law enforcement with the benefit of live memories virtually forever. It's muuuch more comfortable to fight a "guerrilla war" in the marketplace than to face the competitor directly, striking it straight and accepting direct strikes...
The past is essential for the future. Don't remember who said or wrote that! (see it? even a simple thing like this is sooo easy to forget...)

I have an open-source project which is a simulation of the guidance computer used on the Apollo command module and lunar module. (The project is called "Virtual AGC"; if you're interested, you can google it or look it up at freshmeat.) But for the simulated computer to run, you need the code for the programs the real computer ran. This software varied on a mission by mission basis. Private collectors have made the effort to scan and make available one version each of the command-module and lunar-module programs, but by no means the most significant versions of those programs --- the most significant versions, I would think would be those for Apollo 11 or 17.

Yet, after literally years of trying to find additional versions of the software, I've come up empty. Some is in museums which won't grant access (or in some cases even respond to you), others are in private collections of people who won't be bothered to grant access (or in some cases even to find the material), but none is available from the feds as it should be. The National Archives, which has lots and lots of the paper documents, has none of the software at all!

Shame on you, NASA, and shame on everyone else involved, other than those few private individuals who have bestirred themselves for the public good!

We fought a lot with this at Siemens (Sietec) about fifteen years ago, when trying to decide what format to use on stackers full of 12" WORM disks, which were just nicely becoming useful for large-scale archival storage in those days. We needed a format that would outlast the disks, which probably meant 50-100 years assuming normal replacement/turnover.

We ended up with the bottom level being a WORM standard, which was served out to users via the NFS standard, which was reasonably close to a Unix filesystem, and was usable by Windows clients, and finally we stored the data in quite simple random files with tables of contents, so we could handle multi-page documents.
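
To make the "simple files with tables of contents" idea concrete, here is a minimal sketch of such a container for a multi-page document. The layout (a `TOC1` marker, a page count, then one offset/length pair per page) is invented for illustration and is not the format we actually used; `pack_pages` and `read_page` are hypothetical names.

```python
import io
import struct

MAGIC = b"TOC1"  # hypothetical marker, not the original format


def pack_pages(pages):
    """Pack a list of byte strings into one blob with a table of contents.

    Layout: MAGIC, big-endian page count, one (offset, length) pair per
    page, then the page bytes themselves in order.
    """
    header_size = len(MAGIC) + 4 + 8 * len(pages)
    out = io.BytesIO()
    out.write(MAGIC)
    out.write(struct.pack(">I", len(pages)))
    offset = header_size
    for page in pages:
        out.write(struct.pack(">II", offset, len(page)))
        offset += len(page)
    for page in pages:
        out.write(page)
    return out.getvalue()


def read_page(blob, index):
    """Return one page by index, using only the table of contents."""
    if blob[:4] != MAGIC:
        raise ValueError("not a TOC1 container")
    (count,) = struct.unpack_from(">I", blob, 4)
    if not 0 <= index < count:
        raise IndexError(index)
    offset, length = struct.unpack_from(">II", blob, 8 + 8 * index)
    return blob[offset:offset + length]
```

The point of keeping it this plain is that someone holding only the blob and a paragraph of description could write a reader in an afternoon, in any language.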

In practice we found the data we were storing was almost always images, as that was what businesses wanted to store: scanned images of legal, business and medical documents. As a commentator on Slashdot suggested, we used as simple a format as possible, but no simpler (;-))

For text documents, I recollect we did support some commercial formats, but only ones for which we knew the full specification and had a translator in source form. Our own data was mostly LaTeX, the typesetting language, expressed as ASCII (now ISO) characters, and occasionally PostScript or PDF, ditto.