Q&A: The Long Term Prospects — June 8, 2011

As someone who is a computer historian and is into the whole data archival and preservation area, I thought you might be able to answer this question. We have some data, including precious family audio recordings from years ago, which we are currently keeping on several hard drives. If all copies were lost, this data would be absolutely irreplaceable. To keep this from happening, I’d like to store it on a medium not requiring the built-in electronics of a hard drive to recover at some future date. Obviously a hard drive in storage is great, but if one tiny chip decided to blow the last time you powered it down before putting it in storage, you’re going to have a rude awakening when you try to use it again years down the road. The media I am considering are SD cards, thumb drives, and DVDs. Obviously a thumb drive could have the same problem as a hard drive, in that the electronics needed to read the media are contained in the media itself. On the other hand, because the necessary electronics are included, all that is needed to get the data off the drive is a USB port on the target computer. With SD or any other card-based storage, the problem becomes, what if a reader for your chosen format is not readily available when you need to read the data?

What media do you use for archival of data having long-term interest or importance? Obviously in your case, since some of the data you have is for public consumption, if you had a massive media failure, members of the public could contribute. But I’m sure you have lots of data which is not public, and of which you have one of few, or the only, copies. So how do you store it to try to ensure that, five or ten or twenty or fifty years down the road, it will still be recoverable?

I think what brought this whole subject up is, last night we had an external USB hard drive fail. Fortunately, I don’t think much if anything was lost. But it just brought up the question of how to back precious data up in the most reliable way. For me at least, the thought would be to back the data up onto a medium, store it away, and hope we never have to use it.

Thanks for any thoughts.

Jayson

The huge big fat secret is that the digital world has a major problem with data. It is extremely easy for us to move data around, and store it in a lot of things, be they USB sticks, a random hosting service, some hard drives, or tape. It’s rather easy to do that.

What’s not so easy is to keep any of these things around for a very long time. And by ‘very long’, I’m going to say ‘ten years’. With some professional-grade tape, there are estimates of a quarter century, but most people don’t have those types of tapes or those kinds of tape drives. Hard drives, especially consumer-grade hard drives, are completely random as to their death rates – I’ve been able to get things from a decade ago going with no problem, and I’ve had drives die the first time I dumped 200GB of data onto them, straight. It’s a big, crazy mess and it’s the secret big problem with all this computer stuff.

Here’s what I personally do.

First, I separate my data three ways:

– Personally Created
– Things I Like
– Things Everybody Likes

Personally Created is anything that came off of my hands or my works or that I helped bring into digital existence. It is, to an individual, the most precious of all materials: writings, artwork, music, raw footage, photos… all the “stuff” that a person makes. It varies, of course – I include things like e-mail, weblog entries, cat tweets, video editing save-offs, which happen to also be things I created. If I lose them, they’re gone. Nobody’s going to have these unless I actively go out and put them in other peoples’ hands.

Things I Like are things that are out there that may be unique, maybe not, and I am in possession of them. For example, someone sends me a home video, or snaps a photo with their phone and sends it to me via SMS, or walks up to me and hands me a hard drive with stuff. (This is stuff under the “you should have this” department, which has been growing steadily for me as I’ve become known as the digital heritage and preservation guy.) What demarcates this is that it is probably unique and losing it would be a bummer, whereas losing the Personally Created stuff is in some way a disaster. A disaster you may be happy about, but still, a catastrophe of data. With “Things I Like”, maybe not so much. Some very unique things, some not unique things. This is the sort of stuff that most surprises a person later, like “holy crap, you saved all my doll pictures I sent you” or “whoa, it turns out YOU had the blueprints drive”.

Things Everybody Likes are things that are out in the world, that we all share because they’re known entities. Examples are distributed music, videos, games, applications, development kits, CD/DVD images… stuff which you didn’t make (ripping from a disc doesn’t count as “making” in this context) and which it is VERY likely you could get your hands on again. Or most likely. You know.

Now, there will always be exceptions, but I find that there tends to be, in anyone’s personal collection, a small percentage of Personally Created stuff, a smaller or same amount of Things I Like, and then a lot of Things Everybody Likes. The mistake a lot of people make in thinking about this problem, if they think about it, is they lump all three things together – they have this small amount of preciously created stuff, the stuff that you’re really worried about with regard to longevity, and then there’s these middle of the road items that you’d be super bummed if you lost but not, you know, throwing things around. And then gigabytes of awesome music you like but which could be downloaded right now, in minutes, from a variety of sources both legitimate and not. And so if you think of the gigabytes problem, now you’re sad – that’s too much to deal with. But if you take, say, the sub-gigabyte amount of personally created stuff – now you’re talking!

The key to what I do is sharing – with very, very little exception, I share everything I have, constantly, to as wide an audience as possible. Instead of worrying about the scans I did of stuff I have in my archives, I put them on archive.org or textfiles.com or my weblog and probably a pile of other locations. I hope everyone enjoys them. Or maybe ignores them. Or re-discovers them every few years and I see my viewer counts shoot up in that realm. Whatever. It’s about as saved as anything else.

I also have the “4-3-2 rule” – four copies on three things in two locations. I sometimes break this rule, of course, but not often.
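For the scripting-inclined, here is a minimal sketch of checking that rule against a hand-written list of copies. Everything in it is an assumption made for illustration: the `Copy` record, its field names, and especially the reading of "three things" as three distinct storage devices or services.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    """One stored copy of a collection (hypothetical record for this sketch)."""
    thing: str      # the device or service holding it, e.g. "external drive A"
    location: str   # where that thing physically lives, e.g. "home"


def satisfies_432(copies):
    """Return True if there are at least 4 copies on 3 things in 2 locations."""
    things = {c.thing for c in copies}
    locations = {c.location for c in copies}
    return len(copies) >= 4 and len(things) >= 3 and len(locations) >= 2


family_audio = [
    Copy("desktop drive", "home"),
    Copy("external drive A", "home"),
    Copy("external drive A", "home"),   # a second folder on the same drive
    Copy("online archive", "hosting provider"),
]
print(satisfies_432(family_audio))
```

Note that two copies on the same drive count as two copies but only one "thing", which is the point of having three separate numbers in the rule.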

What has been happening so far is the legendary “Russian doll” storage method, where each new hard drive has a folder on it that’s the previous hard drive. This sort of works, and I keep these around, and don’t throw out the old ones – but it’s not really the kind of storage I’d suggest. I’d suggest, instead, curating the personally created items, and even the “everybody likes” stuff, and then storing them, separated this way, on whatever media you have.

Will you accumulate? Hell yes you will – it’s kind of what we do. And yes, there’s a chance that if you’re hit by a car, things will disappear that maybe shouldn’t disappear. During one of my hospital stays, I calmly gave my accompanying friend all my passwords and what to do with all the “major” material and collections at the time – he didn’t like me talking like that, but it calmed me down and made me rest easy – I’d tried.

But the rule still holds – every five years, like a checkup, you should assess your digital storage, really give it a run-through, really see what you have. Use a marker and write what the hard drive is on a label on it. Put a label on a box. Realize, at the end of it, that we’re just people, and we’re keeping stuff on some crazy toys, and the sun comes up and sets and life goes on.
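If the run-through feels daunting, a small script can at least tell you what is on a drive before you write the label. This is only a sketch under assumptions: the function name `summarize` and the choice to group by top-level folder are made up for illustration, and it uses nothing beyond Python's standard `os` module.

```python
import os


def summarize(root):
    """Return {top_level_folder: (file_count, total_bytes)} for a drive or folder.

    Files sitting directly in root are reported under ".".
    """
    summary = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        top = "." if rel == "." else rel.split(os.sep)[0]
        count, size = summary.get(top, (0, 0))
        for name in filenames:
            try:
                size += os.path.getsize(os.path.join(dirpath, name))
                count += 1
            except OSError:
                # Unreadable entry (broken link, bad sector): leave it out of the totals.
                continue
        summary[top] = (count, size)
    return summary


if __name__ == "__main__":
    # "." is a placeholder; point this at the mounted drive you are assessing.
    for folder, (count, size) in sorted(summarize(".").items()):
        print(f"{folder}: {count} files, {size / 1e6:.1f} MB")
```

The output is rough on purpose: it is meant to answer "what is this drive, roughly?" so the marker-and-label step has something to go on.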

I like your method of dividing up the data into different categories. I’ve unconsciously done something similar to this when prioritizing data for different levels of redundancy.

Here’s a question I’ve always wondered about: How do we protect data against EMP? Much of the world’s data resides in US datacenters, so if some country went nuts and hit the US with such a weapon what would happen to the data on magnetic storage? How about flash storage?

Thank you for validating my method! I tried the Russian Doll thing (it’s why I still have some of my oldest stuff at all), but for the most part I simply publish online everything I create, in addition to saving it on various media: a USB stick here, an SD card there. And I do it as I work, not in big lumps whenever I remember.

Of course, no discussion of backups is complete without the famous saying of Linus Torvalds: “Only wimps use tape backup: real men just upload their important stuff on ftp, and let the rest of the world mirror it ;)”

Okay, so I’m the guy who asked this question. But I have some other tips which should be quite obvious, but there’s always someone who has to learn the hard way. Never store anything even remotely valuable on only one drive. If a hard drive fails and you have a backup, ensure you still have a backup after you install the new drive. Never count on any one hard drive to be accessible at any point in time. I heard of someone who, having just purchased a 2GB drive, copied all their data over to it, then promptly erased the smaller drives. Very next day, guess what happens? That’s right, the brand new 2GB drive died, and they lost everything. This is painfully obvious, but you never have a good enough reason to delete your only backup. If you need the space, buy another drive instead. I know of someone who did this. He wanted to install a game or something, so deleted his only backup. Wouldn’t you know it, during the time he didn’t have a backup, the drive in question failed.
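The "never delete your only backup" habit can be turned into a check you run before erasing anything. What follows is a sketch, not a tool: `safe_to_delete` and its `min_remaining` parameter are hypothetical names, the default of two surviving copies is an assumption drawn from "never store anything on only one drive", and the script cannot tell whether the other copies really sit on different drives. That part is still on you.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large recordings do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def safe_to_delete(target, other_copies, min_remaining=2):
    """Return True only if enough other readable copies are byte-identical to target."""
    wanted = sha256_of(target)
    good = 0
    for path in other_copies:
        if os.path.abspath(path) == os.path.abspath(target):
            continue  # the file itself does not count as its own backup
        try:
            if sha256_of(path) == wanted:
                good += 1
        except OSError:
            pass  # a missing or unreadable copy is not a copy
    return good >= min_remaining
```

A copy that exists but no longer reads back correctly is treated the same as no copy at all, which is the situation both stories above ran into.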
