Where data goes when it dies and other musings

I’ve been wanting to write about Ma.gnolia’s catastrophic data loss last week ever since it happened, but wasn’t quite sure how I wanted to approach it. Larry (Ma.gnolia founder and the sole person who maintained the site) is a good friend of mine, and Ma.gnolia was one of Citizen Agency’s first clients. It’s been painful to see him struggle through this, both personally and professionally, and it’s about the worst possible [preventable] thing that can happen to a Web 2.0 service.

Still, kept in context, it’s made me reconsider some things about the nature and value of open, networked data.

I. How I Learned to Stop Worrying and Love the Bomb

According to Google’s cache of my profile on Ma.gnolia, I had accrued 5758 bookmarks and 6162 tags since I first started using the service August 08, 2004. That’s a lot of data capital to have instantly wiped out. You might think that I’d be angry, or disappointed. But I’m surprisingly zen about the whole thing. Even if I never got any of my bookmarks back, I don’t think I’d be that upset, and I’m not sure why.

If Flickr went down, I’d be pretty pissed. But Ma.gnolia for me was primarily a tool for publishing — something that I used to broadcast pointers to things that momentarily caught my fancy. There’s a lot of history in my bookmarks, no doubt. In some ways, it’s a record of all the things that I’ve read that I thought might be worth someone else reading (which is why my bookmarks are public), and clearly it’s a list of things that have affected and informed my thinking on a broad array of topics.

But, the beauty of bookmarks is that they’re secondary references to other things. The payload is elsewhere and distributed. So in some ways, yeah, I mean, there’s a lot of good data there that’s been lost (at least for the moment). But, the reality is that the legacy of my bookmarks is forever imbued in my brain as changes in how my synapses fire. The things that I can’t remember, well, perhaps they weren’t that important to begin with.

II. Start over; the blank slate.

With the money I won from the Google/O’Reilly Open Source award last summer, I decided I’d break down and buy myself a new MacBook Pro. As I was initially setting it up, I figured I’d transfer my previous system setup over from my Time Machine backup and just pick up from where I left off.

I did this, but once I logged in, the new MacBook lost its feeling of newness, and I felt encumbered. What amounted to bit-for-bit data portability left me feeling claustrophobic and restricted. I wanted the freedom of a clean system back; somehow buying a new machine wasn’t just about better performance, but about giving myself license to forget and to start over and to make new mistakes.

III. And the band played on

While I love the form-factor of my MacBook Air (now my previous system), the first generation just isn’t fast enough or beefy enough for the way that I use a Mac. It’s great for email and traveling and it really is the machine that I want to be using — just with better performance (though I hear the new models are much better).

Because the hard drive on the thing is pretty minuscule by today’s standards (80GB), I quickly maxed it out with music, videos, photos and screenshots. I was down to about 6GB of space, and OS X crawls when it can’t cache the shit out of everything, so I decided to take aggressive action and deleted my entire 30GB iTunes library.

Command-A. Command-Delete. Empty Trash.

And then it was done.

Now, I still need iTunes for iPhone syncing, but I no longer have a local music library. With the combination of Spotify, SimplifyMedia and Pandora (using PandoraJam or PandoraBoy), I’ve got a good selection of music wherever I’ve got wifi.

The act of deleting my entire music library (okay fine, I do have a complete backup on my Mac Mini media center) was cathartic. All that data… in an instant, gone. All those ratings, all that metadata, all those play counts revealing my accumulated listening habits. Gone (well, except for my Last.fm profile).

Of course, it’s not like I had original, irreplaceable copies of these tracks. There are copies upon copies out there. And knowing this, I intentionally destroyed all this data without really worrying about whether I’d ever be able to re-experience or relive my music again. In fact, I didn’t even give it a thought.

But my system sure seems a bit faster now.

IV. Microformats are the vinyl of the web

The first thing that I thought about when I heard that Ma.gnolia had had “catastrophic data loss” was that Google and Yahoo probably had pretty good caches of the site, especially given its historically high PageRank. The second thing that I thought about was that, since the site was microformatted with XFN, xFolk and other formats, recovering structured data from these caches would likely be the most reliable way of externally reconstituting Ma.gnolia, in lieu of other, more conventional data retrieval methods.
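The recovery idea here is mechanical enough to sketch. Assuming xFolk-style markup (the `xfolkentry` and `taggedlink` class names come from the xFolk draft, and the snippet below is illustrative rather than actual Ma.gnolia markup), a few dozen lines of Python’s standard `html.parser` can pull structured bookmarks back out of any cached copy of a page:

```python
# Minimal sketch: recovering xFolk-style bookmarks from cached HTML.
# The class names follow the xFolk draft; the sample markup is hypothetical.
from html.parser import HTMLParser

CACHED_HTML = """
<div class="xfolkentry">
  <a class="taggedlink" href="http://example.com/">Example Site</a>
  <a rel="tag" href="/tags/web">web</a>
  <a rel="tag" href="/tags/open">open</a>
</div>
"""

class XFolkExtractor(HTMLParser):
    """Collects {url, title, tags} entries from xFolk-marked HTML."""
    def __init__(self):
        super().__init__()
        self.bookmarks = []   # completed entries
        self._current = None  # the entry being assembled
        self._capture = None  # 'link' or 'tag' while inside a relevant <a>

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        classes = a.get("class", "").split()
        if "xfolkentry" in classes:
            self._current = {"url": None, "title": "", "tags": []}
        elif tag == "a" and self._current is not None:
            if "taggedlink" in classes:
                self._current["url"] = a.get("href")
                self._capture = "link"
            elif a.get("rel") == "tag":
                self._capture = "tag"

    def handle_data(self, data):
        if self._current is None or self._capture is None:
            return
        if self._capture == "link":
            self._current["title"] += data
        else:
            self._current["tags"].append(data.strip())

    def handle_endtag(self, tag):
        if tag == "a":
            self._capture = None
        elif tag == "div" and self._current is not None:
            self.bookmarks.append(self._current)
            self._current = None

parser = XFolkExtractor()
parser.feed(CACHED_HTML)
print(parser.bookmarks)
```

The same pass works against any saved copy of a page — a search-engine cache, the Internet Archive, or a local scrape — which is exactly what makes semantic markup a crude but durable backup format.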

Though Larry is still engaged in a full-out recovery process, it gave me some sense of pride and optimism that we had had the forethought to mark up Ma.gnolia with microformats. Indeed, this kind of archival purpose was something that Tantek had presaged in 2006:

Microformats from the beginning in my mind are serving two very important purposes.

Microformats provide simple ways of identifying larger chunks of information on the Web for easily and immediately publishing, sharing, moving, aggregating, and republishing.

Microformats are perhaps a step forward in providing building blocks for the longevity of higher fidelity information as well.

In talking with Tantek about this, he pointed out some interesting things about many modern web services, lamenting their apparent lack of concern over longevity. For example, clearly there is a great deal of movement afoot to advance the state of distributed social networking, as evidenced by XML and JSON-based protocols like Portable Contacts and Activity Streams. But these are primarily transaction-based protocols, and archive poorly (another argument for RESTful architecture, certainly).

I would therefore agree with Tantek’s oft-repeated admonishment that services that are serious about their data should always start by marking up their sites with microformats and then add additional APIs to provide functionality (as TripIt did). It’s simply good data hygiene. It’s also about the separation between form and function (or data and interactivity). And with emerging technologies like YQL, people can now build arbitrary mashups from the HTML on your homepage, without even having to know about your custom API.

It also means that, in the event of catastrophe (Ma.gnolia’s case) or dissolution of a service (as in the cases of Pownce, Journalspace or Consumating), there is some hope for data refugees left out in the cold.

When APIs go dark, how do you do a data backup? (Answer: you often can’t.) With public, microformatted content, there will likely be a public archive that can be used to reconstitute at least portions of the service. With dynamic APIs and proprietary data formats, all bets are off.

V. Death and data reincarnation

The recent destruction of my data, both intentional and unintentional, has given me lots to ponder about the value, relevance, importance and longevity of data.

I talk about “data capital” like it matters, because I suppose I want it to, and hope that someday it does make a difference just how much of yourself you share with the world, simply because it’s better to share than not to.

And now I’m in this funny situation where, because I did share, and shared openly (specifically on Ma.gnolia), there is the very real possibility of reincarnating my data from the ether of the web. It could just be that all the private data, including messages, private bookmarks and thanks are forever gone, because they were kept private. But those things which were made available to anyone and everyone, through that simple aspect, can be reconstituted by extracting their essence from the caches of the internet’s memory banks.

You think about photographs of people who have died, and of videos and other media. In the past several years we’ve had to start thinking about what happens to social networking profiles on Facebook, MySpace and Twitter of people who are no longer with us. Over time, societies have invented symbols and rituals to commemorate the dead, and often use items imbued with the deceased’s social residue to help them remember and recall and relive.

How does that work when those items are locked away in incompatible and proprietary data stores? How do we cope when technology gets between humans and their humanity?

The web is a fragile place it turns out, in spite of its redundancy and distributed design.

Efforts that threaten to close it up, lock it down or wall it into proprietary gardens are turning the web against us, against history and against civilization and the collective memory. This is perhaps one of the primary reasons why the open web is so important to me, and factors in so centrally to my work. As I grow older, perhaps I won’t always have perspective on which things will be the most important to me, but it’s critical that in the future, I don’t inhibit my and my progeny’s ability to access my digital legacy.

I find it fitting that Ma.gnolia uses an organic symbol as its logo. It has, for all intents and purposes, died.

But there is a silver lining here, and I think Larry intuitively understands: in the Ma.gnolia Open Source (M2) project, he had already sown the seeds for Ma.gnolia’s rebirth. Though it is lamentable that such a disaster would occur, I believe that creative destruction is absolutely necessary to natural systems, as forest fires are critical to the lifecycle of forests.

I also believe that things happen for a reason and that the soil of this tragedy will lead to a new start and new growth. It’s not accidental that the design of M2 called for a distributed, redundant mesh of independent bookmarking service endpoints. If anything, this situation provides Larry license to start anew, proving the necessity of death, and the wisdom of genetic inheritance and variation.

18 thoughts on “Where data goes when it dies and other musings”

I think V is some of the most sage writing on data portability out there. I wonder how people would feel about losing their twitter history? Is it more about publishing or archiving?

The writings of entire eras are lost because the technology of the time was papyrus, which gets wrecked by moisture within a century. We know more firsthand about some civilizations over others simply because they used stone tablets over clay. If ma.gnolia’s data vanishes, future generations lose out on the opportunity to understand an entire culture of early social bookmarkers. Maybe that doesn’t seem like a big deal to us, but perhaps the ancient greeks didn’t feel much different when their papyrus scrolls got wet and turned into pulp. Unfortunately, we’ll never know how they felt for certain. Very little is left.

You mentioned the effects of losing data in the specific case, but what about in aggregate? Ma.gnolia didn’t lose just your data, but everyone’s data.

This is what’s important to me when I hear you talk about the open web, anyway. On a long enough timeline, the politics behind privatizing data become obsolete. Our data is our legacy as a civilization, and it should be preserved for the lessons it can teach. However, humans are shown to be horrible conservators.

Excellent post. And Ian – good point on Flock. Clients that can automatically consume and store your hard-fought data out there in the ethers are a potentially good model for data continuity. Things like microformats and RDFa look like great candidates for standardizing how that data looks to such a client.

Ok, this may be over the top but I believe Chris may have written here the “Groundhog Day” of blog posts in that it addresses timeless truths about life and living and people will be mining this for a long time to come for deep religious overtones and cosmic karma.

I have so much to say about this, I can’t get it out in words at the moment. Let me just try to get some disorganized words out before I forget certain trains of thought. Part of the lesson is to share and to share now. We are what we share with other people. That is our true legacy – how we impact our fellow humans now and enable them to act now, inspired by us. And this: I am no longer contained at a particular blog. I am spread out all over the internet in blog comments left all over the place, in Twitter tweets, last.fm, in Friendfeed posts and comments, in Flickr … I can go on and on. I am the aggregate of all that and I can’t point to one place that best defines me. And then there is the Zen concept of Now, the Present. Not past, not future. It is what we do now that is real. I’ll have to come back to this comment and organize it and relate it all back to pieces Chris has written.

“…For example, clearly there is a great deal of movement afoot to advance the state of distributed social networking, as evidenced by XML and JSON-based protocols like Portable Contacts and Activity Streams. But these are primarily transaction-based protocols, and archive poorly.”

Each Twitter status has its own URL, and they are being cached by Google and Yahoo with the microformat markup intact. Does that count as a Tantek-approved persistence for an activity stream?

Neat idea about reinstalling your hard drive. While I’m unsure if I want to follow your steps, you’ll be pleased to learn I’ve sporadically deleted apps running on my PC when I realize I hardly use them. Might as well do the same for PDFs and other stuff I download.

A lot of the writing on the web is more like talking than writing. When you talk, there won’t be an archive where it’s stored. And you don’t care. Same should hold for a lot of the stuff we write on the web. It’s short lived communication, not important enough to be archived.

Twitter is like talking. Backup shouldn’t be that important. Imagine everything anybody ever said was archived somewhere… Where do your words go after you’ve said them?

@ako: well, it’s not exactly that straightforward. While I agree that we’ll see a shift to more real-time interactions on the web, having a record of interactions and exchanges can prove valuable, in a way that we may not currently be able to imagine. While I don’t mourn the [present] loss of my bookmarks, I still think that it would have been nice to have a collection of them, especially given the metadata that I personally added to them (tags, descriptions, ratings, etc).

The same holds true for Twitter — I oftentimes cite Twitter as a source of news. If Twitter didn’t have permalinks per post, that would be like newspapers not keeping back issues of prior papers. Losing that history, to me, would be a crime!

@Tom Gardner: your point is orthogonal to mine. It goes without saying that having a robust, triple-redundant backup system is something that services *should* have. Ma.gnolia didn’t; that can’t be undone. You don’t realize the value of good backups until you need them, and that’s because, as they say, hindsight is 20/20.

Now, not all services will have a good backup system. Even Nokia got this wrong recently. You can chide all you want, but if a service goes down, sometimes backups will fail. Sometimes companies will go under. Sometimes sites will be removed from the net by overzealous governments. In those cases, having distributed caches across the web is your only recourse for retrieving public data. Microformats at least help to improve those situations, but certainly don’t replace a regular backup and archive routine.

@Kai: yep, that’s true. I was only discussing public data.

@Bengt: I don’t think Larry made any excuses. He screwed up, admits his mistake, and has taken steps to address the problem. He can’t go back in time, but if and when Ma.gnolia lives again, I’m sure he’ll trust someone else more versed in IT to manage his infrastructure. It’s not about making excuses; it’s about taking responsibility for this situation and accepting that sometimes shit really does happen, but at least if you take a simple step like using microformats (among other things), if and when it does, you have a distributed way of recovering the data.