Sunday, March 15, 2009

I've been thinking a lot about data generation and maintenance. I generate a lot of data when I'm just sitting around doing nothing, and even more when I'm working. Having a good system for data organization is key, but I'm becoming increasingly concerned about my data's lifespan.

For a while now, I've been trying to save as much of my spreadsheet data in plain text format as possible. It's the natural thing for me to do as I've gotten more proficient with R and now Python, but it also saves my data from getting tied up in proprietary formats that I can't be sure their owners will keep maintaining. With everything else, I've also been trying to strip it down to the simplest appropriate standard.

But the fact is that no matter what I do, I still always feel as if my data is at risk. Call me data-paranoid, I guess. Maybe I just lack a reliable system for maintaining multiple complete backups of my necessities, but no matter how many copies I have in however many places, hard drives crash and servers fail. Losing an ephemeral data file seems far more likely to me than the destruction of the equivalent hard copy.

It occurs to me that an unimaginable amount of data is lost on a daily basis world-wide, but its impact seems negligible. I wonder how this global data loss compares to other historical losses.

The most catastrophic data loss for western civilization that I know of would be the fire at the Library at Alexandria. According to my layman's understanding, the Library at Alexandria was the largest collection of human knowledge at the time, the destruction of which may have contributed to the West's descent into the Dark Ages. This is the system crash to end all system crashes. So how much was lost?

I've tried to do some ballpark figures here. According to Wikipedia, the library's collection was between 500,000 and 1,000,000 scrolls. Who knows how much information was on all of these scrolls, but if they were all Torahs, they'd each have 304,805 characters. According to my friend Kyle, I should figure about 1.8 bits per character (his citation was Brown et al. 1992; I'll update with the full citation when I can). Here's the final equation:

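500,000 ~ 1,000,000 undecorated scrolls × 304,805 characters per scroll × 1.8 bits per character ÷ 8 bits per byte ≈ 32 ~ 64 gigabytes (counting a gigabyte as 2^30 bytes).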
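
For anyone who wants to check the arithmetic, here is a rough Python sketch of the estimate. The constants are the figures quoted above; the function name is made up for illustration, and treating a "gigabyte" as 2^30 bytes is an assumption that happens to reproduce the 32 ~ 64 range.

```python
# Figures quoted in the post; none of them are precise.
CHARS_PER_SCROLL = 304_805  # characters in a Torah scroll
BITS_PER_CHAR = 1.8         # rough information content per character (via Kyle)
BYTES_PER_GIB = 2 ** 30     # assumption: "gigabyte" means 2^30 bytes


def library_size_gib(scrolls):
    """Estimate the information held in `scrolls` scrolls, in binary gigabytes."""
    total_bits = scrolls * CHARS_PER_SCROLL * BITS_PER_CHAR
    return total_bits / 8 / BYTES_PER_GIB


for scrolls in (500_000, 1_000_000):
    print(f"{scrolls:,} scrolls is roughly {library_size_gib(scrolls):.1f} GiB")
```

The two estimates come out just under 32 and 64. With decimal gigabytes (10^9 bytes) the same inputs give roughly 34 and 69 instead.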

That's a lot of data, but is also about equivalent to one really bad laptop crash! How many Alexandrias happen every day? Of course, an equivalent information catastrophe today would involve a proportional loss of information, and I'm not sure how to begin calculating the size of current human knowledge.

One more interesting tidbit: By character count, Wikipedia would be about 11,500 scrolls. The Wikipedia database size as of October 2006 (from here) was 4.4 GB, which translates to about 68,900 scrolls.
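
The database-size conversion can be sketched the same way, running the scroll estimate in reverse. This assumes the same 1.8 bits per character and again reads a gigabyte as 2^30 bytes; the function name is hypothetical.

```python
# Same rough figures as the scroll estimate in the post.
CHARS_PER_SCROLL = 304_805  # characters in a Torah scroll
BITS_PER_CHAR = 1.8         # rough information content per character
BYTES_PER_GIB = 2 ** 30     # assumption: "GB" means 2^30 bytes


def scrolls_from_gib(size_gib):
    """Convert a data size in binary gigabytes to an equivalent number of scrolls."""
    bytes_per_scroll = CHARS_PER_SCROLL * BITS_PER_CHAR / 8
    return size_gib * BYTES_PER_GIB / bytes_per_scroll


# The October 2006 Wikipedia database size quoted above.
print(f"{scrolls_from_gib(4.4):,.0f} scrolls")
```

This lands close to the 68,900 figure, which suggests the number was worked out with binary gigabytes.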