The Metadata Mania — June 26, 2011

OK, now that I’ve gone pro with the archiving and I’ve been at it pretty hardcore for a few months, my programmer brain has kicked in and I’m trying to find inefficiencies and kill them so we end up with a ton of cool stuff online.

I already know what I am and what I’m about with all this. I’ve known for some time: I’m the weird little widget, the fucked-up little strange item that is going to link a lot of people who haven’t linked up before. That said, there’s always the danger of being called, by each set of people, a poor replacement for them. And I am! I am an enjoyable archivist but not a full replacement for a “real” archivist, and I’m a home computer enthusiast but there are people who truly pants me when it comes to what I know. But my mission and foreseeable future will be spent linking all these sets together, and that’s how it’s going to be.

So now that we understand each other, let’s get to work. Let’s get to work, in fact, on the biggest single problem I’ve found.

It’s metadata. Metadata is the slowdown.

If you don’t really know what metadata is, that’s cool, it’s a kind of weird concept. In simple terms, metadata is information about other data. If you have a pile of Apple II floppy disk images, then the fact they’re Apple II floppy disk images is the metadata, as well as what’s on them, when you transferred them, who owned them… anything that’s not the data itself is the metadata.
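If it helps to see that split written down, here’s a minimal sketch in Python. The `DiskImage` class and every field name in it are made up for illustration, not anything archive.org or my scripts actually use, and the bytes are a stand-in rather than a real disk.

```python
from dataclasses import dataclass, field


@dataclass
class DiskImage:
    # The data itself: the raw bytes read off the floppy.
    data: bytes
    # The metadata: everything we know ABOUT those bytes.
    metadata: dict = field(default_factory=dict)


image = DiskImage(
    data=b"\x00" * 16,  # placeholder bytes, not a real disk image
    metadata={
        "platform": "Apple II",
        "media": "floppy disk image",
        "contents": "(what is on the disk)",
        "transferred": "(when you transferred it)",
        "owner": "(who owned it)",
    },
)
```

Throw away `metadata` and the bytes are all still there, but nobody, you included, knows what they’re looking at.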

See, it’s not that hard to add lots of data somewhere – see cd.textfiles.com where I have three million files, and yes I am going to port cd.textfiles.com to archive.org and yes I am going to be adding scans of the discs themselves and the ISOs and oh yeah, that’s right baby we’re going big with that. But the three million files, left alone, wouldn’t be interesting. Or, more accurately, they would be very interesting, but whatever you’re looking up would be flooded out by all the other files outside your immediate interest path.

Metadata, you see, is really a love note – it might be to yourself, but in fact it’s a love note to the person after you, or the machine after you, where you’ve saved someone the time it takes to find something by telling them what this thing is. Lives have been absorbed getting metadata, and so there’s an entire field of computer study about this idea, and about making your machines do the hard work for you. Google’s got some interest in this, I heard. If you could generate metadata entirely by machine, life would be pretty awesome. But you can’t. Not really. Not completely.

So let me announce another collection I’m working on. The Bitsavers Archive. Yeah, that’s right, that one. The one that people have been scanning, donating, and working on for over 15 years. I’m going to import it into archive.org. I got permission from them and we’re going in.

But again, metadata becomes the immediate issue. I’ve written a script that lets me point at a bitsavers asset, say An Apple I User Manual, and type in a title and then the description and date for the item, and then the script does the rest – upload it, generate a metadata .xml file the archive.org system uses, and check in the item so it can join the collection. Truly, fire and forget. Everything that can be automated is now automated. But the fact remains, I had to title the item, and then write a description. That’s the big holdup.
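To give a feel for the “generate a metadata .xml file” step, here’s a rough sketch in Python. This is not my actual script: the function name `build_meta_xml`, the element names, and the collection name are all assumptions picked for illustration, and the upload and check-in steps are left out entirely.

```python
import xml.etree.ElementTree as ET


def build_meta_xml(title, description, date, collection):
    """Turn the hand-typed fields into a small XML document (as a string)."""
    root = ET.Element("metadata")  # assumed root element name
    fields = (
        ("title", title),
        ("description", description),
        ("date", date),
        ("collection", collection),
    )
    for tag, value in fields:
        # One child element per field; special characters are escaped for us.
        ET.SubElement(root, tag).text = value
    return ET.tostring(root, encoding="unicode")


# The only parts a human has to supply are the first three arguments.
xml_text = build_meta_xml(
    title="An Apple I User Manual",
    description="(typed in by a person who actually read it)",
    date="(typed in by a person)",
    collection="bitsavers",  # hypothetical collection name
)
print(xml_text)
```

Notice that the code is the easy part. Nothing in there can come up with the title or the description by itself.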

See, bitsavers.org has 19,000 items in its collection. They are not described in the manner they really need to be to be useful. Someone, me or other people, will need to describe them before they should be checked in to archive.org. Sure, I could slam all 19,000 in WITHOUT descriptions, but then very little has been accomplished – imagine the stacks of a library where the backs of all the books have been covered with white-out. Not useful at all.

Similarly, the arcade manual archive I created to test out my scripts and acquisition approaches is a wild success, except in numbers. There are only 362 manuals in there, but they’re all described nicely, thanks to a team of volunteers who stepped forward to do the reading and typing necessary. They’re credited on the front page for the work they did. Notably, though, I have 4,000 more manuals to add. I’ll be adding them as I go for a little while, but it’s just going to take a huge amount of time, even with every step but “describe” automated.

So, it basically comes down to me asking you, if the idea interests you, to sign up to be a metadata warrior, someone who will work with me to describe these items. I’ll help you find something personally interesting – some people really dig reading old computer manuals, others care about arcade stuff, and I’m sure there’s even more in what I’m adding to keep your attention up. I’m not asking you to do everything – even if you add a handful of stuff, that’s more than was there a week previous – and that’s helping.

And in case the thought occurs to you, Archive Team is not really the best thing for this – this is about long-term presentation, not saving burning data from assholes. When you help me with metadata, we’re helping make available really cool stuff that has been saved but which needs a nice tag on the outside so later people and generations can know what’s in there.