Let’s Just Solve the Problem Month — July 2, 2012

I would like to declare November 2012 the very first Let’s Just Solve the Problem Month.

Here’s how it works, and what problem I want to solve.

As that sexy pontificator Clay Shirky has said on several occasions, instead of getting hung up on whether Wikipedia is great or not great, realize that Wikipedia represents a massive expenditure of energy recovered from not watching television. Not only that, but Wikipedia is one of what could be many different things happening that benefit the world. All you need is a dash of organization, a clear set of principles, and off you go.

I buy into this.

I also buy into the idea behind National Novel Writing Month, which has at its core that everyone has at least one (incredibly shitty, possibly unreadable, vogon-level-quality) novel inside them, and by setting aside one month of you being encouraged, forced, guilted and tortured, you will blow out one 50,000-word novel in that time. What happens next is up to you – burn it and move on, take it aside and polish it until you’re the next JK Rowling (or Hunter S. Thompson), or whatever tickles your fancy. But at the end, YOU WROTE A NOVEL BEFORE YOU DIED. Not bad.

What I know to be true is that there are a number of “problems” out there that need to be solved, that need one single thing to push them from “impossible” to “solved”, or, at least, “1.0”. And that one thing is a lot of human thinking. Often rote, often boring, but necessary, to slam that thing out.

So since I got to come up with this idea, let me declare the first month, November 2012, to be SOLVE THE FILE FORMAT PROBLEM MONTH.

Here’s the problem, in more detail:

In the last couple centuries, we’ve created a number of self-encapsulated data sets, or “files”. Be they letters, programs, tapes, stamped foil, piano rolls, you name it. And while many of those data sets are self-evident, a fuck-ton are not. They’re obscure. They’re weird. And worst of all, many of them are the vital link to scores of historical information.

Everyone knows this problem. It’s why old novelists cry that they can’t pull their first novel out of WordPerfect. It’s why someone who used U-matic tapes to record the first meetings of a famous protest group goes “oh well”. It’s why, in all things, someone looks at anything older than five years, and goes “bye”, figuring there’s nothing they can do.

And I’ve had to listen to the mewings about this problem for at least 20 years now, in various forms. A lot. And then the person lights up about maybe solving this problem, and then dims and says “well, we can’t really solve the problem”. Because they know – it’d take an army of people to do it.

Let’s make that goddamned army.

And before I give you a battle plan, let me say: This will solve a major issue. This will give thousands, later millions, access to a whole range of materials now shut off from them. Stuff made after 2012 will be scrutinized to see whether the ways to access it are made clear. Stuff made before? We’ll have docs, or a thread, or even a few first steps towards understanding what it was. People writing modern software will be able to make filters or plugins that use these standards. It’ll drop from being a needless rathole to being a simple matter of writing a Perl library or a JavaScript routine to pull the data in and make it work with the new thing. That will be very helpful indeed.
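To give a flavor of what “pull the data in” means once a format is documented, here is a minimal sketch (in Python rather than the Perl or JavaScript mentioned above) that reads the header of one of the best-documented formats around, GIF. The header layout comes from the public GIF87a/GIF89a specification; with documentation like that in hand, this kind of routine is an afternoon’s work instead of an archaeology project.

```python
import struct

def parse_gif_header(data: bytes) -> dict:
    """Parse the first bytes of a GIF file (a well-documented format).

    Header layout, per the public GIF87a/GIF89a spec:
      bytes 0-5   signature + version, e.g. b"GIF89a"
      bytes 6-7   logical screen width,  little-endian uint16
      bytes 8-9   logical screen height, little-endian uint16
    """
    if data[:3] != b"GIF":
        raise ValueError("not a GIF file")
    version = data[3:6].decode("ascii")
    width, height = struct.unpack_from("<HH", data, 6)
    return {"version": version, "width": width, "height": height}

# A minimal hand-built header: GIF89a, a 2x1-pixel logical screen.
sample = b"GIF89a" + struct.pack("<HH", 2, 1) + b"\x00\x00\x00"
print(parse_gif_header(sample))  # {'version': '89a', 'width': 2, 'height': 1}
```

For an undocumented format, every one of those offsets has to be guessed at; that gap is exactly what a month of collective documentation would close.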

Battle Plan:

In October, I’ll be making noise about this happening. We’ll have a logo, and we’ll have some preliminary work done.

It’ll be a big wiki, with people taking various roles of the exciting and boring parts, working on a structure, yanking in what we need.

We’ll scour the online and offline worlds to pull in every potential format ever. If it sounds like a hierarchy issue, yes indeed it is… but classification’s bugbear is a distant second to acquiring the wealth of formats now extant.

We’ll acquire examples of the formats, links to programs that deal with the formats, known variations or problems with the format, and so on.

We’ll keep doing this from the low-hanging GIF and JPEG and PNG documentation, to the aforementioned piano rolls, microfiche, obscure barcode formats and disk layouts of Cray platters. We’ll just keep doing it.

At the end of the month, having had our knees on the chest of this problem for 30 days, we’ll be dragged off the problem, kicking and screaming and still punching, and see where we are.

The resulting work will be open-licensed and available to anyone.

Now, if you just read all this, let out a big “pffffff” and are having your fingers twitching with the urge to write about how this is all impossible, just get the fuck out now. The project doesn’t need you, now or ever. Just enjoy the summer, grasshopper, and come knocking on the ant’s door in December when we’re at 1.0.

But if you read this and said “Well, I could take a shot at it, might be worth a few hours”, then you’re EXACTLY what is needed.

Think what giving a month every year will do for a problem like this. There’s plenty of others – but this is one that has vital meaning to the work I’ve done with Archive Team and to the hundreds of archivists and historians I’ve met over the past few years. If this problem is in some way handled, if an OED of formats is blown out, lives will change – projects thought undoable will be doable, and the flood of old information saved will be incalculable.

So who’s with me? SEE YOU IN NOVEMBER.


42 Comments

Well, I’ve sort of started already. I’ve been slowly investigating and implementing old archive formats for The Unarchiver, and by now I have the only open-source implementation of at least a handful of archive formats. Some are available only as source code, and for some I’ve written documentation here.

Not sure if I’ll have the time to help out further, but I might. I’m interested in seeing how it turns out, anyway.

Within the library and archives community there are two public format registries that are first steps towards dealing with this problem: PRONOM, from the UK National Archives, http://nationalarchives.gov.uk/PRONOM/Default.aspx; and (newly announced today), the Unified Digital Format Registry (UDFR), from the University of California Curation Center (UC3), funded by the Library of Congress, http://udfr.org/.

been working on this problem for about a decade
— started off with project gutenberg e-texts —
honing a _simple_ solution i’m about to take wide.

gonna use kickstarter to raise money so that i can
put all the source-code, etc., in the public-domain.
so if people would prefer to contribute a few bucks
rather than a few hours of time, i’d appreciate that…

I’ve been thinking about the file format identification and description problem since the late 80’s. Most recently, I think the breakthrough for me has been to recognize that the identity of a “format” (and its versions and variants) evolves over time, and that we can associate formats with both their specifications and their implementations.

Actually, I’m currently working on establishing an instance of OntoWiki with customized forms for entering data on file formats, which will allow (a) anyone to add information to extend the knowledgebase and (b) maintaining it in RDF/OWL behind the scenes so it will be processable by other machines (for things like DROID, JHOVE). This is part of the Preserving Virtual Worlds work and will require some refactoring of the original PVW ontology, but we should have this up and running before November, so….

You know we have extensive documentation on file formats at the Library of Congress: http://www.digitalpreservation.gov/formats/index.shtml. And I see that Stephen Abrams has already shared the links for the UDFR format registry and the PRONOM format registry. And Jerry – the UDFR that launched this week is built on OntoWiki and supports data entry. You should touch base with Stephen.

I am delighted to hear from many friends old and new! Thanks for all coming out.

Yes, part of the tenaciousness and difficulty of this problem is that so many people are trying to solve it, and to solve it in a way of interest to that group, and with an eye towards their goals.

What I am proposing to do is to bring a white-hot laser of focus onto this problem for 30 manageable days by a huge variety of folks. One of the things this project will do is drain all already-extant information about all formats into a wiki, and build links to a whole other range of materials. With luck, we can also have the incoming information formatted in a way that lets the other efforts pull the information BACK into their little cubbyholes.

That’s the goal here – pour a lot of energy into this, get a real grip on as much of everything as possible, and drill down in every direction. I can’t predict how many people will go for it, but imagine 1000 people for 30 days.

[…] I wrote in excitement at Jason Scott’s call to arms to make November 2012 the month to “solve the file format problem“. While excited, I’m not quite clear yet what the problem is, and suggested some […]

I gather you’ve seen my blog post so you know I’m excited about this. I’ve written lots about aspects of this problem in the past, too, and have even tried (and failed) to do something about it. I’d really like to contribute, the question is, how do we do that? Where do I sign up? What can I do (not been a coder for almost 20 years, so that’s not much help)?

Great idea! I feel like I’m a broken record when I warn artist friends that their art could be locked inside a format that no app will decipher in X number of years and blah blah blah, open formats, blah blah. No one listens. No one but Jason Scott, that is!

I’ll definitely be interested in taking part in this effort, even if only in a small way. I’ve been writing tools to pull data out of files in obscure formats on a non-mainstream computing platform (RISC OS) for over a decade, and information about these needs to be collected somewhere. I can at least start by collecting some links together.

There were potentially quite a few documents created in undocumented, proprietary formats on this platform, many by teachers and schoolchildren because of the widespread use of the platform in schools. It would be good to recover some of that history before it is too late.

LOVE this so much. I teach Digital Preservation for NYU’s Moving Image Archiving and Preservation MA program. I plan on incorporating JUST SOLVE THE PROBLEM into the syllabus for this November, and get the students contributing! In class, out of class, final assignments, etc. I can’t wait. If there’s anything in particular you want 10-15 students to work on, starting in September, please let me know!

I’m a big fan of an experiment to “drain all already-extant information about all formats into a wiki”. I’m still a bit confused about what the problem is, though. Is the problem identifying the file formats, using the file formats, or both? It seems to me that the second problem isn’t very tractable. I’m glad others have chimed in with data sources that would be good to drain. There is also the Wikipedia Digital Preservation Project, which aims to improve the content of articles about file formats, including their use of the File Format Infobox. I know you aren’t the biggest fan of Wikipedia, but have you thought at all about pointing your laser in that direction?

+1 for mailing list; it’s a long time between now and October and attention spans on the Internet are notoriously short. Some kind of reminder in my inbox (if not full-on discussions) would be very helpful.

What I would love: a file explorer extension (both web-browser-based and natively wrapped for Windows, Mac OS X, and Linux) that not only gives me human-readable information on the file format, but also gives me user-friendly functionality (that is of course continuously improving, based on some online database/wiki):
– An improved “open with” could offer me not only installed programs that have registered themselves in the OS file associations, but also web services and OS-native open-source programs ready for download. Again, of course, not just providing a link and leaving it to the end user, but handling the hassle for them.
– Extra columns in this filebrowser providing extra information. If it’s XML, is it well-formed? If it’s HTML, is it valid? Does it have dead links? Is it mobile-ready? Is it accessible? Will it give problems in certain browsers? For SVG, does it have semantics? For text, what encoding, what language, what’s the % of spelling errors? You see we need some plug-in scheme here. Plug-ins referring back to OS-native and web services. Flexible, not always running all the checks, not locking on slow stuff (either local or not).
– Extra filetype analysis tools, as the extension isn’t always enough.
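The last point, that the extension isn’t always enough, is what tools like the Unix `file` command solve by looking at leading “magic” bytes instead. A minimal sketch of that idea in Python is below; the signature table is a tiny illustrative sample I’ve chosen, not a real registry, but a wiki-backed plug-in could draw on a far larger one.

```python
# Identify a file's format from its leading magic bytes, independent of
# its filename extension. The signature list here is a small hand-picked
# sample for illustration only.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"GIF87a", "GIF image (87a)"),
    (b"GIF89a", "GIF image (89a)"),
    (b"%PDF-", "PDF document"),
    (b"PK\x03\x04", "ZIP archive (also a container for DOCX/ODT/JAR)"),
    (b"\xff\xd8\xff", "JPEG image"),
]

def identify(data: bytes) -> str:
    """Return a human-readable format name based on leading magic bytes."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return "unknown format"

print(identify(b"%PDF-1.4 ..."))  # PDF document
print(identify(b"\x00\x00"))      # unknown format
```

Magic bytes only get you identification, of course; the deeper per-format checks in the list above (well-formedness, encoding, validity) would each need their own plug-in behind the same dispatch.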

Let’s not only fix the problem; let’s also make it easier for the end user not to grow the problem than to grow it. Provide super user-friendly functionality for open formats, and some functionality for closed formats too, and this will steer even the lazy end user who doesn’t (want to) know about the file format problem in a positive direction.

Not much of a coder, but I am one hell of a tester. I can bring any file or system to its knees. Just keep pushing the limits until something hangs and then go looking at the logs! This proprietary-or-otherwise file issue is a pet peeve. I will watch this site and contribute if and when I can. Go get ’em folks.

I know of at least a few physical object formats that nevertheless carry computer data, namely Casio ROM packs and Yamaha Playcards. These were music cartridge/card formats introduced in the early 80’s by their respective manufacturers. Apparently the Casio ROM pack patent provides loads of information, while the Yamaha Playcard patent provides no information about the encoding of the data on the magstripe. It seems to be a simple AFSK system, but I don’t know of anyone who knows any more than that. The Yamaha Playcard system never had a writeable version, so only a finite number of distinct cards exist. Casio ROM packs, as I understand it, did have a recordable version, although it was probably battery-backed.

Yay! GO GO GO!
I’ll try to add some stuff about an old DOS-based tape backup system called Central Point Backup, with which I recently had to restore some 200 DDS2 tapes. (I had to build a DOS-based, networked 386 for that; it would have been so much more convenient if there had been a way to do it on Windows/Unix.)

[…] the Jason Scott November File Format month of action comes closer [update: original post here and wiki page here], and also as I wrestle with trying to access some 50 or so Powerpoint 4.0 files […]

[…] for the latest state) was to see what I could learn from it, with half an eye on the Jason Scott November month of action on file formats, see also planning here). So, what did I learn that might be of more general interest than the […]

Jason,
This is a great idea. Full disclosure: I make my living doing some of the things you’ve talked about (figuring out old tape, disk and file formats). So there’s a personal interest in keeping some stuff close to the vest. But I do have spasms of altruism and would like to contribute where I can without strangling my cash flow. I’ll bet there are a number of elder-geeks like me that have tidbits stashed away, currently out of google-reach. This type of crowd-sourcing could go a long way.
Chris

Chris, there will always be money in transferring and collating/sorting/classifying old data for people, enterprises and other organizations. None of this eats into that. In theory, it increases it, because being aware of all the file formats means no disk is worthless just for having stuff from THAT application, the one nobody can read.

Jason, disagree and agree. Content analysis and conversion is a bigger part of what I do than is media conversion. One small example: over the past few years I’ve gotten some tapes with files in the oldest mainframe versions of SPSS (70’s, early 80’s), not compatible with current versions. No one had been able to convert, and nobody had funding to support immediate efforts. Contacted IBM, who now owns SPSS, and even the professor who originally created it–he didn’t have any documentation or significant memory of it. So over the months I spent many hours puzzling it out, finally wrote a converter. Now, a few times a year, an academic will run across some of this old stuff and I make a modest amount converting it. I agree the knowledge shouldn’t go to waste, and there are certainly others like me that can contribute, but it means giving something up.
Now for the agreement part. There are bound to be other situations where the information is extremely unlikely to produce revenue. Also, some of the work done by folks like me is funded by NSF and similar organizations. It might be helpful to urge funders of work that involves converting arcane stuff to require that special knowledge gained during the project (and perhaps source code) be made available to the public. In that case it could be fair for them to pay a bit more for compliance.

Of course, arcane media/backup/file-system formats are also an obstacle to accessing legacy data. Proprietary backup formats, for example, such as HP’s fbackup, which isn’t compatible with other Unix variants. A whole bunch of optical disk formats were created in the early days by vendors that no longer exist. Do any of these format-info repositories cover these things?

One of the factors that puts data at risk is the very thing we’re discussing here: the format is no longer understood.

More about DARI for anyone interested…

CODATA, the Committee on Data for Science and Technology, an interdisciplinary Scientific Committee of the International Council for Science (ICSU), was established 40 years ago. Among its groups is the “Data at Risk Task Group” (DARTG). See http://ils.unc.edu/~janeg/dartg/

A major goal of DARTG is to create an Inventory of data that are at risk, and whose unique scientific information is in danger of being lost to posterity. (The Inventory will become the foundation for a Phase II project to design a series of missions to rescue that information.)