A few days ago Microsoft made the specification of its binary Office file formats publicly available. Now Joel Spolsky provides a review of the specification, including the famous leap year bug that made it into the OOXML spec.

If you started reading these documents with the hope of spending a weekend writing some spiffy code that imports Word documents into your blog system, or creates Excel-formatted spreadsheets with your personal finance data, the complexity and length of the spec probably cured you of that desire pretty darn quickly. A normal programmer would conclude that Office’s binary file formats:

* are deliberately obfuscated
* are the product of a demented Borg mind
* were created by insanely bad programmers
* and are impossible to read or create correctly.

You’d be wrong on all four counts. With a little bit of digging, I’ll show you how those file formats got so unbelievably complicated, why it doesn’t reflect bad programming on Microsoft’s part, and what you can do to work around it.

The Excel file format specification is remarkably obscure about this. It just says that the 1904 record indicates “if the 1904 date system is used.” Ah. A classic piece of useless specification. If you were a developer working with the Excel file format, and you found this in the file format specification, you might be justified in concluding that Microsoft is hiding something. This piece of information does not give you enough information. You also need some outside knowledge, which I’ll fill you in on now. There are two kinds of Excel worksheets: those where the epoch for dates is 1/1/1900 (with a leap-year bug deliberately created for 1-2-3 compatibility that is too boring to describe here), and those where the epoch for dates is 1/1/1904.
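To make the two date systems concrete, here is a minimal Python sketch of how a whole-number Excel date serial could be turned into a calendar date. It rests on the commonly documented behaviour rather than on anything the specification spells out: in the 1900 system serial 1 is 1 January 1900 and serial 60 is the nonexistent 29 February 1900 inherited from 1-2-3, while in the 1904 system serial 0 is 1 January 1904. The function name `excel_serial_to_date` and its `date1904` parameter are made up for this illustration and are not part of any Microsoft API.

```python
from datetime import date, timedelta


def excel_serial_to_date(serial: int, date1904: bool = False) -> date:
    """Convert a whole-number Excel date serial to a calendar date.

    Hypothetical helper for illustration only; time-of-day fractions
    and negative serials are not handled.
    """
    if date1904:
        # 1904 system: serial 0 is 1904-01-01, no leap-year quirk.
        return date(1904, 1, 1) + timedelta(days=serial)

    # 1900 system: serial 1 is 1900-01-01, and serial 60 stands for
    # 1900-02-29, a day that never existed (1900 was not a leap year).
    if serial == 60:
        raise ValueError("serial 60 is the fictitious 1900-02-29")
    if serial < 60:
        # Before the phantom leap day the serials line up directly.
        return date(1899, 12, 31) + timedelta(days=serial)
    # After the phantom leap day every serial is one too high,
    # so the effective epoch shifts back by one day.
    return date(1899, 12, 30) + timedelta(days=serial)


if __name__ == "__main__":
    print(excel_serial_to_date(59))                 # last real day before the gap
    print(excel_serial_to_date(61))                 # first day after the gap
    print(excel_serial_to_date(0, date1904=True))   # 1904 epoch
```

Under these assumptions the same calendar day is 1462 serials apart in the two systems, which is why a reader that ignores the 1904 record ends up with dates that are off by about four years.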

We will see how next week's BRM fixes the leap year bug for Open XML. The binary specification of MS Office matters for Open XML because ECMA, the submitter of the standard, justified a second standard by its alleged "high-fidelity backwards compatibility with the binary formats". However, the current specification was made publicly available only a few days ago. The implication would be that in 200 years someone could still implement the binary format to get access to doc files. So what do we need OOXML for, then? Isn't the scenario in which all users convert their binary files to OOXML a bit unrealistic? And Microsoft still provides no mapping. No one can verify whether the ISO standard candidate OOXML is more "backwards compatible" than the existing ISO standard, ISO/IEC 26300:2006. Applications are available to convert the old binary files to both formats.

Let's say Microsoft removes the specification of the DOC format from their website tomorrow. What do you do?

Excuse me for actually taking the side of MS here, but the specs were made public, and they are going to be out there for as long as anyone thinks there is a need to keep them around. MS cannot take this information back, nor can they revoke the open, unrestricted disclosure they just made.

The problem as I see it is that the specifications are so bad and incomplete. An implementer in 50 years will most likely not have all the background information required to understand the specification. As I see it, the reverse engineering efforts made by the OpenOffice.org project and others will be a more comprehensible source of information in many respects.

Still, this is a lot better than what we had before from MS, which was absolutely nothing (or, at least, nothing which was openly published and freely redistributable). Good start. More like this, please. If MS spends some more work filling in the holes in the current spec, we might see some real interoperability happen.

The reason they don't want to add it as an annex to the DIS29500 is pretty obvious.

If they officially posted it there, the material would face ISO review. That review would almost certainly lead to DIS29500 being rejected, since the "documentation" is just a dump of help files that covers only part of what is needed.

By keeping the "documentation" of the legacy formats outside the formal ISO process, they can both argue that "the documentation is posted under RAND-Z terms and thus does not concern DIS29500" and try to dodge the question "Is the documentation sufficient, and is everything about the legacy formats covered by the OSP?"