How to hire Guillaume Portes

You want to hire a new programmer and you have the perfect candidate in mind, your old college roommate, Guillaume Portes. Unfortunately you can’t just go out and offer him the job. That would get you in trouble with your corporate HR policies which require that you first create a job description, advertise the position, interview and rate candidates and choose the most qualified person. So much paperwork! But you really want Guillaume and only Guillaume.

So what can you do?

The solution is simple. Create a job description that is written specifically to your friend’s background and skills. The more specific and longer you make the job description, the fewer candidates will be eligible. Ideally you would write a job description that no one else in the world could possibly match. Don’t describe the job requirements. Describe the person you want. That’s the trick.

So you end up with something like this:

5 years experience with Java, J2EE and web development, PHP, XSLT

Fluency in French and Corsican

Experience with the Llama farming industry

Mole on left shoulder

Sister named Bridgette

Although this technique may be familiar, in practice it is usually not taken to this extreme. Corporate policies, employment law and common sense usually prevent one from making entirely irrational hiring decisions or discriminating against other applicants for things unrelated to the legitimate requirements of the job.

But evidently in the realm of standards there are no practical limits to the application of this technique. It is quite possible to write a standard that allows only a single implementation. By focusing entirely on the capabilities of a single application and documenting it in infuriatingly useless detail, you can easily create a “Standard of One”.

Of course, this raises the question of what is essential and what is not. This really needs to be determined by domain analysis, requirements gathering and consensus building. Let’s just say that anyone who says that a single existing implementation is all one needs to look at is missing the point. The art of specification is to generalize and simplify. Generalizing allows you to do more with less, meeting more needs with fewer constraints.

Let’s take a simplified example. You are writing a specification for a file format for a very simple drawing program, ShapeMaster 2007. It can draw circles and squares, and they can have solid or dashed lines. That’s all it does. Let’s consider two different ways of specifying a file format.

In the first case, we’ll simply dump out what ShapeMaster does in the most literal way possible. Since it allows only two possible shapes and only two possible line styles, and we’re not considering any other use, the file format will look like this:

Although this format is very specific and very accurate, it lacks generality, extensibility and flexibility. Although it may be useful for ShapeMaster 2007, it will hardly be useful for anyone else, unless they merely want to create data for ShapeMaster 2007. It is not a portable, cross-application, open format. It is a narrowly defined, single-application format. It may be in XML. It may even be reviewed by a standards committee. But it is, by its nature, closed and inflexible.
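As a rough illustration of the literal approach, the sketch below assumes a hypothetical XML encoding in which every shape carries two boolean attributes, `isCircle` and `isDashed`. Those names, and the `read_literal` helper, are invented here for illustration and are not taken from any real format.

```python
import xml.etree.ElementTree as ET

# Hypothetical ShapeMaster-style document: two hard-coded boolean attributes.
LITERAL_DOC = """
<drawing>
  <shape isCircle="true" isDashed="false"/>
  <shape isCircle="false" isDashed="true"/>
</drawing>
"""

def read_literal(xml_text):
    """Return (shape, line_style) pairs from the literal format."""
    shapes = []
    for el in ET.fromstring(xml_text).iter("shape"):
        # "Not a circle" can only ever mean "square" in this format.
        shape = "circle" if el.get("isCircle") == "true" else "square"
        line = "dashed" if el.get("isDashed") == "true" else "solid"
        shapes.append((shape, line))
    return shapes

print(read_literal(LITERAL_DOC))
```

Two booleans can only ever encode two shapes and two line styles, so a triangle or a dotted line has no representation at all in a format like this.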

How could this have been done in a way which works for ShapeMaster 2007 but also is more flexible, extensible and considerate of the needs of different applications? One possibility is to generalize and simplify:

Rather than hard-code the specific behavior of ShapeMaster, generalize it. Make the required specific behavior be a special case of something more general. In this way we solve the requirements of ShapeMaster 2007, but also accommodate the needs of other applications, such as OpenShape, ShapePerfect and others. For example, it can easily accommodate additional shapes and line styles:
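A minimal sketch of what such a generalized format could look like, assuming hypothetical `type` and `lineStyle` attributes whose values are open-ended strings rather than booleans. The attribute names, the triangle and the dotted line are illustrative assumptions, not part of any real specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical generalized format: shape and line style are open-ended values.
GENERAL_DOC = """
<drawing>
  <shape type="circle" lineStyle="solid"/>
  <shape type="square" lineStyle="dashed"/>
  <shape type="triangle" lineStyle="dotted"/>
</drawing>
"""

# What ShapeMaster 2007 itself can draw, per the example in the text.
SHAPEMASTER_SHAPES = {"circle", "square"}
SHAPEMASTER_LINES = {"solid", "dashed"}

def read_general(xml_text):
    """Return (shape, line_style) pairs without assuming a fixed vocabulary."""
    root = ET.fromstring(xml_text)
    return [(el.get("type"), el.get("lineStyle")) for el in root.iter("shape")]

def supported_by_shapemaster(shape, line):
    """ShapeMaster's behavior becomes a special case of the general format."""
    return shape in SHAPEMASTER_SHAPES and line in SHAPEMASTER_LINES

for shape, line in read_general(GENERAL_DOC):
    print(shape, line, supported_by_shapemaster(shape, line))
```

The same file serves ShapeMaster 2007 and any application with a richer vocabulary; each reader decides what it can draw instead of the format deciding for it.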

This is a running criticism I have of Microsoft’s Office Open XML (OOXML). It has been narrowly crafted to accommodate a single vendor’s applications. Its extreme length (over 6,000 pages) stems from it having detailed every wart of MS Office in an inextensible, inflexible manner. This is not a specification; this is a DNA sequence.

The ShapeMaster example given above is very similar to how OOXML handles “Art Page Borders” in a tedious, inflexible way, where a more general solution would have been both more flexible and far easier to specify and implement. I’ve written on this in more detail elsewhere.

Here are some other examples of where the OOXML “Standard” has bloated its specification with features that no one but Microsoft will be able to interpret:

This element specifies that applications shall emulate the behavior of a previously existing word processing application (Microsoft Word 95) when determining the spacing between full-width East Asian characters in a document’s content.

[Guidance: To faithfully replicate this behavior, applications must imitate the behavior of that application, which involves many possible behaviors and cannot be faithfully placed into narrative for this Office Open XML Standard. If applications wish to match this behavior, they must utilize and duplicate the output of those applications. It is recommended that applications not intentionally replicate this behavior as it was deprecated due to issues with its output, and is maintained only for compatibility with existing documents from that application. end guidance]

(This example and the following examples were brought to my attention by this post from Ben at Genii.)

What should we make of that? Not only must an interoperable OOXML application support Word 12’s style of spacing, but it must also support a different way of doing it in Word 95. And by the way, Microsoft is not going to tell you how it was done in Word 95, even though they are the only ones in a position to do so.

This element specifies that applications shall emulate the behavior of a previously existing word processing application (Microsoft Word 6.x/95/97) when determining the placement of the contents of footnotes relative to the page on which the footnote reference occurs. This emulation typically involves some and/or all of the footnote being inappropriately placed on the page following the footnote reference.

[Guidance: To faithfully replicate this behavior, applications must imitate the behavior of that application, which involves many possible behaviors and cannot be faithfully placed into narrative for this Office Open XML Standard. If applications wish to match this behavior, they must utilize and duplicate the output of those applications. It is recommended that applications not intentionally replicate this behavior as it was deprecated due to issues with its output, and is maintained only for compatibility with existing documents from that application. end guidance]

Again, in order to support OOXML fully, and provide support for all those legacy documents, we need to divine exactly how Word 6.x “inappropriately” placed footnotes. The “Standard” is no help in telling us how to do this. In fact it recommends that we don’t even try. However, Microsoft continues to claim that the benefit of OOXML and the reason why it deserves ISO approval is that it is the only format that is 100% backwards compatible with the billions of legacy documents. But how can this be true if the specification merely enumerates compatibility attributes like this without defining them? Does the specification really specify what it claims to specify?

The fact that this and other legacy features are dismissed in the specification as “deprecated” is no defense. If a document contains this element, what is a consuming application to do? If you ignore it, the document will not be formatted correctly. It is that simple. Deprecated doesn’t mean “not important” or “ignorable”. It just means that new documents authored in Office 2007 will not have it. But billions of legacy documents, when converted to OOXML format, may very well have it. How well will a competing word processor do in the market if it cannot handle these legacy tags?
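To make the consumer’s dilemma concrete, here is a minimal sketch. It assumes a simplified, namespace-free settings fragment with a `compat` container; that layout and the `unrenderable_flags` helper are hypothetical. The flag names are ones quoted in this post.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical settings fragment (no namespaces).
SETTINGS = """
<settings>
  <compat>
    <footnoteLayoutLikeWW8/>
    <suppressTopSpacingWP/>
  </compat>
</settings>
"""

# Flags that the specification names but whose behavior it does not define.
UNDEFINED_LEGACY_FLAGS = {
    "footnoteLayoutLikeWW8",
    "suppressTopSpacingWP",
    "lineWrapLikeWord6",
}

def unrenderable_flags(xml_text):
    """List legacy flags present in the document that a third-party
    consumer has no specified way to honor."""
    compat = ET.fromstring(xml_text).find("compat")
    if compat is None:
        return []
    return sorted(el.tag for el in compat if el.tag in UNDEFINED_LEGACY_FLAGS)

print(unrenderable_flags(SETTINGS))
```

Detecting the flags is the easy part. The only choices left to the consumer are to ignore them and render the document differently, or to reverse-engineer the original application.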

So I’d argue that these legacy tags are some of the most important ones in the specification. But they remain undefined, and by this ruse Microsoft has arranged things so that their lock on legacy documents extends to even when those legacy documents are converted to OOXML. We are ruled by the dead hand of the past.

This element specifies that applications shall emulate the behavior of a previously existing word processing application (Microsoft Word 5.x for the Macintosh) when determining the resulting formatting when the smallCaps element (§2.3.2.31) is applied to runs of text within this WordprocessingML document. This emulation typically results in small caps which are smaller than typical small caps at most font sizes.

[Guidance: To faithfully replicate this behavior, applications must imitate the behavior of that application, which involves many possible behaviors and cannot be faithfully placed into narrative for this Office Open XML Standard. If applications wish to match this behavior, they must utilize and duplicate the output of those applications. It is recommended that applications not intentionally replicate this behavior as it was deprecated due to issues with its output, and is maintained only for compatibility with existing documents from that application. end guidance]

You’ll need to take my word for it that “This emulation typically results in small caps which are smaller than typical small caps at most font sizes” falls well short of the level of specificity and determinism that is typical of ISO specifications.

Further:

2.15.3.51 suppressTopSpacingWP (Emulate WordPerfect 5.x Line Spacing)

This element specifies that applications shall emulate the behavior of a previously existing word processing application (WordPerfect 5.x) when determining the resulting spacing between lines in a paragraph using the spacing element (§2.3.1.33). This emulation typically results in line spacing which is reduced from its normal size.

[Guidance: To faithfully replicate this behavior, applications must imitate the behavior of that application, which involves many possible behaviors and cannot be faithfully placed into narrative for this Office Open XML Standard. If applications wish to match this behavior, they must utilize and duplicate the output of those applications. It is recommended that applications not intentionally replicate this behavior as it was deprecated due to issues with its output, and is maintained only for compatibility with existing documents from that application. end guidance]

So not only must an interoperable OOXML implementation first acquire and reverse-engineer a 14-year-old version of Microsoft Word, it must also do the same thing with a 16-year-old version of WordPerfect. Good luck.

My tolerance for cutting and pasting examples goes only so far, so suffice it for me to merely list some other examples of this pattern:

lineWrapLikeWord6 (Emulate Word 6.0 Line Wrapping for East Asian Text)

This is the way to craft a job description so you hire only the person you earmarked in advance. With requirements like the above, no others need apply.

As I’ve stated before, if this were just a Microsoft specification that they put up on MSDN for their customers to use, this would be par for the course, and not worth my attention. But this is different. Microsoft has started calling this a Standard, and has submitted this format to ISO for approval as an International Standard. It must be judged by those greater expectations.

Update:

1/14/2007 — This post was featured on Slashdot on 1/4/07 where you can go for additional comments and debate. I’ve summarized the comments and provided some additional analysis here.

I don’t see this as in any way making the standard compatible with legacy documents. The Word application may be compatible with legacy documents, but these attributes added to the OOXML are merely a ruse. All they say is “Be compatible!” like it was a wizard’s spell, but they don’t actually tell an implementor what the necessary behavior is to be compatible with, say, Word 95 East-Asian character spacing. All the intelligence remains in the application, and little is disclosed to the file format.

They might as well have named the “footnoteLayoutLikeWW8” attribute “LovePotionNumberNine”. It would have been just as informative. The fact that it has a name that sounds like it is compatibility-related does not make up for the fact that the OOXML specification does not actually tell you how it behaves.

“Considering the requirement that the standard allow for compatibility with existing documents, what would you suggest?”

I would suggest writing an application plugin for those legacy versions of MSOffice apps and perfecting – through the plugin – roundtrip conversions with a truly open, platform-independent universal XML file format such as ODF.

This is exactly what the OpenDocument Foundation’s daVinci plugin does, except we can only go as far back as MSOffice 97. With Microsoft’s expertise and access to years of binary encoded secrets, there is no reason why they couldn’t perfect a plugin that covers Word 95 or even WordPerfect 94 if that’s what the market demands.

Let the plugins do the dirty work of native in-memory-binary representations to XML and back conversions.

Microsoft has been prescient enough to provide us with a plugin architecture that allows for native file format conversions. So enough of the nonsense. Let’s make use of their prescience. The promise of XML and the age of collaborative computing is far too important for us to make the kind of mistake with our legacy information and information processes that EOOXML beckons and snares us into.

MS Office provides a compatibility options page. That requires compatibility information to be present in documents. However, it seems in full compliance with OOXML if a new application automatically converts all deprecated content into valid, up-to-date OOXML content, just as long as this is stated in its conformance statement.

The specific examples you cite involve rendering the document. It isn’t really possible to render the deprecated features of OOXML using the OOXML standard. However, it isn’t really possible to render virtually any feature of either OOXML or ODF based on the standards alone. These Office standards are not really good for rendering the content exactly the same within every application. They aren’t like print formats such as PDF or XPS.

This is all nice and well, but in reality few people have a need to absolutely FAITHFULLY convert a file written by a 16 year old application. Those who really, really, really do, will find practical ways to do so. I would suggest to let MS implement this crap themselves. The manpower they will need to do it and the resulting binaries will make their next Office product late, buggy and absolutely useless in size and functionality. The greater risk to Open Source Document solutions is that MS actually finds someone who gets that and throws all this crap out. That manager could re-start the whole Office Suite from scratch to create something that is not a dinosaur on its way to meet the asteroid. Until that happens, MS is just happily nailing its own coffin shut.

Imagine you are a vendor who is writing an OOXML-compatible word processor. Microsoft is only one. We’ve heard that Corel is adding OOXML support to WordPerfect, and that Novell is adding it to OpenOffice. So, we may have a world with multiple OOXML-supporting applications.

When WordPerfect receives an OOXML document, it has no idea if it was created from scratch in Office 2007 or whether it was an older Word 95 document that was converted to OOXML. The person opening the document may not know its provenance. All they know is whether, when loaded in WordPerfect, it displays correctly or whether it looks like crap.

The ability for WordPerfect or OpenOffice to correctly display all OOXML documents is essential for their viability as products. Why should it be any less important to them than it is for Word?

Microsoft supporters seem to be walking an uncomfortably fine line here, saying that OOXML is necessary as a standard because only it is 100% compatible with legacy formats, while at the same time saying that it is not a problem that these legacy features are only mentioned in passing, but not really sufficiently documented in the specification. I don’t think you can have it both ways.

Considering the requirement that the standard allow for compatibility with existing documents, what would you suggest?

Then don’t require that in a standard. First of all: OOXML is not a description of the older .doc files, since they were binary, and OOXML is an XML based format. A translation is needed anyway, and then you might as well encode the old visual layout with a modern and portable description. Same thing with the much publicised leap year issue in 1900 (According to MS 1900 is a leap year. According to Gregorius it isn’t).
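For reference, the Gregorian leap-year rule mentioned above is short enough to state as code. This is a minimal sketch; the function name `is_gregorian_leap` is illustrative, and Python’s standard `calendar.isleap` is used only as a cross-check.

```python
import calendar

def is_gregorian_leap(year):
    """Divisible by 4, except century years, which must be divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

for year in (1900, 1904, 2000):
    # Cross-check the hand-written rule against the standard library.
    print(year, is_gregorian_leap(year), calendar.isleap(year))
```

Under this rule 1900 is not a leap year while 2000 is, so a format that treats 1900 as a leap year needs a documented special case to stay consistent with every other date library.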

The claim that OOXML must be backward compatible is nonsense, since there are no documents to speak of.

There are two valid options, which I personally feel would work for Microsoft.

1. Describe the behavior in the standard. Stating that there is no room in a document with no length limit is a cop out – one that I’d personally like to see Microsoft called on.

2. Utilize multiple elements to achieve the same goal, with the behaviors of each of those multiple elements limited to a describable subset of the desired behaviors.

I really don’t think the second one is appropriate to most of these – they are, after all, about emulating broken behaviors. If I were editing a document which used those tags, I’d most likely be most anxious to find a way to turn them off – and if that meant opening it in OpenOffice.org, which didn’t know what to do with them and so ignored them – great. It works.

I think a bigger concern comes with Microsoft’s patent release – at least the first version read that it would only apply to ‘conformant’ attempts to follow the standard. Not handling these items in the proper way could be seen as being non-conformant.

… but in reality few people have a need to absolutely FAITHFULLY convert a file written by a 16 year old application.

This is a nonsense assertion. Medical records, long term leases, 20 year depreciation documents, military equipment specifications are just a limited set of many, many examples of essential documents which must be rendered faithfully yesterday, today and probably longer than many of us will live.

I can cite you actual locations, companies, governmental agencies where they are having problems today with accurately rendering even 5 year old documents. For example, Picatinny Arsenal in NJ houses all of the maintenance specifications for all military guns and tanks used by the US Army. This runs to literally hundreds of thousands of documents. This documentation center is the primary source for all maintenance documents for all equipment still in use. Some of this active use equipment dates back to the 1950s and 1960s and still requires maintenance.

Microsoft telling people to “be compatible” is meaningless and beside the point. Microsoft itself has not maintained compatibility with these still active use documents. MS should have its feet held to the fire to produce actual documentation explaining how to be compatible.

If you want to claim there’s no reason to maintain compatibility with ‘16 year old programs’, all you’re doing is proclaiming to the whole world your ignorance of essential long term document needs.

Really.

Think of people’s medical histories. Many people certainly live longer than 16 years. So do their medical histories.

Then there are 99-year business leases for properties worth literally millions and hundreds of millions of dollars. Real property depreciations are usually done on a basis of 20 years. Interstate highways have been constructed with a designed replacement frequency of 20 years. All associated documentation including engineering plans must be kept for much longer than that. Consider the bridges of NYC, some of which are more than 100 years old. The original engineering plans for them are still as relevant today as the day they were drawn up. These are all working documents. Multiply the number of such documents required for these structures by the number of such structures present in the US and you start talking multi-billions if not trillions of documents.

Heck, even mortgages are going to 40 years. I don’t know about you, but I have a 30 year mortgage myself. How many home owners have such mortgages? Last time I refinanced, I had almost 100 pages worth of documents to sign, all electronically generated. Once again, we’re talking multi-billions of pages of past and existing documents.

ODF was designed and implemented by people who understand the importance of being able to accurately render documents not only with programs 16 years old, but also programs which have created documents which are much older than that such as Multimate and Wordstar.

There are millions and millions of essential legal, medical and maintenance documents dating over 20 years old which still require active access as of today. What percentage of them were created by Word 4 and Office 95 is unknown, but it certainly isn’t insignificant. Many of these documents literally involve life and death matters when it comes to people’s health and maintenance. The few people who do need this are not a small absolute number. If you stop and think about it, the absolute number who directly need it probably runs to several million in the US alone, with nearly the entire population indirectly dependent on the ability to accurately render old documents.

Nearly everyone has health records of one sort or another. That’s just about every single person in the US. Are you prepared to state that most of the US population does not have health records somewhere in older electronic formats? I don’t even know how to make an accurate determination of that. Much less am I willing to put other people’s lives on the line and make such an assertion.

Please. ODF was designed to specifically address these issues today and in the future. As far as I can see, OOXML does an incomplete job at best and a negative job at worst. If OOXML can’t accurately describe how to maintain backward compatibility with its own prior document formats, then it’s not a useful standard. In fact, it’s not a standard at all. It’s not even worth value as toilet paper.

I will not trust my future health and safety to a standard which doesn’t even pay lip service to defining how to maintain compatibility with its own previous formats.

I’ve dealt with legacy electronic versions of all the document categories I’ve mentioned. Some of these documents have been in Word 4 and Office 95 format. Sure, trillions of worthless and/or ephemeral documents are created every month. But the minute percentage of essential long term records is still an incredibly huge number of documents which affect nearly everyone, if not directly then indirectly.

The truth of the matter must go something like this: Microsoft had some funky formatting code in WordX (where ‘X’ is old). When they wrote Word(X+1), they couldn’t be bothered to figure out how WordX did things – so they just cut and pasted that code into the new version. As we got to Word(X+2), Word(X+3) and so forth, this ugly little wart of code became less and less well understood. Since Microsoft are renowned for not documenting the internals of their code – you quickly arrive at a point where nobody understands this code – or even knows what it does under all circumstances.

Hence they don’t KNOW how to convert those documents into XML – so they stick in an ugly tag that tells THEIR software: “Just go and use that ugly old legacy code for this bit of the text”.

They CAN’T document what it is that the XML tag does because they themselves don’t know. If they can’t figure it out – neither will we – but for them, it doesn’t matter because they still have that old code sitting there in their shiny new application.

Guy, do you really believe that because a modern application uses slightly larger line spacing or slightly larger small caps, medical records will become useless? Faithful rendering at that level is not necessary.

Guy, what you’re describing is the work of document conversion, not document specification. For example, some of those engineering diagrams you describe may have measurements in inches, and an engineer ordering a part may find that the supplier lists the part sizes in millimetres. So he simply performs a conversion.

For the purpose of their description to a computer program, documents (including real estate leases and medical records) consist of images, text (which is really just a special subset of images, frequently repeated), and information stating where each image is to be laid out on the page.

It does not matter how a three-inch left margin with a ruled line was decided on; it will be sufficient for our purposes to replicate it using the methods of the new standard.

The document standard needs to describe what goes where. No how, and most definitely, no why. What you, and Microsoft, are losing track of is the fact that the conversion need only be one-way. To convert patient records from a 1987 AppleWorks database file into a CSV file might be difficult, but it can be done. If all else fails, it can be done by the expedient of printing it out on the ImageWriter attached to the surgery’s Apple //e and re-typing it into a modern program. We do not need to convert anything back.

Further, in very few of these cases will the actual physical appearance of the document be important. If it were, I would argue that the purpose is best served by retaining a scanned image of the document, and “tagging” that image with a transcript of the text it contains. For handwritten documents one would have to retype it; otherwise, human-checked OCR is probably good enough. Even if some aspect of the document’s appearance is relevant, it can be tagged as such. For example, if it is relevant whether the document was typed on pink or yellow paper, the document can have a “Paper Color” field, in which “Pink” or “Yellow” are acceptable entries. Or, more sensibly, the pink or yellow paper can be treated as background shading, or an underlying image, or something.

My point is this: the best legacy document compatibility layer is the human brain. You don’t need to plan out and allow for every possible conversion task in advance.

While I can understand the issues associated with Microsoft’s supposed open format I just don’t see the problem with this. As another poster pointed out this only concerns the rendering of the document. The contents of the document are unaffected. All those documents out there (if somehow magically converted to OpenXML) will not lose content. Expecting an exact dot-for-dot printout that you got 16 years ago without using the same version of the program you created the document with is just silliness. You can’t even get that using Word with old documents.

Picatinny Arsenal, business leases, and NYC bridges are all lovely examples: are you perhaps suggesting that no distinction is possible between the -information content- and the -presentation format- of a document?

If you encounter applications that require blind fidelity to visual appearance without regard to information content, scan the old printed page in, with flyspecks, erasures, staple holes, and all, visible at 1200 dpi or better.

Is this an improvement over old Mil-Spec microfiche?

Or is OpenXML more about representing information in a portable, and compatible, encoding alphabet?

Premise: Information content can be accurately reproduced, and translated to another representational format, independently of the legacy application previously used to encode it in a proprietary format, given the legacy decoding rules used by that format.

Guy, your “think of the children^H^H^H^H^H^H^H^Hmedical records” argument doesn’t make sense to me:

* Who is keeping medical records in Word format?
* Aren’t only the signed and printed copies of contracts valid? Why are you using the Word copy?
* Why do the footnotes need to be reproduced pixel for pixel?
* If current versions of Microsoft Word can’t reproduce old Word documents, what makes you think other implementors will ever be able to do so, even if Microsoft were to cooperate?
* If you absolutely need perfect conversion, why can’t you use the original software to convert it to PDF? It only needs to be done once.

Guy, if one were to try to represent a legacy document in a properly designed document standard, one would reinterpret all of the weird quirks of the document into the new format. So, for example, instead of setting a flag makeFooBarCompatibleWithBaz1.0, figure out how to adjust the fonts/margins/whatever to emulate that behavior without needing to make every program accessing the document handle the behavior right. Then the weird legacy hacks need only be done once – in the converter, not in every program accessing it later. If you encode things as Microsoft has done directly into the spec, you have the risk that you’ll only have one program which correctly handles these ‘converted’ documents – and Microsoft might decide to stop implementing these ‘deprecated’ elements eventually. If this happens, you’re truly screwed. If you need long-term archival, remove the legacy issues *now*, before it’s too late.

If the specification of this behavior “cannot be faithfully placed into narrative for this Office Open XML Standard,” then Microsoft must separately publish the specification of that behavior for each application in the OOXML standard.

If it is a real standard, then it must be fully specified.

Of course, they don’t want a real standard. They want to maintain an advantage, and they’re trying to do that by leaving holes in the spec.

There is no need to include features from 16 year old (or any age) applications in a new standard. If you want to convert, you convert. If WP6 linespacing is 0.8 of Word2007 linespacing, you write linespacing="0.8" in your converted document. You DON’T write <useWP6linespacing/><linespacing="1"/>

That is just plain silly. That is making a specification unnecessarily large for instances that are rarely used by the general public. As said: if you want to convert, then use a conversion tool. Do not use a modern specification to hold all legacy features.
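The one-time conversion described here could be sketched roughly as follows. The element layout, the `value` attribute and the `resolve_legacy_linespacing` helper are all hypothetical, and the 0.8 factor is only the “if” from the comment above, not a measured value.

```python
import xml.etree.ElementTree as ET

# Assumed factor, taken from the hypothetical in the comment, not measured.
WP6_LINESPACING_FACTOR = 0.8

def resolve_legacy_linespacing(xml_text):
    """Fold a <useWP6linespacing/> flag into an explicit linespacing value,
    so that later readers never see the legacy flag."""
    root = ET.fromstring(xml_text)
    flag = root.find("useWP6linespacing")
    spacing = root.find("linespacing")
    if flag is not None and spacing is not None:
        value = float(spacing.get("value")) * WP6_LINESPACING_FACTOR
        spacing.set("value", f"{value:g}")
        root.remove(flag)
    return ET.tostring(root, encoding="unicode")

legacy = '<paragraph><useWP6linespacing/><linespacing value="1"/></paragraph>'
print(resolve_legacy_linespacing(legacy))
```

The quirk is handled once, in the converter, and every later consumer only needs to understand plain line spacing.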

There’s always going to be a problem when converting to a new XML format from multiple, legacy formats, each with their own set of quirks. It seems to me that Microsoft had one of two options.

Option 1: Make the XML format clean and simple, and put the work into the converters that convert each legacy format into the new XML format, dealing with all of the quirks along the way.

Option 2: Complicate the new XML format by adding markers that specify the legacy behaviour. This makes the conversion task much simpler, but adds bloat to any application that wishes to fully implement the new XML format.

It’s quite clear that Microsoft have taken the second approach. Perhaps there are good technical reasons why this was easier for them to do (codifying the legacy behaviour in a clean XML representation may be much harder than simply requiring an implementing application to exhibit each of the legacy quirks), but it also results in a far less clean XML format.

Any word processing engine that has to deal with all of these legacy rendering quirks is inevitably more complex (and presumably more prone to bugs) than one which does not. This is less of a problem for Microsoft, as they already have an engine that copes with the quirks, but it definitely increases the difficulty for any other application that wishes to support the new format, especially when only Microsoft has the full knowledge of most of the quirks. I see little prospect of them being more fully documented by Microsoft. The only glimmer of hope is that ISO requires it before ratifying OOXML, but it may be a moot point by the time we get that far, if Microsoft have already won market share.

You are even completely forgetting about universities and their libraries. Think about all the papers and theses that get electronically published.

Most of the time something like Word doesn’t even enter the equation and it is PDF (which is not the best way forward either, but at least the specification has been out in the open for a long while now).

Actually, Office backward compatibility is very problematic. I work at a local university and I had a lot of problems with PowerPoint 2000 presentations not working at all in PowerPoint 2003; while loading the file it said the file was corrupted ;)

Fortunately no-one in their right mind is going to want to use WordML to preserve the formatting detail at that level. It’s just barely acceptable as an export format so that the text can be converted to something more sensible, nothing else. All the rest is window-dressing to satisfy the egos in Marketing.

I seriously doubt that for the long-term records, you need to preserve such minute details as inter-word spacing or placement of footnotes. If you need to preserve the typographical details, you should use PDF instead of a word processing format.

Even if you assume that preserving such information is essential, there is no doubt that the standard must precisely specify the semantics of such backward compatibility switches and their interactions, otherwise it’s not a standard.

Guy, you completely miss the issue. NONE of the document categories you list need to be absolutely faithfully reproduced. I.e. it does not matter if the footnotes are a bit larger or smaller, or if the spacing is a bit off, or if the fonts are a bit different, as none of it changes the meaning of the document.

If details like that DO change the meaning of a document, then a word processor file is inappropriate as a storage medium, as it doesn’t document the visual layout nearly precisely enough – print a Word document on a machine with different printer drivers, different fonts etc., and there _will_ be variations in the output.

… military equipment specifications are just a limited set of many, many examples of essential documents which must be rendered faithfully yesterday, today and probably longer than many of us will live.

Don’t know about medical records or legal records, but US military records used to be encoded in SGML formats specified in MIL DTD’s. These captured the STRUCTURE of the document, not the PRESENTATION, as the latter is too dependent on the application. It’s clear that very few people understand that distinction, although the move to XHTML and CSS suggests that some have.

This is a nonsense assertion. Medical records, long term leases, 20 year depreciation documents, military equipment specifications are just a limited set of many, many examples of essential documents which must be rendered faithfully yesterday, today and probably longer than many of us will live.

You are right that a plethora of documents are in old formats, but what you do not consider is the point that most of these documents are actually formatted in a very simple way, meaning that your emphasis on document format is only appropriate in a few cases (relatively).

Medical records (at least the ones I have seen) normally come in the form of a heading stating the procedure/problem, a date, a number of bullets stating what has been diagnosed, what has been done and what has been prescribed.

These records are fully valid and understandable even without any other formatting than keeping the line breaks. (In fact most of them have no other formatting!)

I am sure you can find a number of older documents that really are dependent on formatting to the older rules, but these, I would hazard to guess, would be few and far between. They do in fact have a problem already, as you would have to fiddle with the settings of Word to make it compatible — especially if you do not know the original format.

Apart from that, I would still like Microsoft to have used other, more generic specifications to create the same results, like explicitly specifying each non-standard feature with absolute metrics (like word-space = 0.21pt).

This is a nonsense assertion. Medical records, long term leases, 20 year depreciation documents, military equipment specifications are just a limited set of many, many examples of essential documents which must be rendered faithfully yesterday, today and probably longer than many of us will live.

What’s the problem here? They need to be rendered correctly, not edited and updated. All this requires is for Microsoft to write a pdf converter for these old documents which will render them as desired. To edit them, they could be required to undergo a one-way transformation which would preserve layout as much as possible.

Isn’t this approach much more preferable than the pain of allowing people to edit legacy documents?

The thing that I find hilarious about this is that I’m betting MS couldn’t specify what “auto-space like Word 95” meant even if it wanted to.

As I imagine it, some guy in 1993 wrote a pile of buggy, poorly documented C code to determine the spacing between full-width East Asian characters, then cashed in his options and moved to Bhutan to become a monk.

When preparing the next version of Word, the project manager looked at his code, decided it was unsalvageable, and had his team rewrite those routines from scratch. But the old code lived on as a totally black-box routine, used only when rendering a Word 95 document.

I wouldn’t be surprised if those “too complicated to explain the behavior” tags listed in the article are essentially an enumeration of all of the opaque, undocumented, legacy routines that live on in MS Office.

Of course MS will push the importance of backwards compatibility — the “correct” behavior of old-document formatting exists only in fragments of incomprehensible code that only they possess.

I’m fully in agreement with the conclusions of the post: such a “standard” would be a vendor-specific abomination.

Rob Weir said “When WordPerfect receives an OOXML document, it has no idea if it was created from scratch in Office 2007 or whether it was an older Word 95 document that was converted to OOXML. The person opening the document may not know its provenance. All they know is whether, when loaded in WordPerfect, it displays correctly or whether it looks like crap.”

Not quite true, but that does not discredit the point you are making in the original post.

I, as a vendor (http://xlsgen.arstdesign.com), know when opening an Excel file (Excel 2007 or older) whether I created it, whether it was created by Microsoft, or by some other third party.

How? Because there are a ton of undocumented details that are written or managed by only one of these. Examples are: elements and attributes that are only created by a component whose intent is to fully support the format (usually those are the more verbose ones); optimization of formatting styles or shared strings/formulas as a revealer; order of XML elements and/or attributes.
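The last of those tells, element order, can be sketched in a few lines of Python. This is only an illustration of the idea: the element names, the orders and the "producer A/B" labels are all made up, and a real fingerprint would presumably combine many more signals than this.

```python
import xml.etree.ElementTree as ET

# Hypothetical fingerprints: the order in which each producer writes the
# children of the root element. Real producers and orders would differ.
KNOWN_ORDERS = {
    ("styles", "sharedStrings", "sheets"): "producer A",
    ("sheets", "styles", "sharedStrings"): "producer B",
}


def guess_producer(xml_text: str) -> str:
    """Guess which tool wrote a file from the order of its top-level elements."""
    root = ET.fromstring(xml_text)
    # Child elements are iterated in document order.
    order = tuple(child.tag for child in root)
    return KNOWN_ORDERS.get(order, "unknown")
```

Two files with identical content but a different element order would be attributed to different producers, which is the point being made: the order carries information that no specification asked for.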

“Considering the requirement that the standard allow for compatibility with existing documents, what would you suggest?”

Either:

A. Don’t claim it’s a “standard”, call it what it is: An XML file format for Microsoft.

or

B. Do exactly what Rob Weir said in the article: Factor the behaviors into something expressible and then, well, express them — in the level of detail required for compliance. Just like any other standard.

Jonathan: “Considering the requirement that the standard allow for compatibility with existing documents, what would you suggest?”

I’d suggest generalisation. Find out what general property or properties “being like Word95” implies and model those. Then when you write the OOXML document, instead of writing “Be like Word95” you write something like “Use 90% line spacing”, “shrink small capitals by 10%”.

This takes more effort in software design, but is far more generally useful and easier to maintain and extend. This specification does not bode well for Word’s internal software design.
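As a rough sketch of what that generalisation might look like in code (Python here, purely for illustration): the flag name `beLikeWord95`, the `LayoutProps` type and the two numbers are assumptions lifted from the examples in this comment, not anything taken from the OOXML specification or measured from Word 95.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class LayoutProps:
    """Explicit, application-neutral layout properties."""
    line_spacing: float = 1.0      # multiple of the normal line height
    small_caps_scale: float = 1.0  # multiple of the normal small-caps size


# Hypothetical table of what each opaque legacy flag is taken to mean.
# The numbers echo the examples in the comment above; they are not measured.
LEGACY_FLAGS = {
    "beLikeWord95": {"line_spacing": 0.9, "small_caps_scale": 0.9},
}


def generalise(flags: Iterable[str]) -> LayoutProps:
    """Resolve legacy flags into explicit properties any consumer can apply."""
    values = {}
    for flag in flags:
        if flag not in LEGACY_FLAGS:
            raise KeyError(f"no documented meaning for flag {flag!r}")
        values.update(LEGACY_FLAGS[flag])
    return LayoutProps(**values)
```

The writer does the lookup once and stores only the resulting properties, so a reader never needs to know that Word 95 existed.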

@guy: Your post is pretty much nonsense. You take the sentence “… but in reality few people have a need to absolutely FAITHFULLY convert a file written by a 16 year old application.” and interpret ‘convert’ as ‘render’. Of course you need to be able to render legacy documents absolutely faithfully. However, that’s simple: convert to PDF, done. If you need to edit a document, that’s a whole different story, and in that case it’s rarely (read: almost never, even in the situations you describe) required that the old document is kept in its native form.

As for the guy who said that Excel 2007 is backwards compatible, he’s got to be kidding. Excel 2007 does not render existing charts correctly.

Just create a chart, say in Excel 97, and open it in Excel 2007. Tons of little and important changes.

That’s because they use a different drawing library. Frankly, why should customers have to pay the price for Microsoft mistakes again and again? Changing the chart engine (I am only taking this example here) is a big deal. Perhaps this explains why those poor decisions are always taken inside the fence, not openly.

People still use Microsoft Word? What’s that like? I thought everyone had got tired of spending 3/4 of their time making a document fit on an A4 page and repairing the formatting, and moved on to other things?

Guy: From the descriptions of the compatibility tags, are there really any that will completely corrupt these precious documents? Seems to me that many of these are trivial tags that might impact per-pixel representation.

So what? Does my medical documentation really care if the font is slightly too big? Will the blueprints of a bridge care if the tab spacing is off? (and scanning images to include in a Word document? Tsk tsk.)

If OpenOffice.org can import a WordPerfect 5.0 document and correctly display the top spacing, then they know how to render suppressTopSpacingWP already and don’t need to be told.

If they ignore the WP5 top spacing brokenness, why wouldn’t they ignore suppressTopSpacingWP as well? How are they worse off?

Either they care about maintaining rendering fidelity with documents created in WordStar or whatever or they don’t. If they do, they already do so for documents they import from WordStar binary formats. Didn’t care enough to import WordStar? Why start caring now?

Yes, MS will be the only people to implement the whole thing. But this is about interoperability. As long as they don’t start setting the mwSmallCaps flag in new documents created in Office 2007, there is no interoperability issue.

ODF was designed and implemented by people who understand the importance of being able to accurately render documents not only with programs 16 years old, but also programs which have created documents which are much older than that such as Multimate and Wordstar.

This, unfortunately, is wrong, wrong, wrong. In the spreadsheet part, ODF cannot even get itself to define the syntax of expressions. Let alone the precise semantics of what functions like sum are supposed to do.

There is not a prayer of a chance that an ODF spreadsheet of today can be expected to load in a future ODF system. And if it does, forget about calculating the same answers.

This is all nice and well, but in reality few people have a need to absolutely FAITHFULLY convert a file written by a 16 year old application.

Here’s the thing: what’s to say that Microsoft’s current software doesn’t rely on these “features” to work correctly?

We should all keep in mind that most documents in business are not started from a blank sheet. Typically, an existing document is opened and “tweaked” to reflect its current intent.

I am routinely forced to use a powerpoint template that was created with Office98 for Macintosh because it was heavily used and tweaked.

Whether or not Office2007 will open the legacy documents and translate all their idiosyncrasies into non-idiosyncratic OOXML (highly unlikely), these documents will continue to live on far longer than most people realize.

When the managers start trying out that new OpenOffice that is supposed to support the OOXML “standard” and OO cannot render these legacy documents accurately, but Office2007+ can, it’s back to Office2007 because it’s “easier and less trouble.”

I agree that there are probably good reasons for having these tags with their “inappropriate” rendering internally by MS. However, to leave them unspecified in a “standard” is a little tough to swallow.

This really smacks of laziness AND nefarious intent on Microsoft’s part. It’s laziness on their part to simply encode these workarounds for backwards compatibility as a one-liner that essentially says ‘this is a workaround’, and nefarious because Rob is right: they’re doing it to prevent other OOXML vendors from providing a complete solution. Yes, there may not be any Word 95 documents encoded as OOXML yet, but you can bet that when you open a Word 95 doc in Word 12 and save it as OOXML, it will have these compatibility tags encoded in it, meaning only Office 12 will render them properly.

Are the people defending the MS position because of the need to render old documents really saying that rendering old documents “accurately” includes putting parts of a footnote in the clearly wrong place on a page? Surely what matters is that a footnote is recognisable as such and placed in a footnote-like place on the page. Insisting on backward compatibility down to reproducing broken behaviour makes no sense for anybody – if the original very important legal document was understandable with the broken behaviour of WordImperfect Version 0.2, then surely rendering it in a roughly similar way (hopefully slightly better) is good enough? Or am I missing some cunning legal point here?

On the ‘is 1900 a leap year’ question, Microsoft at some time sneakily fixed the problem. It is no longer a leap year, but the function VariantTimeToSystemTime (note: variant time is a double in days; system time is a display format structure (year, month, day, hour, etc.)) now converts 0.0000 to midnight (0h:0minutes), December 30, 1899. So Jan 1 1900 at 00:00 is 2.00000, and all the dates in your SQL database before March 1, 1900 are off by 1.00 or 2.00.
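The arithmetic can be checked with a short Python sketch. It assumes the variant time is a count of days from 30 December 1899, as described above, and compares it with a serial-number scheme that still believes 29 February 1900 existed, with serial 1 standing for 1 January 1900. Both function names are made up for illustration, and the sketch only demonstrates the one-day discrepancy.

```python
from datetime import datetime, timedelta

# Day 0.0 of the variant time scale, as described in the comment above.
VARIANT_EPOCH = datetime(1899, 12, 30)


def variant_to_datetime(days: float) -> datetime:
    """Convert a variant time (fractional days since the epoch) to a datetime."""
    return VARIANT_EPOCH + timedelta(days=days)


def leap_bug_serial_to_datetime(serial: int) -> datetime:
    """Convert a serial from a system that believes 29 Feb 1900 exists.

    Assumption: serial 1 is 1 Jan 1900 and serial 60 is the phantom 29 Feb 1900.
    """
    if serial == 60:
        raise ValueError("serial 60 is 29 Feb 1900, which never existed")
    if serial < 60:
        # Before the phantom day the serial runs one day behind the variant scale.
        return variant_to_datetime(serial + 1)
    # From 1 Mar 1900 onward the two scales agree.
    return variant_to_datetime(serial)
```

Under these assumptions the two scales agree from 1 March 1900 onward and differ by exactly one day before it, so any stored value that crosses between them without this adjustment lands on the wrong date.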

“Considering the requirement that the standard allow for compatibility with existing documents, what would you suggest?”

Jonathan, I suggest Microsoft stop promoting their file format as general-purpose and stop pushing it as a “standard.” It makes sense as an internal format for Word and it makes sense as a way for other applications to (try to) faithfully import Word files.

As for being a competitor to OpenDocument, who in the world would go with this as their native format? Think of all the Word-specific crap you’d have to consciously ignore when reading/implementing/debugging against specs.

I think it’s about time that Microsoft finally walks away from always being compatible and starts something new and fresh, or they are always going to be bloating themselves for 20 years’ worth of document types.

If users want backwards compatibility, save it in a DOC or XLS format. If they don’t need that backwardness, save it in the XML DOCX or XLSX format and from here on in that file will be compatible.

No document formats from before the year 2000 should be supported. Instead, every company should provide a conversion program from their own legacy formats to the new standard. For companies no longer in business, an open source conversion program that reverse engineers the format can be written. If we don’t do this, we will have to repeat this legacy formatting stupidity for each new standard. Let’s fix this once and for all NOW!!! No proprietary standard should be tolerated in the future – period. If this is a government agency requirement, then all current vendors must support this.

It seems to me that nearly all of these "legacy" documents everyone is talking about need to be read but not written; converting them into PDF or a similar format seems to be the only sensible solution.

Secondly, in some federal agencies, there are already problems working with data saved with older versions of Microsoft and Corel office applications. By "working with", I mean reading and printing, as modifying such old data would compromise the integrity of the archiving (which is the reason such data still exists). This standard, as currently written, does nothing to improve that situation, as it encourages the retention of non-content-related legacy artifacts within the converted documents.

I’m as anti-Microsoft and pro OASIS as anyone else, but this seems like excessive flame-bait. These are admittedly retarded behaviours from legacy apps and I can 100% agree with the idea that the spec should get even WORSE bloat from addition on how to comply with them.

Please post some specific and broader issues with the spec rather than nitpicking at obvious but small-scoped irritants.

I find it strange that anyone thinks it is worth so much work to make sure that their Word documents remain exactly the same when rendered onto the printed page. In my experience, merely changing the installed printer drivers will alter the precise word and line spacing of a document when opened in Word. Changing to a different version of Word (using the same file) does so even more!

Guaranteeing the precise layout is what PDF and Microsoft’s new PDF competitor XPS are for.

I think this is a bit overblown, and more a comment on Microsoft’s recurring inability to understand its own old code than anything else. Back when Samba was just beginning to come into its own, it wasn’t infrequent for the Samba developers to uncover “features” that were unknown or forgotten to Microsoft.

Having said that, I will defend Guy’s example of military specifications. Moving footnotes and changing print layouts can in fact be important in maintenance docs and other things that rely on page numbering and appropriate spacing. I know – I used to work for a company that produced software that rendered a lot of them to print. They’re pretty particular about it.

I’m as anti-Microsoft and pro OASIS as anyone else, but this seems like excessive flame-bait. These are admittedly retarded behaviours from legacy apps and I can 100% agree with the idea that the spec should get even WORSE bloat from addition on how to comply with them.

Please post some specific and broader issues with the spec rather than nitpicking at obvious but small-scoped irritants.

Hello? These "nitpicking irritants" are some of the most important reasons why OOXML is not an open standard. The broader issue is that the spec is narrowly tailored to fit the internal operations of one company’s product, including its historical versions. The spec does not give enough specific implementation details to enable Brand X Software to make a fully-compatible application.

It really is not about hating Microsoft. If their products can compete with Sun and IBM and whomever in an open standard environment, then they deserve to win. If they want to implement all of the backwards compatibility they can come up with, then so be it. It just should not be squeezed into an open standard specification like this.

ARGH. To “beg the question” is to presuppose your conclusion in your premise. You mean “raises the question”. Come on, English is quite ambiguous enough without randomly abusing its phrases like this.

I have documents dating back to 1992. I open them just fine in modern word processors: The document format of choice at that time was, surprise surprise, plain text with a couple of formatting codes.

I can even open my Commodore 64 documents just fine. The only catch is that my character set converter of choice (GNU Recode) doesn’t do PETSCII.

The lesson I’ve learned is this: If you want to preserve the text, use plain text or don’t expect the transition to be perfect. If you want perfect formatting, dig out your PostScript files (or PDF in new documents).

And if Microsoft doesn’t describe how to stay compatible, heck, this is a completely doomed effort anyway.

Plain Text is a well-defined standard for information interchange. HTML is a well-defined standard for information interchange. OpenDocument appears to be a well-defined standard for information interchange. OpenXML appears to be an Extremely Obfuscated, Difficult to Parse Plain Text Format that can be attempted to be used for information interchange if you get severely enough drunk first. But with these specs, don’t expect compatibility.

I don’t know if it has been stated here, but you do know that supporting these compatibility options is not required for OpenXML compliance? Developers are free to leave these out if they want.

They are also free to re-encode them in valid OpenXML formatting. OpenXML is XML. That means it’s easy to edit. If people care so much, they could, with relatively little effort, create macros or regular expressions that re-encode what many of these tags do directly into the document formatting itself (e.g. for WP line spacing, change all linespacing to fn(linespacing)).
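A minimal sketch of that re-encoding in Python, using the made-up `useWP6linespacing` flag and the 0.8 factor from the example near the top of this thread. Neither is taken from the OpenXML specification, and it only works at all if someone knows what fn() is supposed to be for each tag.

```python
import re

# Hypothetical names and factor, borrowed from an example earlier in this
# thread; they are not taken from the OpenXML specification.
FLAG = re.compile(r"<useWP6linespacing\s*/>")
SPACING = re.compile(r'linespacing="([0-9]+(?:\.[0-9]+)?)"')
WP6_FACTOR = 0.8


def bake_in_wp6_spacing(xml_text: str) -> str:
    """Drop the legacy flag and write the spacing it implied explicitly."""
    if not FLAG.search(xml_text):
        return xml_text  # nothing to re-encode
    without_flag = FLAG.sub("", xml_text)
    # Rewrite every linespacing value as fn(linespacing) = value * factor.
    return SPACING.sub(
        lambda m: f'linespacing="{float(m.group(1)) * WP6_FACTOR:g}"',
        without_flag,
    )
```

Regular expressions are a blunt tool for XML (they ignore namespaces, comments and alternative quoting), so a real converter would more likely walk the parsed tree; the regex is used here only because it is the approach suggested above.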