The world does not need a "conversion nightmare": a standard office file format already exists

This is an editorial about file conversions. It starts with a story about Free Software Magazine and our struggle with article formats, and goes on to explain why the world needs to get rid of Office Open XML, which could create more problems than the Microsoft monopoly itself.

When I started Free Software Magazine, we faced the problem every publication has to face: which file format should we use for articles? It was a few years ago now (as they say, time flies when you're having fun!). At the time, the web site wasn't our main focus: we were actually printing a paper magazine (!), we were generating amazing PDF files using LaTeX, and we decided that a static web site was going to "do" for quite a while. We decided that the "master" format for our articles would be XML. XML seemed like a good idea at the time. None of the other options seemed quite as feasible: text wasn't enough, HTML was too vague, ODF was too complex, and so on. Plus, everybody was using it.

Since we couldn’t find a single decent semi-visual XML editor, we asked our authors to hand in XML directly. Of course, people became very creative when they created an article file: we had to write a script that deleted white space around tags, and generally "cleaned up" the XML files we received. We also had to check manually that the files had the right em dashes, the right opening and closing speech marks, the right apostrophes, and so on. I won't even get started on the problems some authors had with getting the XML right: <p> tags left unclosed, <li> items without a <ul> first, and so on. It doesn't sound complicated, but when you have a 2500-word article full of listings, text boxes, figures and so on, and (even worse) when the XML error you get from the parser is as unhelpful as it could be, things get tricky. It was a small nightmare, which repeated with every issue of the magazine, and nearly every article. Two prospective (and influential) bloggers refused accounts with Free Software Magazine when they realised they would have to spend time tagging up XML files. Laziness? Maybe. But, as we say around here, "fair enough".
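A clean-up pass of the kind described above can be sketched in a few lines of Python. This is only an illustration, not the script we actually used: the function names `tidy_markup` and `check_well_formed`, and the list of block-level tags, are assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical block-level tags of a simple article format.
BLOCK_TAG = re.compile(r'\s*(</?(?:article|p|ul|li)\b[^>]*>)\s*')


def tidy_markup(xml_text):
    """Delete white space on both sides of block-level tags.

    Inline tags such as <em> are left alone, because the spaces
    around them are part of the sentence.
    """
    return BLOCK_TAG.sub(r'\1', xml_text)


def check_well_formed(xml_text):
    """Return None if the XML parses, else a message with line and column."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position
        return f"line {line}, column {column}: {err}"
    return None


if __name__ == "__main__":
    print(tidy_markup("<p>  Hello <em>world</em>  </p>"))
    print(check_well_formed("<article><p>unclosed paragraph</article>"))
```

Even this tiny version has to know which tags are block-level and which are inline; strip white space around every tag and the words on either side of an <em> get glued together.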

Luckily, the delirium is now over. We have upgraded our article format to Markdown Extra (with a few tweaks to allow tables and textboxes). Authors can now write articles following this Free Software Magazine article template. Issue 21, this very issue, was edited mainly using the new file format.

Converting the articles from XML to Markdown Extra/FSM was a lot of hard work. I just about managed to do it using XSLT with custom PHP calls within the XSL file. (If you are thinking "the XSLT from a basic format to Markdown should be simple", I will give you a few keywords: "white spaces", "enters", "tables", "clashing escape characters", "CDATA", and so on.) The conversion required substantial trial and error and tweaking. It contains several hacks I am not especially proud of. To date, I am not yet 100% sure it actually works for every single article. And we are talking about translating an extremely simple XML format into an extremely simple text format. As always, the conversion part was easy. However, getting it to actually work was tricky.
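To give an idea of where "white spaces" and "clashing escape characters" come from, here is a minimal sketch in Python. It is not the XSLT we used: the tag names (`article`, `p`, `em`, `strong`) and all function names are assumptions for the example, and it ignores tables, listings and CDATA entirely.

```python
import re
import xml.etree.ElementTree as ET

# Characters that are plain text in XML but carry meaning in Markdown.
MARKDOWN_SPECIALS = re.compile(r'([\\`*_\[\]#])')


def escape_markdown(text):
    """Backslash-escape characters Markdown would treat as markup."""
    return MARKDOWN_SPECIALS.sub(r'\\\1', text)


def collapse_whitespace(text):
    """Turn runs of spaces and newlines ("enters") into a single space."""
    return re.sub(r'\s+', ' ', text)


def inline_to_markdown(element):
    """Convert the text and inline children of one element."""
    parts = [escape_markdown(collapse_whitespace(element.text or ''))]
    for child in element:
        inner = inline_to_markdown(child)
        if child.tag == 'em':
            parts.append(f'*{inner}*')
        elif child.tag == 'strong':
            parts.append(f'**{inner}**')
        else:
            parts.append(inner)  # unknown inline tag: keep the text only
        parts.append(escape_markdown(collapse_whitespace(child.tail or '')))
    return ''.join(parts)


def article_to_markdown(xml_text):
    """Convert every <p> in the article into a Markdown paragraph."""
    root = ET.fromstring(xml_text)
    paragraphs = [inline_to_markdown(p).strip() for p in root.iter('p')]
    return '\n\n'.join(paragraphs)


if __name__ == "__main__":
    sample = '<article><p>Use  the\n<em>snake_case</em> name *now*.</p></article>'
    print(article_to_markdown(sample))
```

The underscore and the asterisks in the sample are ordinary characters in the XML, but they would turn into emphasis in Markdown unless they are escaped. Multiply that by code listings, tables and text boxes, and the "easy" conversion stops being easy.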

This change won't affect you--well, apart from the occasional glitch caused by a hard-to-translate article (we have over 2000 articles in our database, and we checked things by "statistical sampling"...). What is interesting is that this adventure (which I named "article conversion hell") reminded me of something that sounds obvious, but that we tend to forget: file conversions are complicated, sub-optimal, time-consuming, imperfect by nature, often wrong, often the result of guesswork, tricky, and basically evil. When you open a Microsoft Office 2000 file using OpenOffice, things might work seamlessly, things might look a little odd, or the file might look perfect--but if saved back as a Microsoft Office 2000 file, it might be ruined forever. There is a reason for this: file conversions need to be avoided (especially, as in this case, if the original file is in an undocumented back-back-back-back-backward compatible format which really doesn't deserve to exist anymore, and didn't deserve to exist in the first place). ODF isn't perfect (yet?), but it aims to be the format for office documents. It's a standard, and several pieces of software today can handle it (see: it's not an OpenOffice-only game).

Microsoft's attempt to shove OOXML down ISO's throat (effectively damaging, maybe beyond repair, the image of what should be an independent body) can damage the computer industry immensely. The fact that both ODF and Office Open XML are XML means absolutely nothing. You can see here a technical comparison between the two: converting one format to the other is anything but fun. The thousands of bogus documentation pages that come with OOXML don't help.

What I experienced with Free Software Magazine while converting (which, admittedly, wasn't really that big a deal) would be nothing compared to what the whole world will have to deal with if OOXML became "the" file format "normally" used to exchange office documents. A situation like this will impose constant conversions, quirks, compatibility problems, and so on, on all of us. It will also be a fantastic card for Microsoft: "look, GNU/Linux is sort of good, but you know, you can never trust it to open an XML file... sometimes the images are squint, you know..."

Microsoft knows this. Unsurprisingly, they have recently announced that they would release several conversion tools to translate ODF into OOXML and vice-versa. I read the article right in the middle of my "article conversion hell", and wondered if anybody else realised how disastrous it would be, if Microsoft managed to convince the world that it was "OK" to have two competing standards, since it's so easy to convert them into each other. The risk is very real: if we don't stop them, Microsoft will muscle its way in, and will force the whole world to fight with conversions for years, or decades, to come.

Microsoft proposed a bogus Office file format while an ISO standard already existed. Their shady practices to get their format fast-tracked and approved by ISO didn't work. But Microsoft is still trying--and I can guarantee, it will keep on trying until it succeeds.

The only possible answer for Microsoft and OOXML is simple: the world already has an office file format. The world neither needs nor wants a "conversion nightmare". The world's ISO-approved office format already exists: it's called ODF. Microsoft: deal with it!

Comments

I fully agree with your points. Having recently worked in a group where most people used (pirated) Word, I can certainly relate to the issue. Not even older versions of Word can read ooxml. The file size is larger than odf too.

Btw: I really miss the pdf version of the magazine. Have you ever considered bringing it back?

I am doing my bit to get the great number of M$ hooked users to realise there is an alternative to M$ Office.

If I need to send any documents as attachments to anyone I send odf files as standard, with a link to openoffice.org. So far only one person has mailed back asking if I can send the file as a M$ .doc, though I did get one file returned in the new M$ Office format that is not readable by OpenOffice.

Hello:
Read your article on file conversion and agree wholeheartedly. One standard is all that is needed. I do not use Microsoft's OS or office products. I find Open Office quite capable for my use.
The problem is purely greed and nothing else. When the bottom line is nothing but profit and not the interest of the customer, you will never have a product that is competitive and embraces change.
Mr. Gates is a shrewd businessman and knows what will make him money and keep him in control of the market. The problem is that open source has not accepted that idea per se. The open-source community has not fully recognized the need to have one standard themselves. When they come to this point, then and only then, will they appeal to the masses.
I am probably ranting in the wrong place, but I do enjoy your magazine very much. Keep up the good work.
I don't know much about html. (Sorry)
Thanks: Utah C. Burger

I am a total non-geek who has been using Linux for a few years. I have been using open office and abiword for the past few years with no problems but I have no idea what all the fuss is about. HTML and XML both work for my basic site. To me, anytime something is over-hyped, someone is trying to sell me a bill of goods.

I have been using open office and abiword for the past few years with no problems

This may be why you have no idea what the fuss is about :o)

HTML and XML both work for my basic site.

I think you've got the wrong end of the stick here. XML is a document standard which pretty much permits anybody to define their own markup tags (schema). As long as you have all your tags properly defined and they are used in the correct manner then it's XML.[1] But you don't have to tell anybody what the schema is.

Tony is using his experiences over converting the FSM website XML to Markdown as an example of what lies in store for people writing converters from one XML schema (ODF) to another (OOXML). The problem is that the creators of OOXML a) are not really telling everyone else all they need to know and b) have a history of changing such things without notice just to keep their market share.

Think of it like this: You define an XML schema which contains tags for bold and italic text (Let's say <strong> and <em> respectively). Now I define one but my tags are <bd_0> and <i_0> . The thing is I don't tell you what my tags are and I compress my documents inside an encrypted archive to which only my products have the key.[1]
How do you convert your documents to my format or vice versa? Reverse engineering is against the licence under which I offer my products, so you are stuck. Even if after a lot of brute force you get an unencrypted document - what is to stop me "upgrading" my schema and slightly changing the tags you now have?
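To make the example concrete, here is a minimal Python sketch of such a converter. It is purely illustrative: `TAG_MAP` and `convert_tags` are made-up names, and the whole thing only works because the mapping between the two sets of tags is known.

```python
import re

# Hypothetical mapping between the two example schemas above.
# A converter can only be written once somebody publishes this table.
TAG_MAP = {'strong': 'bd_0', 'em': 'i_0'}

TAG = re.compile(r'<(/?)([A-Za-z_][\w.-]*)([^>]*)>')


def convert_tags(markup, tag_map):
    """Rename every tag according to tag_map; fail loudly on unknown tags."""
    def rename(match):
        slash, name, rest = match.group(1), match.group(2), match.group(3)
        if name not in tag_map:
            raise KeyError(f"no known equivalent for <{name}>")
        return f"<{slash}{tag_map[name]}{rest}>"
    return TAG.sub(rename, markup)


if __name__ == "__main__":
    print(convert_tags('<strong>hi</strong> and <em>there</em>', TAG_MAP))
    # Going the other way needs the same table, reversed.
    reverse_map = {new: old for old, new in TAG_MAP.items()}
    print(convert_tags('<bd_0>hi</bd_0>', reverse_map))
```

If I quietly rename <bd_0> to something else in my next release, the table is out of date and every conversion fails (or, worse, silently drops formatting) until somebody works out the new tags.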

The only thing that prevents this kind of behaviour is to have an open standard which is settled upon by many not one and which is publicly available for all to use.

In every argument I have seen for OOXML I am yet to see anything which actually says why it is necessary for Microsoft to have their own format - other than greed.

To me, anytime something is over-hyped, someone is trying to sell me a bill of goods.

Ah, but this is about more than a single bill of goods - this is about tying you down so that every bill of goods you buy must go through one vendor (not a great analogy but I'm just continuing your remarks).

I missed the point of OOXML. I thought OOXML was designed to be open for all. I figured Microsoft was up to something fishy when they decided to champion an "Open code".

One good thing about me checking in to OOXML is that I am an early adopter. You guys lead the way. I am one of the people willing to try things after people do the hard work of making things user friendly. If I can stick with a distribution, I have no doubt the popularity is ready to take off. I have switched over to Linux for most of my home computers with the others primarily for games.

Maybe Microsoft sees us early adopters making our own web sites from Linux computers using web hosts that run Linux servers. I am using WordPress on one site and phpBB on another. I have little technical knowledge to build my sites. I have two computers on one desk. One for instructions and one for the operation.

I hope OOXML does not become a standard. It was very frustrating to have to learn to add special code to make my stuff line up in IE7. It still does not line up properly in all of the different Explorer browsers.

Yes it's that word "open" at the start -- but of course OOXML is designed to make money. At least that is the only reason I can think of why MS would spend all that R&D time developing a format when an open standard one exists and is all they needed really.

I hope OOXML does not become a standard. It was very frustrating to have to learn to add special code to make my stuff line up in IE7. It still does not line up properly in all of the different Explorer browsers.

I think you may have indeed missed the point--slightly. Whilst IE7 will display OOXML directly (and not ODF BTW) and I am sure M$ would love it if people wrote web content in it--you should remember that OOXML (Office Open XML) is primarily an office document format and as such has little to do directly with IE.

That said I agree that coding web content to view in the various IE versions is a nightmare. My preferred technique is to produce four CSS files. One for Firefox/Opera/Konqueror/Safari et al, one for IE5.5, one for IE6 and one for IE7. Then I chuck in a script which detects the browser and presents the relevant css links.

Hi
Good writing - but there seem to be more and more stories coming out in the file-format debate suggesting that the huge work on the ISO standard is about the future formats in M$ in 2009.

The scheduled 2009 releases of Windows 7 (or whatever it's called) and Office have been in the pipeline for a long time - and the development of their new versions of their OS and Office depends on M$'s version of XML.

That is why they work so hard - and put a HUGE amount of money and people into this - to get their format to become a global ISO standard.

We saw it in the ODF debate - and this time the open-source community feels the full force of this multi-billion $ company, with their lawyers and the 'independent research' that comes out.

And M$ just don't get it - they just don't understand the world outside the company - as it was reported a couple of days ago http://www.fsdaily.com/Community/Microsoft_wants_open_sourcers_to_write_an_OOXML_translator