Friday links (July 6, 2007)

Rick Jelliffe makes my week – Rick has a great post called "Slashdotters: all together now… 'Doh!'" that pretty well sums up my experiences over the past several years. When we first announced the XML formats for Office, there was a focused collection of negative feedback on sites like Slashdot, and there were demands that unless we did "foo" it wouldn't really be an open format. Well, we've actually done all the things folks asked and then some! As you would expect, though, they continue to move the target. We fully documented over 10,000 elements, attributes, simple types, enumerations, with over 6,000 pages of documentation and made that freely available; we removed any possible legal IP issues by putting the OpenXML formats under the OSP; we completely gave away the ownership of the formats to a standards body (ECMA, and now ISO) so that even if Microsoft wanted to, we couldn't unilaterally make any changes or block the availability of the documentation; and we've sponsored an open source project that provides translation between OpenXML and another international standard, ODF.

Sun builds ODF translator for MS Office – Speaking of translator projects, Sun this week announced the availability of a plug-in for Microsoft Office that allows you to read and write ODF files. You can now see that there are people working on OpenXML support in OpenOffice and ODF support in MS Office. These different translation tools really allow the customer to decide which format they want to build their solutions around, and then use the translators if something in the other format comes along. Malte Timmermann has more information up on his blog: http://blogs.sun.com/malte/entry/sun_odf_plugin_1_0

One side note here is that it looks like we have a bug in Word 2007 where Sun's ODF converters are able to save but not open the ODF files. I looked into this a bit and it looks like Word will mistakenly assume that the ODF file is a .docx file (since they are both ZIP at the base level). Word sniffs the file to see if it knows what kind of file it is, and only if it doesn't think it can open it will it hand it off to one of the registered converters. Since we think it's a .docx, we actually try to open it, and then of course fail since it's an .odt and not a .docx. I commented on this up on Malte's blog, and I'll provide more information once we've come up with a fix. I'm not sure how long it will take to pull that together, but I'll keep everyone posted.
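To illustrate why looking only at the ZIP container is not enough, here is a minimal Python sketch of one way to tell the two package types apart by what is inside the archive. It is not how Word performs its detection. It relies on two conventions: ODF packages carry a `mimetype` entry, and Open XML packages carry a `[Content_Types].xml` part. The function and helper names are made up for this example.

```python
import io
import zipfile


def classify_zip_package(data: bytes) -> str:
    """Return 'odf', 'ooxml', or 'unknown-zip' for a ZIP-based document."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = set(zf.namelist())
        # ODF packages store their media type in an entry named "mimetype".
        if "mimetype" in names:
            mimetype = zf.read("mimetype").decode("ascii", "replace").strip()
            if mimetype.startswith("application/vnd.oasis.opendocument."):
                return "odf"
        # Open XML (OPC) packages list their parts in "[Content_Types].xml".
        if "[Content_Types].xml" in names:
            return "ooxml"
    return "unknown-zip"


def make_sample_zip(entries: dict) -> bytes:
    """Hypothetical helper: build a tiny in-memory ZIP for demonstration."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, body in entries.items():
            zf.writestr(name, body)
    return buf.getvalue()


if __name__ == "__main__":
    odt = make_sample_zip({
        "mimetype": "application/vnd.oasis.opendocument.text",
        "content.xml": "<x/>",
    })
    docx = make_sample_zip({
        "[Content_Types].xml": "<Types/>",
        "word/document.xml": "<x/>",
    })
    print(classify_zip_package(odt), classify_zip_package(docx))
```

A check along these lines, run before deciding the file is a .docx, would let the application hand an .odt file to a registered converter instead of trying to open it natively.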

PHPExcel – There's a new update to the PHPExcel library that you can use for creating Open XML spreadsheets. You can see the project up on CodePlex, or check out Maarten's blog.

Brian Jones said "We fully documented over 10,000 elements, attributes, simple types, enumerations, with over 6,000 pages of documentation and made that freely available"

There are many things factually incorrect in this sentence alone. But let’s stick to just one.

I knew you were not a developer, but I assumed that, as a program manager, you were actually writing specs.

I have to ask the obvious question, "fully documented" = "fully specified" ?

The average feature spec is 20 pages. It has to describe the feature, explain decisions, explain how it is supposed to be implemented as a functionality such as screen rendering, explain how it is supposed to work with related features, explain edge cases, explain how it works over time and how it can be migrated forward and backward, and explain how it is compatible with international issues, 508 issues, security issues. With all that, getting a spec of ONE feature into 20 pages means you are really packing stuff!

And yet, when you read the Part4 of the public specs, on average, a feature is described using a sentence no longer than 5 words.

Huh?

By my estimation, the fully specc’ed Word/Excel/Powerpoint are a monster 600,000 pages.

I certainly find it ironic that people out there complain of the "largeness of the public 6,000 pages", when really opening up would take 100 times more.

Everything you say is wrong.

Take "XML" for instance. Here is a bit of Microsoft idea of XML (this is from a real Word 2007 document) :

<w:instrText xml:space="preserve">TOC \o "1-3" \h \z \u</w:instrText>

As some people can see, there are angle brackets, so it must be XML!!

As for your Channel 9 video, I’m baffled that you are filming your 5th or 10th Level 100 video where you keep saying exactly the same thing. Can you please move on and say something actually useful?

Are you joking? The example you list actually is documented. Sure it’s XML accompanied by a rather complex string, but we have a few hundred pages that document how field codes work, and the syntax for them. It’s a bit misleading to reference that though as an example of what Open XML looks like.

The reason I’m still talking at the 100 level is that there are still a lot of people out there who are just learning about this for the first time. That level of education is super important, and while I’d love to go deeper, I need to be careful.

Not sure what you mean by the last question unless you’re just trying to be rude. Jason and I are on different teams. I’m a senior program manager lead in the Office team. From the development side I’ve owned the file formats for years now, in addition to other areas like custom XML, programmability, WordMail, smarttags, smartdocuments, etc. Jason works on the standards team. I’ve worked together with Jason and other folks like Jean Paoli and Tom Robertson on the standardization and policy side of things for a while now. That portion of my job is a bit outside of the traditional program manager role, but it’s been a lot of fun.

Back to your main point though… I’ve already said this before, but I’ll say it again… you’re really getting to be annoying as you keep repeating the same point.

You are expecting the file format documentation to also fully specify the application, and that’s just not the case. Look at other file format specifications out there and you’ll see the same level of documentation. We are not speccing an application, we are speccing a file format. If you want the specs on how to build Office, come work for Microsoft and I’ll show them to you.

@Stephane, don’t blame MS for this sort of XML specification they are trying to standardize:

they are new[1] to this "open" thing ( the "old-good-times"[2] are gone )

Maybe in a couple of years they will learn how to give to the world a *really* open, useful, implementable-by-all-not-only-MS, and interoperable standard ( and I’m not talking about this[3] kind of interoperability )

"This element [of type "CT_OfficeArtExtension" ] specifies an extension that is used for future extensions to the current version of DrawingML. This allows for the specifying of currently unknown elements in the future that will be used for later versions of generating applications.

..

Attributes Description

uri (Uniform Resource Identifier): Specifies the URI, or uniform resource identifier that represents the data stored under this tag. The URI is used to identify the correct ‘server’ that can process the contents of this tag.

First of all, I am glad you answer this time. I note that you sometimes don’t, and am not sure why. We could keep the discussion in private, but it so happens you don’t answer emails, so I have to put it in public.

"Are you joking?" I’ll leave it as an exercise for readers to make up their minds whether or not the stuff mentioned above is XML, i.e. can be processed with an XML parser, or not. This is what I mean when I say Microsoft’s Office XML was never designed with XML in mind. I think you are painting yourself into a corner here, since I know you are pretty much directly responsible for Word’s XML. I’m an Excel guy, that’s the only reason I don’t often talk about Word’s XML. But you get the point. This XML is really ridiculous, and by that I don’t mean to be rude, it’s just factual.

Dare you say that backwards compatibility led you to leave all this poor encoding as is? Really? MeThinks it’s more laziness, in addition to making sure the barrier to entry for serious implementers is kept very high (analogous to "deprecate" VML parts, for instance).

"Jason and I are on different teams."

Judging by the links on your blog to people posting positive things, even when those posts are barely constructive or useful at all, your blog reads like a blog from an evangelist. And to use the word evangelist is to be nice.

"The reason I’m still talking at the 100 level"

Why not just integrate one of your videos on your blog then? Why the need to create more videos with exactly the same content and exactly the same guys? Don’t you think your audience might find it a little suspicious that a 1,000+ person product group always goes public with the same three faces? Is there a reason why actual developers are not going public? Or is it too touchy a subject (until ISO time stamps this crap, obviously) ?

"You are expecting the file format documentation to also fully specify the application, and that’s just not the case."

To become an international standard, you’d better freaking specify this stuff!

"Look at other file format specifications out there and you’ll see the same level of documentation. "

You are so wrong, it’s unbelievable. I’ll tell you a little story, think what you want of it. It so happens I spent the week-end pushing amazing OpenOffice integration in diffopc (I guess you know what diffopc is). And I spent time reading some of the ISO OpenOffice specs that defines how styles are supposed to be understood and processed (inheritance, …). Believe it or not, everything is in the specs (it’s 730 pages only). The format, styles particularly, is designed so well, that you don’t need anything else to FULLY implement this stuff. And I have done just that in diffopc (it’s amazing now, and I have added support for Word 2007’s XML styles too by the way).

So any way you look at it, I think what it shows me is that 1) you know little outside the Microsoft fence 2) you never implemented the specs.

It’s pretty obvious you did not implement the OpenOffice specs, but I’ll leave it to readers as an exercise to estimate how much of ECMA 376 specs you have implemented…

I’m a real implementer. Maybe you are not ready for negative comments. Deal with it.

For the umpteenth time, we are dealing with a file format spec, not an application spec. You just don’t get it (or you do get it, and are just trying to spread FUD).

Also, to be an international standard, a file format spec does *not* have to also document the application it was originally tied to.

And by the way, since you have seen fit to throw some of your negative comments my way in the past, let me state I have no relationship whatsoever with Microsoft, and never have. For what it’s worth, I have 45 years of software development experience, and 20 years of file format experience, starting with GML and SGML.

"For the umpteenth time, we are dealing with a file format spec, not an application spec."

What we are talking about is making a spec available so that it is possible to implement it, right? The trick Microsoft is using is the angle-bracket trick. I think, for a person of your experience, the trick is pretty thick.

The 6,000 pages is another trick. The specification is actually very scarce; if you start reading it, it should be quite obvious. It’s long only because there is 15 years of legacy. But, as a comparison, even a cursory read of ODF specs, which I did this week-end, proves the immense and fundamental gap between the two. For ODF, to implement styles and feel safe to claim support for styles in no less than 19 XML-based OpenOffice formats (*.ODS, *.OTS, *.SXC, *.STC, *.ODT, *.OTT, *.SXW, *.STW, *.ODP, *.OTP, *.SXI, *.STI, *.ODB, *.ODG, *.OTG, *.SXD, *.STD, *.ODF, *.SXM), all I had to do is read Chapter 14. And since it is carefully designed, I don’t need to care whether I’m dealing with word processing, spreadsheet, or anything else, since the style concept is unique. Now take OOXML, even a cursory scan of the specs shows there are at least 6 different ways to do text formatting (a subset of styles). I can go on with more examples. Even worse, when you are done spending weeks implementing say styles in Word, or something narrower such as theme-less endnotes styles in Word, then you’ve still implemented nothing for Excel styles, and so on.

It’s very simple, the "backwards compatible" claim from Microsoft is just the main propaganda behind which they hide to make sure the barrier to entry is extremely high. Somebody starting from scratch will spend at least ten years to implement this stuff, and along the way will have to do plenty of reverse engineering since most of the semantics is not documented (that’s what I said with "attributes documented using sentences of 5 words on average"). It’s ok as a regular file format, not ok for an international standard. It has no merit as an international standard. In the US ISO comments, when the Microsoft propagandist (whose first name is Rex) is asked why Microsoft is using a date type that is incompatible with everything else, and also does not build on existing ISO specs for dates, his answer is "we felt it this way". Sure. By the way, Microsoft did support ISO8601 for dates in…Excel 2003 XML. In Excel 2007 XML, we are back to the proprietary OLE-based date type (leap year bug, 1904 coordinate space, does not support dates older than 1900, …)
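For readers unfamiliar with the date issues mentioned here, the following Python sketch shows the commonly described behavior of spreadsheet serial dates: a 1900-based system that counts a nonexistent 29 February 1900, and an alternative 1904-based system. The function names are made up for illustration, only whole-day serials are handled, and this is a sketch of the well-known behavior rather than text taken from the specification.

```python
from datetime import date, timedelta
from typing import Optional


def serial_to_date_1900(serial: int) -> Optional[date]:
    """Convert a whole-day serial in the 1900 date system to a calendar date.

    Serial 1 is 1 January 1900. Serial 60 stands for 29 February 1900,
    a day that never existed, so None is returned for it.
    """
    if serial < 1:
        raise ValueError("the 1900 system has no serials before 1")
    if serial == 60:
        return None  # the phantom leap day
    if serial < 60:
        return date(1899, 12, 31) + timedelta(days=serial)
    # After the phantom day every serial is one higher than the true day count.
    return date(1899, 12, 30) + timedelta(days=serial)


def serial_to_date_1904(serial: int) -> date:
    """Convert a whole-day serial in the 1904 date system (serial 0 = 1 Jan 1904)."""
    if serial < 0:
        raise ValueError("the 1904 system has no negative serials")
    return date(1904, 1, 1) + timedelta(days=serial)


if __name__ == "__main__":
    print(serial_to_date_1900(59))     # 1900-02-28
    print(serial_to_date_1900(60))     # None
    print(serial_to_date_1900(61))     # 1900-03-01
    print(serial_to_date_1900(39269))  # 2007-07-06
    print(serial_to_date_1904(37807))  # 2007-07-06
```

The same calendar day has two different serials depending on the date system in use, and neither system can represent a day before its epoch, which is the limitation the comment refers to.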

As for negative comments, well, it depends. Are you implementing this stuff? If no, how legit are you to say I am wrong?

I’m curious to know, since you are eager to point out you’ve done SGML in the past, how you reconcile the following with XML markup, or more accurately markup with XML in mind :

<w:instrText xml:space="preserve">TOC \o "1-3" \h \z \u</w:instrText>

Can’t you just understand that a carefully designed format allows anyone to support stuff with simple stack/tokenizers, keeps the barrier to entry low. It’s just the exact opposite with OOXML.

If long experience should teach you something, it’s certainly not to make the same mistakes.

"a file format spec does *not* have to also document the application it was originally tied to."

It’s tied to it unfortunately. The specs is just a reflection of the implementation, not a general purpose model for Office documents.

In fact, the specs came after the implementation. That’s what makes it so ironic. Anyone who reads it will notice plenty of typos, proving that this was NOT used to implement Office 2007 (it’s the opposite).

And it has tons of vendor-specific stuff in it. Something forbidden in international standards.

"Bill’s the General Manager of Platform Strategy at Microsoft. What got really interesting was when Yusseri raised the issue of OOXML and why didn’t Microsoft just work on ODF in collaboration instead of creating a new, bloated standard. Bill’s answer was quite surprising, as he clarified that the file format (OOXML) was a part of the software and that OOXML and the software (MS Office) are quite inseparable."

The string 'TOC \o "1-3" \h \z \u' is a "field code" in Microsoft Office. This particular one tells something about the Table of Contents (TOC).

Field codes have been there since day 1 of Microsoft Word, and pre-date the existence of XML. Most other word processors use them too. For example, Corel WordPerfect uses them. As an end-user, you don’t normally see them, but they are essential to the proper operation of the application.

To see them in WordPerfect, you can issue a "reveal codes" command. (Microsoft has a similar command.)

These field codes are literally embedded in billions of MS Word documents. There’s no conspiracy here — as I said, they are essential.

So, yes, field codes are an example of metadata which are represented in a non-XML way in most major word processors.

Now put yourself in the shoes of a software vendor like Microsoft. You’ve just changed the file format of your application to XML. What would you do with the XML representation of the field codes, given that you cannot afford to "break" anything for your customers?

I think I can guess Stephane’s answer — to hell with the *customers*, let’s make everything to do with the application pure XML, so that the job of *developers* is easier.

That’s something that a vendor who has next-to-zero market share can do without too many problems. In my opinion, it’s not something that a vendor with a huge market share can do without committing commercial suicide.

We had looked into mapping the field codes into more granular XML markup, but the problem is that you can put basically any text you want into a field code. There is a predefined syntax that Word understands, but that doesn’t mean that you can’t put something else in there.

Issues like that can obviously be worked around, but in our view it was most optimal to just write out the text representation, just like it appears in the document itself. Ecma TC45 could always decide to create "Fields version 2" that are more structured, but for the first version of the spec they just line up with how they are stored in the existing base of documents.

Many formats do this, it’s not just Open XML. ODF has it in things like formulas; HTML has it in CSS; etc.

The key difference of course between the field codes in Open XML and the formulas in ODF is that the syntax is fully defined in Open XML, and completely undefined in ODF (they are working on it for a future version).
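To make the "sublanguage inside XML" point concrete, here is a small Python sketch that splits a field instruction string into a field name and its switches. It assumes the usual Word convention that switches begin with a backslash and that arguments may be double-quoted. It handles only that simple shape and is not a full implementation of the field syntax. The function name is made up for this example.

```python
import re

# A token is either a double-quoted string or a run of non-space characters.
_TOKEN = re.compile(r'"([^"]*)"|(\S+)')


def parse_field_instruction(text: str):
    """Return (field_name, [(switch, argument_or_None), ...])."""
    tokens = []
    for match in _TOKEN.finditer(text):
        if match.group(1) is not None:
            tokens.append(("quoted", match.group(1)))
        else:
            tokens.append(("bare", match.group(2)))
    if not tokens:
        raise ValueError("empty field instruction")

    name = tokens[0][1]
    switches = []
    for kind, value in tokens[1:]:
        if kind == "bare" and value.startswith("\\"):
            switches.append([value, None])   # a new switch such as \o
        elif switches and switches[-1][1] is None:
            switches[-1][1] = value          # first argument of the last switch
        # Any other token is free text that this sketch ignores.
    return name, [tuple(item) for item in switches]


if __name__ == "__main__":
    print(parse_field_instruction(r'TOC \o "1-3" \h \z \u'))
```

An XML parser delivers the whole instruction as one text node, so a second pass like this is what a consumer needs before it can act on the field.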

That’s fair enough, although to be honest I can see how developers could feel a bit cheated if they thought they were getting Office Open XML, but wound up with Office Open XML + Office Open Legacy Cruft. How about doing a post talking about all the little sublanguages that applications have to parse? That would let the impatient start work with realistic expectations of what they’re getting into, rather than becoming alienated after investing a lot of time in a project.

On rereading my post, it’s probably not clear what I meant. I’m referring specifically to the syntaxes of different sublanguages, rather than meanings of individual metadata. So for example, the equivalent post for HTML would be:

The idea would be to present enough detail that developers don’t walk in with naive expectations about how trivial it’ll all be, and enough information that they can make good decisions about which libraries to use.

I’m intrigued by the "at least 6 different ways to do text formatting" you have mentioned on a couple of occasions.

What exactly are these? I can think of direct formatting, character styles, and paragraph styles–but that’s only three, and all have very important uses. (I’m excluding weird methods, such as using comments for formatting or embedding postscript commands.)

"I think I can guess Stephane’s answer — to hell with the *customers*, let’s make everything to do with the application pure XML, so that the job of *developers* is easier.

That’s something that a vendor who has next-to-zero market share can do without too many problems. In my opinion, it’s not something that a vendor with a huge market share can do without committing commercial suicide."

That’s a gross mischaracterization of what I have said or implied. What I said is that Microsoft has been hammering for two years that this format is "100% XML", and it isn’t. They also say it’s documented, and it isn’t.

What they don’t say is that it’s poor design, and the 15 years of legacy are being surfaced onto everyone’s face now. Emperor with no clothes. It was easy for them to just put angle brackets around their crap, and claim "it’s XML".

As for breaking file formats, for the better, isn’t Microsoft introducing 12 new and incompatible formats in the Office 2007 timeframe? What stopped them from taking advantage of the opportunity to do something actually right, come up with a proper XML for Office documents? Have you read my example with styles and how ODF is amazingly superior in its design ?

Clearly you’re a smart guy that knows plenty about OOXML, and you’ve made a lot of good points about the subject. I can tell that you’re angry about the format, but letting that anger bubble over in public does nobody any good.

It’s very cathartic to write a great screed about how OOXML is Satan’s own document format, but clicking ‘submit’ just serves to widen the chasm between the OOXML and ODF communities. Yes, OOXML isn’t 100% XML. In fact, if you ask nicely, Brian will admit as much and even write a blog post explaining what else it is.

You wouldn’t release the first alpha build of a program to the public, so why release the first draft of a post? Think about your design goals for the post, about the use cases and about how your users (readers) will perceive your message. That’s all I did, and it’s let your argument move on to the next stage.

"It’s very cathartic to write a great screed about how OOXML is Satan’s own document format"

I don’t say that. I say ISO should reject it with a laugh. As a file format, party on, I do reverse engineering as a living so I don’t care that this stuff is crap and is not documented.

But Mr Jones has been making a lot of bold claims ("full XML", "fully documented", "backwards compatible", "platform independent", …). I call BS on all of this, since it does not survive analysis, and I think it’s very sane. That’s why I asked if his boss was Jason Matusow, Mr Spin.

The best part of course, as anyone can figure out, is that OOXML is actually not new at all. Just the perception that there are angle brackets is what the Microsoft guys (and bribed partners) are betting on to make sure it passes ISO, and then government regulations. A behavior like this is beyond the pale. You can discuss screeds going public all you want, I don’t think I am doing half the harm a hugely influential corporation is doing not just to the software industry (if you support this, basically you can’t be trusted anymore), but to everybody’s life in general since software is now so pervasive.

I envision a world where all this legacy crap is thrown where it belongs, and we start working with Office file formats designed for the future.

This world is strikingly similar to what the ODF guys have been doing for a number of years, openly.

First off, you’re right about that "Satan’s own" remark. It was flippant, and I apologise.

As to bold claims with little backing, yes those are all excellent points – and I’ve delurked after enjoying this debate as a spectator sport for 6 months precisely because of that.

My point is that just saying how bad OOXML is doesn’t help anybody – I don’t mind when clueless newbies do it, but it’s painful to watch an important, substantive argument descend into a shouting match.

On the topic of full XML, you’re right. Brian Jones just agreed you’re right, but said it doesn’t matter. Now you can argue that it does matter on general principle, or you can wait for his post and argue that it matters based on the evidence, or you can move on to arguing about whether OOXML counts as fully documented; but continuing to pin him to the "OOXML is 100% XML" argument just makes it harder to move this debate forward.

I don’t think he did. In his comment above, he tried to confuse "office application specs" with "office document specs", only to avoid answering why the XML attributes (i.e. features) are barely specified.

Either that, or sometimes to avoid admitting truth, he just says it does not matter, on the grounds that not many people were using that particular feature. One year ago, I told him this stuff was not fully documented and was not ZIP: all it takes is to password-protect your document (Office 2007 creates an OLE document instead of ZIP, and uses an undocumented algorithm to encrypt it). I can tell you one thing, it’s that customers out there certainly use password-protection a lot.
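The ZIP versus OLE distinction described above can be observed from the first bytes of a file. The Python sketch below checks the well-known signatures of the two container types. It says nothing about the encryption itself, and the function name is made up for this example.

```python
# Signature of a ZIP local file header ("PK\x03\x04").
ZIP_MAGIC = b"PK\x03\x04"
# Signature of an OLE compound file.
OLE_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"


def container_kind(header: bytes) -> str:
    """Classify a file by its leading bytes: 'zip', 'ole-compound-file', or 'unknown'."""
    if header.startswith(OLE_MAGIC):
        return "ole-compound-file"
    if header.startswith(ZIP_MAGIC):
        return "zip"
    return "unknown"


if __name__ == "__main__":
    print(container_kind(b"PK\x03\x04" + b"\x00" * 26))
    print(container_kind(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1" + b"\x00" * 504))
```

In practice one would read the first eight bytes of the saved file and pass them in; a consumer that expects every .docx to be a ZIP archive needs a branch for the other case.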

But you are right, it does not matter a lot. The reason why is that, judging by the US ISO national body alone, half of the voters are either Microsoft employees or bribed partners (those guys Brian and others are offering links to on their blogs). I don’t know much about other national bodies other than, according to blogs out there, some of these are invaded by Microsoft lobbyists. So to me the war is lost. And it’s sad that those defending a good cause (i.e. open office file formats) are wasting their time in national bodies telling Microsoft how much they screwed up the opportunity to be good guys.

I agree that the answers on the "fully documented" question don’t seem to be very satisfying yet, but I see that as a different issue to the "fully XML" one. The comments I’m referring to about accepting you’re right on XMLness are:

"So, yes, field codes are an example of metadata which are represented in a non-XML way in most major word processors."

(Monday, July 09, 2007 6:50 PM by Ian Easson)

And:

"Ian’s response is pretty spot on."

(Monday, July 09, 2007 8:49 PM by BrianJones)

Moving on to fully-documented though, this is a more complex argument and I have to say I’m not sure that I’ve understood what you’ve been saying. Is it your position that OOXML should define every feature in enough detail for all OOXML-compliant applications to render that feature identically on-screen? If so, do you have an example for this like your excellent fully-XML example?

It occurred to me that what this blog may need is to be split into two, in order to better address the differing needs of the two main groups of stakeholders concerned with OOXML:

– End-users / customers (including decision-makers in IT departments, government, etc., and those interested in the standardization process)

– Developers and those interested in a more technical nitty-gritty approach.

Blog entries in the first blog could still have high-level architecture diagrams, but would reference the technical (second) blog ("for more details see…").

The technical blog could contain things like:

– The SpreadsheetML blog entry that Brian is working on

– All examples of actual XML or code

– Brian’s response to Andrew’s questions about sub-languages

Such a split of the blog could also serve to calm down some of the frustrations expressed here by developers who feel they have to weed through what they find a lot of "marketing" (i.e., the WHY, WHERE, WHEN, and WHAT) in order to get answers to specific questions about HOW to do specific things with OOXML.

For the "full XML" debunking, don’t forget the trick : although all I need is ONE counter-example to prove Jones’s bold claims wrong, he usually side steps the issue, and tries to get me to put more examples. That’s more or less the same game Ian is playing apparently. Since I cannot possibly spend my entire life posting examples, that’s how they get their way out.

As for the "fully documented" debunking, I think it speaks for itself: just read the specs and you’ll see that in no way can someone possibly implement this stuff in any meaningful way.

I said in an older thread that you can certainly read, or write, kilometers of angle brackets. And that’s the trick used in their "custom XML scenario". But obviously, if you are willing to instantiate documents, you need to correctly process the attributes. I have a two part answer to this.

First of all, assuming I create an XML markup where I store all kind of vector coordinates. I hand it you and I say "here is a next-gen document format. Instantiate it.". How can you possibly implement a client of such format without being told the coordinate spaces in which those vector coordinates get instantiated? That’s what the specs are for.

More specifically, let’s just create a Word 2007 document, save it, close it. Now unzip it, and open the theme part (/theme1.xml). You will find occurrences of:

<a:schemeClr val="phClr"/>
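The steps just described (unzip, open the theme part, look for the element) can be sketched in Python as follows. The part path `word/theme/theme1.xml` is an assumption about where Word stores the theme inside the package, the sample theme is a hand-made fragment rather than a complete one, and the function name is made up for this example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# DrawingML main namespace, bound to the "a:" prefix in theme parts.
DRAWINGML_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"

# Hand-made fragment for demonstration only; a real theme part is much larger.
SAMPLE_THEME = (
    '<a:theme xmlns:a="%s" name="Sample">'
    '<a:themeElements><a:fmtScheme name="Sample"><a:fillStyleLst>'
    '<a:solidFill><a:schemeClr val="phClr"/></a:solidFill>'
    '<a:solidFill><a:schemeClr val="accent1"/></a:solidFill>'
    '</a:fillStyleLst></a:fmtScheme></a:themeElements></a:theme>'
) % DRAWINGML_NS


def count_placeholder_colors(package: bytes,
                             theme_part: str = "word/theme/theme1.xml") -> int:
    """Count <a:schemeClr val="phClr"/> elements in the given theme part."""
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        root = ET.fromstring(zf.read(theme_part))
    tag = "{%s}schemeClr" % DRAWINGML_NS
    return sum(1 for el in root.iter(tag) if el.get("val") == "phClr")


if __name__ == "__main__":
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/theme/theme1.xml", SAMPLE_THEME)
    print(count_placeholder_colors(buf.getvalue()))
```

Finding the elements is the easy part; what the value means once found is the question the rest of this comment raises.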

Now head over to the public ECMA 376 specs, and if you lookup "phClr", you’ll find this definition (fully quoted) on page 3772 Part 4 :

"phClr (Style Color) : A color used in theme definitions which means to use the color of the style."

"Now head over to the public ECMA 376 specs, and if you lookup "phClr", you’ll find this definition (fully quoted) on page 3772 Part 4 :

"phClr (Style Color) : A color used in theme definitions which means to use the color of the style."

That helps indeed! Not!"

I thought the definition was very clear. Why exactly is it not helpful? Are you saying there is no way for a program parsing an OOXML file to determine at some point in a document what the current style is, or is the problem to determine the color associated with the style? Just asking…

Reading that example, it does strike me as at least poorly written. Firstly, "phClr" isn’t a colour, it’s more like a pointer to a colour. Secondly, referring to something not previously mentioned in the paragraph simply as "the style" requires people to guess which style is being referred to (And Murphy’s law guarantees that people will gleefully pick the wrong style). It would be better written as:

"phClr (Style Color): Used in a theme definition to indicate which colour the theme should use. This value indicates that the theme should inherit its color from the style in effect at the point where the theme occurs in the document."

It sounds like it’s not just poor writing style that you’re complaining about though. Could you explain what information needs to be included and why you, as a developer, find it harder to do your job because it isn’t included?

"phClr" is not defined in the theme. If it were <schemeClr val="accent1">, which is a valid value, then it would make more sense since all themes include an "accent1" color (it’s one of the basic colors a theme is based on).

But "phClr" has no definition. And with the schemeClr being wrapped within a style that has no color, there is no way you can possibly infer a color.

A comment from the US ISO national body : "I haven’t had time to do extensive analysis of OOXML, but I gather from the things that Patrick has told me that it appears to lack conventional XML structure. That would make sense if the data in the interchange file was expecting to be interpreted by a piece of software (i.e., Word) that has the capability to instantiate an implicit structure just from seeing styles attached to pieces of data. I’d love for someone from Microsoft to elucidate the matter."

This example Stephane just gave is a good reason why there is a need for a technical OOXML blog. Developers and implementers can ask questions like: what style did you mean, how can I find out such-and-such, etc. Microsoft and ECMA can use the questions in such a blog to find out what parts of the current version of the spec need clarification or expansion or changes in a future version.

This doesn’t mean it’s not XML, it just means it follows a different model. That’s the whole reason we created our own format rather than use one of the ones out there. We already tried that with HTML and it was just a mess.

Our model is a flat structure, but it’s still all XML, just different XML (with the few caveats aside like field codes being strings, etc.).

Now, to the XML you are referencing above as being problematic…

The enumeration value "phClr" is a value that can exist for the "schemeClr" element. If there is no style color specified, then you would just ignore this value. It’s a pretty simple check. It’s just saying that if a style color is present, use it.

I guess more information could have been provided, such as specifying that people shouldn’t use this enumeration if there is no style color specified, but that seems pretty obvious.
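The check described in the two paragraphs above could be sketched like this in Python. It encodes only the rule as stated in this comment (use the style color if one is present, otherwise ignore the value); it is not quoted from the specification. The function name, the dictionary of theme colors, and the hex strings are made up for this example.

```python
from typing import Dict, Optional


def resolve_scheme_color(val: str,
                         theme_colors: Dict[str, str],
                         style_color: Optional[str] = None) -> Optional[str]:
    """Resolve a schemeClr value to a color, or None if nothing applies.

    "phClr" is treated as a placeholder for the color supplied by the style.
    Any other value is looked up among the colors the theme defines.
    """
    if val == "phClr":
        # No style color specified: ignore the placeholder.
        return style_color
    return theme_colors.get(val)


if __name__ == "__main__":
    theme = {"accent1": "4F81BD"}
    print(resolve_scheme_color("accent1", theme))
    print(resolve_scheme_color("phClr", theme, style_color="FF0000"))
    print(resolve_scheme_color("phClr", theme))
```

Which style supplies `style_color` at a given point in a document is exactly the detail the earlier comments ask to see spelled out; this sketch leaves that to the caller.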

Again, if you think the specs are underspecified, you should go back to the ODF specs and take a look. There are gaping holes in that spec and it was approved by ISO last year.

And for the suggestion of separate blogs…

Hey, nothing would please me more than to get off of these political discussions. But when you’re having to endure the smear campaign that we’re seeing against Open XML, a lot of the topics need to focus on helping to fight that FUD.

At this point the rest of the development team is focused on Office 14. I’m actually spending far more of my time on this issue than I would like, and as a result I’m not able to spend as much time with my team helping them get their specs together, etc.

That’s why folks like Doug Mahugh have stepped up, as they focus on helping produce content for developers. That’s also the point of the OpenXMLDeveloper.org community site.

I try to produce dev focused content too, but it’s hard when we keep getting off on these tangents arguing the same point over and over.

Have you considered setting up a blog? Posts like your last one are important and useful, but injecting them into a thread in the middle of a discussion just forces everyone to do needless context-switching.

Brian,

It sounds like you’re saying that Microsoft has a tradition of document representation built on an entirely different philosophy to what you could call the DocBook tradition, and that this is something consistent through many (all?) Microsoft products. If so, what sort of documentation is available that explains the philosophy behind the Microsoft tradition, as distinct from any one application of it?

It’s not about Microsoft, but instead the Word application on which the WordprocessingML format was based. DocBook was based on applications that were more for book authoring, etc. (chapters, sections, etc… basically a lot of structure and hierarchy). The ODF format was based on StarOffice Writer, and is somewhere in between the DocBook approach and the WordprocessingML approach (probably most similar to HTML).

That sounds really useful. I suggest you make a point about how this is a different, but equally valid, philosophy of what XML is about. It seems to me that this issue explains why so many people complain about OOXML being a collection of angle brackets that doesn’t follow the deeper design of XML: their concept of the "deeper design of XML" is actually "the DocBook tradition of XML", and since it’s never occurred to them that another tradition could exist, all they can see is really badly implemented DocBook.

"It seems to me that this issue explains why so many people complain about OOXML being a collection of angle brackets that doesn’t follow the deeper design of XML: their concept of the ‘deeper design of XML’ is actually "the DocBook tradition of XML"

IMHO, Andrew, their ( and my ) concept of the ‘deeper design of XML’ is to keep low the barrier to understand things ( this is one of the goals of XML ) and apply "common sense" to represent document structure.

After all, this is not rocket science, or is it? ( rocket science is the work that @stephane and others must do to decipher the binary and OOXML formats to get a decent implementation )

If you develop a document format for your own benefit ( or minor partners that achieve partial implementations ) you get one "kind" of format ( i.e: OOXML ).

If you develop a format in everyone’s benefit you get other kind of formats ( HTML, ISO-ODF, etc.).

Examples: compare this [1] with this [2] experience of a qualified[3] expert in XML

Brian said "This doesn’t mean it’s not XML, it just means it follows a different model."

No, what the comment says is good luck implementing it (ten years, starting from now), good luck interoperating with anything non-Microsoft out there. That’s a variant of explaining why the "backwards compatibility" claim is such a bunch of BS, why you are using it not to preserve your customer base, but to preserve the jewels (cash cow). I would have expected engineering of the 21st century in that new "XML". That’s what the comment says.

Brian said "Our model is a flat structure, but it’s still all XML, just different XML (with the few caveats aside like field codes being strings, etc.)."

I obviously don’t buy it. It is not XML at all. Good XML is meant to allow deterministic and simple implementations (parsers,stacks,tokenizers). OOXML is the negation of all that, because all you did was put angle brackets on a very old format that was thought to be a binary representation (for performance reasons).

Brian said "The enumeration value "phClr" is a value that can exist for the "schemeClr" element. If there is no style color specified, then you would just ignore this value. It’s a pretty simple check."

I am not arguing that’s simple or not. I am saying it’s not in the specs. Give me a break. I am only giving one example. The under-specification is everywhere in the specs. This is what you have to fix.

Brian said "Hey, nothing would please me more than to get off of these political discussions."

Give me a break again. This is a technical discussion from a real implementer. This is where you get ridiculous and very Jason Matusowish.

Brian said "Again, if you think the specs are underspecified, you should go back to the ODF specs and take a look."

My comments above have proven otherwise. Styles are clearly, deterministically and cleverly designed. Everything is in Chapter 14, and I can support styles in 19 XML-based OpenOffice formats. The best part is that I don’t even have to care if the style is used as part of a word processing or spreadsheet or … since the same serialization is shared across all applications. This is one example of people who have designed a format that 1) defines the stuff 2) puts the barrier to entry low 3) maximizes interoperability. Microsoft is very very very far from such principles. The more I think about it, the more the word "Open" in "OpenXML" sounds disgusting. (besides that OpenXML implies there is a closed XML, which is how ridiculous it gets when Matusow/Mahugh start calling people anti-Open XML lobbyists).

Brian said "At this point the rest of the development team is focused on Office 14."

I know that. We would not be having this discussion if Microsoft 1) had written a much better-qualified spec 2) was not out there buying their way to ISO. It’s behavior like this that is causing you trouble.

Brian said "That’s also the point of the OpenXMLDeveloper.org community site."

It’s not true. This site is dead. Questions are not answered.

Brian said "it’s hard when we keep getting off on these tangents arguing the same point over and over."

It’s not true. Those examples I gave are all new. The argument I can do without.

Andrew said "It seems to me that this issue explains why so many people complain about OOXML being a collection of angle brackets that doesn’t follow the deeper design of XML: their concept of the "deeper design of XML" is actually "the DocBook tradition of XML", and since it’s never occurred to them that another tradition could exist, all they can see is really badly implemented DocBook."

It’s the other way around. When you take a look at Excel’s file format until now, this was a serialization (BIFF) that went against human nature. To give you an idea of the magnitude of how bad it is, you have to write 10 KLOCs instead of 1 KLOC for just about everything. Moving to so-called XML was the opportunity for Microsoft to fix it, while still preserving the features. That would have certainly been hard work, but it was WORTH it, that was what would have made Microsoft the company that they think they are. Instead, they are only preserving their cash cow. I find "open" and "xml" not very meaningful when it comes to OOXML. That’s what makes the whole move ridiculous in the first place.

You talked about "common sense" and "human nature". If you’re talking there about things like declaring metadata before the data it applies to, then that’s what I’m referring to by "the DocBook tradition", and it’s something that seems to be as unnatural to the Word team as the Word tradition is to you.

You can argue that the Word tradition is inefficient, or that traditions have a network effect, or that Microsoft should have built up the level of knowledge in the community long ago, but you need to agree on definitions of terms first, or people in the other camp won’t understand what you’re saying.

This "tradition" allows Microsoft to push vendor-specific stuff such as dates which in turn destroy interoperability.

ISO mandates include interoperability.

Isn’t there a contradiction?

It’s dishonest not to put "Microsoft" or something that makes it clear that it’s vendor-specific stuff being pushed. It cannot be called "OfficeOpen XML", it cannot NOT include Microsoft in the title.

Stephane, the spec directly calls out that it started from Microsoft’s formats and that it was designed with that backward compatibility in mind.

Also, let’s be clear on things that truly break interoperability as opposed to things that are either awkward, or at least mildly painful to deal with.

The date issue for one does not in any way "kill interoperability." It just deals with stuff differently. As I said earlier, we tried the other approach with our earlier version of SpreadsheetML, and it actually was a huge problem. There were formula errors that we couldn’t predictably correct, and performance was significantly impacted. So we went with the other approach, and it’s fully specified how to deal with it.

Now, can we please try to start making some actual progress in these discussions? If you’re just set on hating Open XML, and regardless of what I say you’ll constantly repeat that it sucks and shouldn’t be considered as an ISO standard, then just stop. I get it. Every time I try to explain a problem you have, you move onto another issue, or pop back to a higher level and just complain about the spec as a whole. We make no progress as I can’t tell what your opinion was on that specific issue. I understand your meta point. I disagree, but unless we’re going to actually try to make some progress, I’m done with this discussion.

I assume the "dates" issue you’re talking about is the 1900/1904 stuff. I’ve been trying to avoid arguments based on evidence rather than logic, but in this case I’ll make an exception because I think it reveals a more fundamental issue.

The nearest I’ve heard to an explanation of why OOXML uses such an odd date system comes from an aside in a Joel on Software article ( http://www.joelonsoftware.com/items/2006/06/16.html ) and a (strong, IMHO) criticism in the blog of a KDE developer ( http://www.kdedevelopers.org/node/2834 ). In short, they grandfathered in a bug in Lotus 123 back in the stone age, and feel that backwards compatibility is more important than forwards compatibility on this one.
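For readers who haven’t run into the 1900/1904 issue, here is a minimal Python sketch of what a consumer has to do to turn a spreadsheet serial number into a calendar date. The function name and the `date1904` parameter are made up for this example (the caller is assumed to have already read the workbook-level date-system flag), and the conventions used are the commonly documented ones for the two date systems, including the inherited Lotus 1-2-3 behaviour of treating 1900 as a leap year; it is not quoted from the spec.

```python
from datetime import date, timedelta


def serial_to_date(serial, date1904=False):
    """Convert a whole-number spreadsheet serial to a calendar date.

    1904 system: serial 0 is 1904-01-01.
    1900 system: serial 1 is 1900-01-01, and serial 60 is the
    non-existent 1900-02-29 kept for Lotus 1-2-3 compatibility.
    """
    if date1904:
        return date(1904, 1, 1) + timedelta(days=serial)
    if serial == 60:
        raise ValueError("serial 60 is the fictitious 1900-02-29")
    if serial < 60:
        # Before the phantom leap day: count from 1899-12-31.
        return date(1899, 12, 31) + timedelta(days=serial)
    # After it: every serial is one day "ahead", so count from 1899-12-30.
    return date(1899, 12, 30) + timedelta(days=serial)


# The same calendar day has different serials in the two systems.
print(serial_to_date(35981))                 # 1900 date system
print(serial_to_date(34519, date1904=True))  # 1904 date system
```

The two serials above differ by 1,462 days, which is the offset between the two systems; a consumer that ignores the flag will silently shift every date by about four years.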

I think this reflects another philosophical difference between the ODF and OOXML communities: OOXML proponents feel that "interoperability" means "backwards compatibility first, and forwards compatibility if and only if it can be done without breaking backwards compatibility", whereas ODF proponents want backwards compatibility if and only if it can be done without making life harder in the future.

On the question of requiring OOXML to be relabeled "MS OOXML", that’s an interesting idea, but I disagree. Microsoft claim to be trying to change their business model from that of the standard-bearer for the trade secrets approach into a more community-friendly model. You may say that they’re just the same old wolf trying on sheep’s clothing, but personally I prefer to give them the benefit of the doubt where I can do so without putting myself in danger of getting eaten.

As with anyone trying to make a change in life, it’s best to praise loudly Microsoft’s good behaviour and not tie them too strongly to past misdeeds. Sending OOXML to ISO is a really big step for them, and they deserve credit for trying something new. If you feel that it’s not enough, by all means say so, but do it in a way that emphasises what they have to gain by doing it your way in future.

Brian said "the spec directly calls out that it started from Microsoft’s formats and that it was designed with that backward compatibility in mind."

I agree that you guys took the easy road, which is to put angle brackets on the old stuff, and that the way you describe this is "backward compatibility". I also note that this is far from being the truth, for instance the chart drawing engine, and there are a ton of casualties (besides the mapping, which is not documented).

Brian said "let’s be clear on things that truly break interoperability as opposed to things that are either awkward, or at least mildly painful to deal with."

Awkward for whom? Not for you I guess, the Office product group has implemented this stuff. Think of people outside. You can’t push 6,000 pages to ISO only because you can store custom XML somewhere in the package, right? (that was already possible before, just in an OLE stream instead).

Brian said "The date issue for one does not in any way "kill interoperability." It just deals with stuff differently. "

Again, it’s the armchair position you’re having here. I don’t understand that you don’t understand the issue. For everybody outside the Microsoft fence, not having an interoperable date type means that it is virtually impossible to manipulate dates across heterogeneous environments. And that’s one example, only ONE example.

Brian said "it actually was a huge problem. There were formula errors that we couldn’t predictably correct, and performance was significantly impacted."

I’m sure you realize the hypocrisy here. The Open source guys out there move mountains to deal with those problems daily. It’s not in Microsoft’s mono-platform culture, you have to admit it. Believe it or not, that’s fixable just by flagging the dates accordingly in an attribute for that. And that’s one of the great use cases of XML. Finally one!

Brian said "it’s fully specified how to deal with it."

Again no. The date type being used is the OLE-date type, and it’s both platform and vendor dependent. (I can provide you the WIN32 OLE APIs that were designed to deal with this date type). This just has to go, it’s a show-stopper.

Brian said "Now, can we please try to start making some actual progress in these discussions?"

I realize this is a fairly convoluted discussion, and I’m sorry for that. I just can’t fathom that if I (or somebody else) ask a question, you won’t bother to answer, and when I give an example, you dismiss it anyway. There is no way we can make progress. And yet, I thought that with the bold claims you have made for three years now about those file formats, only one counter-example is needed to show you wrong. Anyway, this does not get anywhere apparently. With the <scheme val="phClr">, have you noticed the same conclusion I have on this: that, with an EMPTY Word 2007 document, you can already find under-specified attributes in the XML. From that on, what is the logical conclusion when you start adding content?

I’ll say it again. My estimation for a proper specification is 600,000 pages. 100,000 for each of the 3 main formats, plus 100,000 for MSO/VML/DrawingML. And then 200,000 for the non-XML parts.

Brian said "If you’re just set on hating Open XML"

I am sure you realize I sell and support two extremely advanced products related to those formats. How could I hate it?

In a way, if everything was documented, then perhaps I would have to do reverse engineering elsewhere. So you get the idea, what you say does not make sense. What you should understand is that I may not be as passionate as you are with this stuff, __it’s not my baby__, and yet I could shut my mouth to avoid ruining my online reputation. Only, I think there’s a fight for quality specs, and I can’t swallow the ISO thing.

Could you explain what you mean by "interoperability" in this context? I’ve tried two different definitions above, but it doesn’t seem like you’re really using the word in either of those ways – which is not to say that your definition is a bad one, just that I need to know what it is before I can comment on your position.

That said, it does seem like your definition of interoperability encompasses something like what I’d call a "level playing field" requirement – that it should be equally easy for everyone to understand the format, regardless of their background. If so, that suggests the question: are you arguing that the ISO should reject OOXML altogether, or put it into the long process, so that these issues can be ironed out? I would expect a long ISO process to give open source programmers enough time to grok the ideology behind OOXML and move all their mountains into position, and would give the ISO enough time to iron out all the important issues that you’ve been raising.

Finally, I’ll apologise in advance for being presumptuous, but it feels like you’re a bit disoriented lately after spending a year getting used to having a brick wall to shout at, then suddenly having that wall taken away. It reminds me in a way of that experience we’ve all had where a user has been suffering a painful system that makes their lives a misery, and (to their eyes) we’ve been able to wave a magic wand and make it go away. Of course, we know that the "magic" is actually just sufficiently advanced technology, and modern debating practice is a very advanced technology.

Over the last two days, I’ve tried to showcase some of the techniques that trained debaters use, like agreeing definitions of terms or developing and executing arguments like you’d develop and execute an algorithm. And to your immense credit, you’ve all done what good programmers do: jumped on new technology that lets you work more effectively. I guess my point is, when you find yourself shouting at a brick wall, check for bugs in your debating procedure.

Andrew said "it feels like you’re a bit disoriented lately after spending a year getting used to having a brick wall to shout at"

I have no idea what you are talking about.

I’ve been in this business for almost a decade, and if there’s one thing obvious, it’s that the new angle brackets regime changes nothing about the huge time spent reverse engineering this stuff. In short, SpreadsheetML = BIFF12.

I call out Microsoft for doing the following 1) slutty title, that turns anything "open" irrelevant now (the Novell deal is just that, FUD) 2) pushing to a reputable organization something that is not adequately documented (600,000 pages) 3) missing the opportunity to come up with modern XML (simple formats thought from the ground up to be programmable and interoperable, easy to implement, discoverable XML, …)

"Not sure they are super productive in moving the conversation forward, but I appreciate them nonetheless…"

With an _open mind_, they could be "super productive". The links don’t point to a troll but to an expert opinion.

I believe MS is entering a new ( and unknown ) environment: open formats ( and to some extent, open software ). You will have to *listen* to this kind of argument and not take it as just "anti-OOXML" FUD, if you really want to make these formats pervasive in governments around the world. There are many ISO national bodies ( NB ) raising the same kind of arguments.

Yes, Microsoft has to learn how to understand a wider range of views, but everyone else has to learn how to speak in a language they can understand too.

For example, appeal to authority is a weak argument at best, and only works at all if you’ve both previously established respect for that authority. When Brian says "it’s alright for OOXML to do such-and-such, because ODF does the same thing", he’s saying that because he’s talking to people with a previously-stated respect for ODF. If you can find Rick Jelliffe or Doug Mahugh claiming distaste for OOXML, I should imagine Brian would be much more receptive.

I read the articles you linked to, and to be honest I’m not sure what point you were trying to make. I guessed before that you were trying to say that traditions have a network effect (XSLT and similar tools require DocBook tradition XML), but I could be wrong about that.

Implementing an argument really is like implementing an algorithm: you take certain inputs ("premises"), perform a series of actions on them ("deductions") and produce an output ("conclusion"). If your compiler ("Brian Jones’ brain") spits out an error message, you need to fix the bug, not order your compiler to fix it for you.

Again I’m happy you posted the links in terms of spreading more knowledge. I just wasn’t sure how to respond to them in terms of this discussion. I believe I actually linked to Bob’s post about "word’s awful XML" a few weeks ago and had some comments on it.

I’m sorry if I implied that I don’t appreciate people sharing information… that wasn’t what I meant.

Stephane,

I’m sorry I didn’t reply to your second to last comment. There were some good pieces in there I want to address. I’m going to think of a way to write a post that goes into those things.

The short of it though is that I’ve never tried to imply that we created the ultimate XML file format. We created an XML file format that could meet a specific need (backwards compat). Rather than keeping our formats closed and proprietary we fully documented the format and then gave it away. I would have loved nothing better than to have started from scratch and designed the ultimate document formats but that’s just not realistic given the fact that 99.9% of our customers that use the product just want things to work.

Andrew

Thank you for helping to facilitate the discussion. 🙂

I try not to get too defensive, but I’m sure that doesn’t always come through.