So - let's get down to what was the real purpose of going to Redmond - apart from the great breakfast I had at Lowell's in Farmer's market in Seattle - to test the pre-alpha version of Microsoft Office 2007 SP2 and its ODF-support.

(Let me start by apologizing for the late post, but I lost the USB-drive with my test-files on it, and I didn't find it until a few days ago.)

I have already listed some of the findings of the day in my previous post, so I'll try to get into more detail here.

What did I do?

Well, we would have some hands-on time with the latest build of Microsoft Office 2007 SP2 (apparently directly from a developer's machine), so I brought a bunch of documents I have worked on before - some of them were from the application interop-work I participated in during Fall 2007 for the Danish National IT- and Telecommunication Agency. Others I have created myself. I performed the following steps for each file:

Load the ODF-file in OpenOffice.org 2.4

Create a PDF-file of the document using a PDF printer driver (CutePDF)

Load the ODF-file in Microsoft Office 2007 SP2

Do a "Save as ODF" and prefix the original filename with "MSO". According to the Microsoft project managers I talked to, this would ensure I actually saved a version of the ODF-file that had been processed by the internal object model of Microsoft Office 2007 SP2.

Create a PDF-file of the document using a PDF printer driver (CutePDF)

Below I have listed for each document the following data:

Original file: somefile

Original file

New file

Generator: SomeApplication

PDF

Generator: Microsoft Office 2007 SP2

PDF

For each I will include some tech remarks on interesting subjects - if any.

There are a couple of things to note on a general level before we get started. Microsoft has chosen to implement ODF "by the book" in the sense that they have not paid much attention to bugs or "features" in competing applications. This has the peculiar effect that perfectly legitimate ODF-files produced by Microsoft Office 2007 SP2 might not load properly in competing applications. For more general ideas of what they did, you should check out Dennis Hamilton's post from the workshop. It is by far the most comprehensive of the ones posted since last week.

Remarks

This file is an ODT-file with an embedded ODS-spreadsheet. Loading this file into Microsoft Office shows a nice red cross and no spreadsheet. An inspection of the ODT-file shows that the content is pretty much preserved including the embedded ODS-spreadsheet. But when looking at the manifest file, the following appears:

It is the location of the graphical representation of the embedded spreadsheet. The media-type seems to be an old StarView Metafile format (confirm, anyone?) and Microsoft Word doesn't understand this image format - hence the red cross. This example highlights one of the points of bad interoperability: Small errors can cause big problems. Everything but the missing image is preserved, but this "small" error still makes the document useless.

Remarks

This file is included in the "Self-assessment"-package from the Danish National IT- and Telecom Agency. Loading the letter into Microsoft Office 2007 initially appears to produce an identical file, but even though the content itself is preserved, there are still areas with problems.

There is a border around the logo image in the header

The height of the header is not completely preserved

The "right margin" (which is really a stretched text box) is gone since the text box is wrapped around the text instead of being preserved in its full length

Page numbering is gone on the last page

A funny note: if you load the file generated by Microsoft Office 2007 in OOo 2.4, it loads perfectly fine as the original document. This suggests that the problems encountered by loading it in Microsoft Office 2007 are not problems with converting ODF to the internal object model of Microsoft Office 2007 but instead problems in the layout engines.

Remarks

This is another document from the Self-assessment package. It contains a few different features: a TOC, colored text, text boxes, a drawing, an embedded spreadsheet as well as some tracked changes. The generated document is kind of messy. The content has been "shuffled" around and again we have the problem with Microsoft Office 2007 SP2 not understanding the GDIMetafile image format. The embedded objects are fine themselves - the graphical representation of them is not.

Remarks

This file is another one of my own files that I have created earlier. It contains a mathematical formula in MathML. When loading it in Microsoft Office 2007 SP2, the mathematical formula simply disappears. I am kind of lost on the reason for this. It is not the DOCTYPE-declaration used by OOo (see next file for those details), so maybe it is the construction of my ODT-file that poses an issue for them.

Remarks

This file is almost identical to the one above - but it is generated by OOo 2.4 instead of by me and carries all the styling and configuration that comes with it. Here the file and the mathematical content load just fine. But an interesting thing happens when saving it again. The MathML-fragment is slightly altered from

The clever reader will notice that the semantic annotations used by OOo are removed from the MathML-fragment. The MathML is in general altered a bit, but the changes are not that big - most of them are visual things related to styling. The problem is that this MathML is un-consumable for OOo. The MathML-fragment produced by Microsoft Office 2007 SP2 is valid MathML (validated using Amaya), but even when I add the required !DOCTYPE, it still won't load in OOo.

Original file: Testfile_13.odt

Original file

New file

Generator: OpenOffice.org/2.0$Win32

PDF

Generator: Microsoft Office 2007 SP2

PDF

Remarks

(file has been removed at the request of the originator of the file)

This file is a bit more complex, and as with Testfile_08 it consists of a lot of different parts. Key issues here are failure to read GDIMetaFiles, borders around images, errors in the visual presentation of numbered/bulleted lists and lines being much thicker than in the original file. There is really nothing new in this file - it just confirms the problems identified with Testfile_08.

Remarks

This file is one of those template-files that are used a lot almost everywhere. You know, someone has created a "standard" document with correct header, footer and images, and this file is then distributed in the organisation. The conversion is actually almost error-free. There is a slight error with respect to border around images and rendering of them, but that is just about it.

Remarks

(both PDF-files have been created by OOo 2.4/Win32)

I created the file above to illustrate what would happen when working with spreadsheets. I used the infamous CEILING-function, but I was at that time not aware that Microsoft Office 2007 SP2 would throw out formulas from "unknown namespaces". Hence there is very little change - only the visible number of decimals after having been through Microsoft Office 2007 SP2 has been reduced to two. If you look in the XML generated, you will find one interesting thing, though:

Conclusions

Well, the investigation above was based on about 20 files tested, and they were primarily text documents (and one spreadsheet). Some of them were created by me and some were created by various parts of the public sector in Denmark. I have only looked at about half of the files, but a few other files are also available should you wish to play with them yourself. You can get them here: public.zip (3,02 mb).

Validation

I have made some effort to validate the ODF-files generated by Microsoft Office 2007 SP2. I downloaded the RelaxNG ODF 1.1-schemas from OASIS' website and used JING to perform the schema-validation. Since there is a known bug in the schemas, I have used JING with the "-i" flag set. Validating the structure of the package itself is a bit tricky (as reported by Rick Jelliffe) and I have not done that. I have done a schema-validation of the files "content.xml" and "styles.xml", based on the thought that these are the most complex files in the package. The result of the validation is that all files generated by Microsoft Office 2007 SP2 are valid ODF 1.1-files. I piped the result of the validation into an output file available here for your viewing pleasure: output.txt (1,92 kb).
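For those who want to repeat the validation, the preparation can be sketched roughly like this. It is only a sketch under assumptions: the function names are mine, and the jar, schema and file paths in the demo are placeholders for wherever you put Jing and the OASIS schema. The code pulls content.xml and styles.xml out of the package and builds (but does not run) a Jing command line with the "-i" flag.

```python
import io
import os
import tempfile
import zipfile

# The two package parts that are schema-validated.
PARTS = ("content.xml", "styles.xml")


def extract_parts(odf_file, target_dir):
    """Extract the parts to validate; return the paths of those present."""
    extracted = []
    with zipfile.ZipFile(odf_file) as package:
        names = set(package.namelist())
        for part in PARTS:
            if part in names:
                extracted.append(package.extract(part, target_dir))
    return extracted


def jing_command(jing_jar, schema, xml_file):
    """Build the Jing invocation; -i switches off ID/IDREF checking."""
    return ["java", "-jar", jing_jar, "-i", schema, xml_file]


if __name__ == "__main__":
    # Demo package built in memory; replace with a real .odt path.
    demo = io.BytesIO()
    with zipfile.ZipFile(demo, "w") as package:
        package.writestr("content.xml", "<placeholder/>")
        package.writestr("styles.xml", "<placeholder/>")
    demo.seek(0)
    with tempfile.TemporaryDirectory() as workdir:
        for path in extract_parts(demo, workdir):
            # "jing.jar" and "odf-schema.rng" are placeholder paths.
            print(" ".join(jing_command("jing.jar", "odf-schema.rng", path)))
            print("exists:", os.path.exists(path))
```

Each printed command can be handed to subprocess.run once Java and Jing are actually installed.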

All in all I think Microsoft has done a pretty good job. Obviously there is still some way to go until it reaches production quality, but I was pleasantly surprised to see the big difference in conversion results compared with the results of the ODF Converter from SourceForge.net I have worked with earlier. There are a couple of things I would like to note, though:

Graphical representations of embedded objects

Microsoft Office 2007 SP2 has problems with reading the graphical representation of embedded objects if the file is created by OpenOffice. It seems that it simply doesn't support the GDIMetaFile-format used by OpenOffice (and its derivatives). I think the "nice" way to solve this would be to load the object (if supported) and render an image of it again. The dimension of the image is available in the <draw:frame>-element and could be used to determine the size of the image.

Embedded objects

I noticed that handling of embedded objects is done using a "don't touch"-approach, which means that when loading an ODF-file with an embedded object, the embedded object is simply copied and not touched by Microsoft Office 2007 SP2 (unless it is activated by the user). I think this is a good approach. Consuming applications should respect the "integrity" of the consumed package and not alter its content unless they have to.

mimetype

A funny little thing: The mimetype-file in the ODF-package is created using CAPITAL letters, i.e. the file will be called "MIMETYPE". This causes the OpenDocumentFellowship validator to fail since it cannot find the file (with non-capital letters). I have suggested to Microsoft to generate the file using non-capital letters to enhance interop and validation across platforms where some are "a bit more" case-sensitive than Windows.

config settings

Microsoft has chosen not to use the configuration elements otherwise so widely used by Lotus Symphony and OpenOffice.org. I am not sure if I think it is a good or a bad idea, but since they do not use the settings.xml-file at all, they should remove the file completely.

The public review starts today, 7 August 2008, and ends 22 August 2008. This is an open invitation to comment. We strongly encourage feedback from potential users, developers and others, whether OASIS members or not, for the sake of improving the interoperability and quality of OASIS work.

The document is available at http://docs.oasis-open.org/office/v1.0/errata/cd02/ for those of you wanting to take a peek or contribute to the work. The document consists of 88 corrections to the text with references to both the OASIS-edition and the ISO-edition of ODF 1.0.

I am a bit unsure if the Japanese ISO-NB defect report as well as the comments from the original ODF ISO ballot are included in this errata - maybe they will not be dealt with before ODF 1.2 hits ISO (possibly) sometime this Fall.

A new study from the University of Illinois College of Law has made its way to cyberspace. The title is "Lost in Translation: Interoperability Issues for Open Standards - ODF and OOXML as Examples" and is done by Rajiv Shah and Jay P. Kesan. The study takes a rather novel approach compared to the debates that have been raging through the last year or so: Is the choice of a(ny) document format a silver bullet for interoperability?

The answer in the paper is a clear "No". When discussing the various interop-studies internationally, they note

While it is widely acknowledged that there are problems with interoperability across different formats, e.g., going from ODF to OOXML, there is an assumption here that all implementations produce the same ODF or OOXML.

Their conclusion is that this is not the case. What they did was to create a number of test documents using the reference implementation for each format, OpenOffice.org for ODF and Microsoft Office 2007 for OOXML. They then opened these documents in other applications supporting these formats.

The results are rather interesting:

Results for ODF

| Implementation | Raw score | Raw score percentage | Weighted percent |
|---|---|---|---|
| OpenOffice | 151 | 100% | 100% |
| StarOffice | 149 | 99% | 97% |
| Sun plug-in for Word | 142 | 94% | 96% |
| CleverAge/MS plug-in for Word | 139 | 92% | 94% |
| WordPerfect | 122 | 81% | 86% |
| KOffice | 121 | 80% | 79% |
| Google Docs | 117 | 77% | 76% |
| TextEdit | 55 | 36% | 47% |
| AbiWord | 48 | 32% | 55% |

Results for OOXML

| Implementation | Raw score | Raw score percentage | Weighted percent |
|---|---|---|---|
| Office 2007 | 148 | 100% | 100% |
| Office 2003 | 148 | 100% | 100% |
| Office 2008 (Mac) | 147 | 99% | 99% |
| OpenOffice | 141 | 95% | 96% |
| Pages | 142 | 96% | 95% |
| WordPerfect | 114 | 77% | 84% |
| ThinkFree Office | 101 | 68% | 83% |
| TextEdit | 52 | 35% | 43% |

They further conclude that

The final implication stems from the surprisingly good results for OOXML implementations. Critics of OOXML have argued that it was too complex and difficult to implement. While OOXML is a long and complex standard, it is possible to offer good compatibility. In fact, our results suggest that implementations of OOXML work as well as implementations of ODF. At the level of basic word-processing that we examined, neither standard had a dominant advantage over the other in terms of compatibility scores. While ODF has had a head start that has lead to more implementations, there appears no reason why OOXML cannot catch up. After all, several developers have provided independent implementations of OOXML.

... which should be interesting for those mandating usage of (an open) document format.

If nothing else this study highlights a couple of very interesting points:

You don't get good interoperability simply by choosing an open document format

Interoperability still has a long way to go and there is still a lot of work to be done.

The last part of the afternoon in Redmond was a round-table discussion of standards in general; what to do with them and how to work with them in terms of handling interop with other vendors implementing the same standard. It was really interesting and it was clear that Microsoft wanted to hear our input. Everyone in the Microsoft Office "Who's who"-book was there to participate and we had a good couple of hours debating the issues at hand.

One of the really interesting guys I met there was John Head aka "Starfish". He is a Microsoft partner as well as an IBM business partner, and he really grilled Microsoft with respect to some of the decisions they had made around how the UI behaved. You should check out his thoughts on his own blog. It was clear that he had some leverage in relation to Microsoft - even though I did not agree with everything he said.

An interesting topic was application interop. If you ask me, interop is based on standards but carried out by applications - in other words, standards do not give good interop simply by themselves. This idea was really confirmed when we talked about a thing John also mentions - how do I handle bugs in other applications? I think it was Peter Amstein who noted that an example of this was the 1900-leap year problem, where a decision made 20 years ago still haunts them. I couldn't agree more. But a similar example is application-specific extensions. ODF has this wonderful (read: awful) concept of "configuration item sets". These are specified in section 2.4 of ODF 1.0 and are intended to store various application-specific settings. The problem with these elements is that there are really no restrictions on how to use them. So you will end up with an application like OpenOffice.org 2.4 that puts data like this in the section:

So you now have OOo (and also Lotus Symphony, but to a lesser degree) putting in all these settings that not only directly affect the visual layout of the document but - in the case of e.g. "UseFormerLineSpacing" - specify that an application should behave as OOo 1.1. These are really "OOo Compat-elements".
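To see which settings a given application has written, you can list the config items in settings.xml. Here is a minimal sketch, assuming the config namespace from the ODF specification. The function name is mine, and the sample document is a hand-written illustration built around the "UseFormerLineSpacing" setting mentioned above, with a made-up value. It is not an excerpt from a real OOo file.

```python
import xml.etree.ElementTree as ET

# Config namespace defined by the OpenDocument specification.
CONFIG_NS = "urn:oasis:names:tc:opendocument:xmlns:config:1.0"

# Hand-written illustration; the value "false" is a placeholder.
SAMPLE_SETTINGS = (
    '<office:document-settings '
    'xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" '
    f'xmlns:config="{CONFIG_NS}">'
    '<office:settings>'
    '<config:config-item-set config:name="ooo:configuration-settings">'
    '<config:config-item config:name="UseFormerLineSpacing" '
    'config:type="boolean">false</config:config-item>'
    '</config:config-item-set>'
    '</office:settings>'
    '</office:document-settings>'
)


def list_config_items(settings_xml):
    """Map each config-item name to its (type, value) pair."""
    root = ET.fromstring(settings_xml)
    items = {}
    for item in root.iter(f"{{{CONFIG_NS}}}config-item"):
        name = item.get(f"{{{CONFIG_NS}}}name")
        items[name] = (item.get(f"{{{CONFIG_NS}}}type"), item.text)
    return items


if __name__ == "__main__":
    for name, (kind, value) in list_config_items(SAMPLE_SETTINGS).items():
        print(f"{name} ({kind}) = {value}")
```

Since the schema puts no real restrictions on the names, a consuming application can only interpret items it already knows about, which is exactly the interop problem.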

Now, the question is, what should other vendors do with these "extensions"? Well, Microsoft seems to be under a lot of pressure from organisations like the European Union to implement ODF strictly by the book, so they have chosen to ignore them (and other knowledge of bugs) completely. If you look at the settings.xml-file they actually strip it completely from content and do not use it themselves. Another example is mathematical content in text documents. As I documented some time ago, OOo has a bug requiring the MathML-fragment to include a !DOCTYPE-declaration - otherwise OOo will not display the math content. The result is that ODF with math generated by Microsoft Office will not load the math in OOo due to this OOo-bug. Is the approach chosen by Microsoft the right one? I think so for the following reasons:

Otherwise the result will be an endless propagation of these settings where each implementation will need to support each and every setting from all other vendors

I agree with John Head that it is good to put some pressure on OOo. It has for a long time been living relatively "low-key" in terms of criticism and market pressure, and it will be good for all of us to have the quality of the application enhanced.

Will this hurt interop? Yes, of course it will ... but I still think it is the right decision.

Another interesting thing we discussed was extensibility - how applications should/could extend a standard. This was one of the topics where it seemed that I was disagreeing with almost the entire room. What we talked about was: What does an application do with content it does not understand? Both ODF and OOXML have mechanisms to extend the document format with foreign namespaces etc., and I got the impression that most implementations simply remove content they do not understand when roundtripping documents. Microsoft has chosen the same approach, and the argument they made was that it was imposed on them by their "Trustworthy Computing"-guys, since preserving non-understood data could be used to hide sensitive information in documents. Even though I see the problem, I still think the argument is wrong. There are tons of other places and ways to hide information in a document, and I'd prefer to have the unknown elements and attributes preserved when roundtripping.
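The "remove what you do not understand" behaviour can be illustrated with a small sketch. This is not how any real implementation does it. It assumes a consumer that knows a fixed set of namespaces and drops every element outside them. The namespaces urn:example:known and urn:example:ext are made up, and attributes are ignored for brevity.

```python
import xml.etree.ElementTree as ET


def namespace_of(tag):
    """Return the namespace URI of an ElementTree tag, or '' if unqualified."""
    return tag[1:].partition("}")[0] if tag.startswith("{") else ""


def strip_foreign(element, known_namespaces):
    """Recursively remove child elements whose namespace is not known."""
    for child in list(element):  # copy, since we mutate while iterating
        if namespace_of(child.tag) not in known_namespaces:
            element.remove(child)
        else:
            strip_foreign(child, known_namespaces)
    return element


if __name__ == "__main__":
    # Both namespaces are hypothetical placeholders.
    document = ET.fromstring(
        '<doc xmlns="urn:example:known" xmlns:x="urn:example:ext">'
        '<p>kept</p><x:extra>dropped on roundtrip</x:extra>'
        '</doc>'
    )
    strip_foreign(document, {"urn:example:known"})
    print([child.tag for child in document])
```

Preserving the unknown content instead simply means skipping the removal and writing the foreign elements back untouched, which is what I would prefer.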

Wow - it seems the entire cast of Microsoft bloggers is laying the ground for a whole bunch of blog-entries after the DII-workshop in Redmond the day before yesterday (Wednesday July 30th).

So OK - I'll bite.

I am back in Denmark again after a hellish evening after the workshop where jet-lag was almost the end of me. If the cab driver had actually known the way to my hotel (he didn't, so I ended up giving directions in a town I had only been in for about 36 hours, in a country I was only visiting for the second time), I am sure I would have been catching Z's big time on my way home in the cab ... at 20.15 in the evening.

All in all it was very interesting to attend the workshop in Redmond. The day started with an introduction by Peter Amstein about the approaches and decisions Microsoft had made when working on their ODF-implementation. One of the more interesting discussions here was what Microsoft had done where there were differences between ODF and OOXML - should they be conservative or creative? An example of this was numbered/bulleted lists (one of the key areas where ODF and OOXML differ). OOXML allows each bullet in a list to have a separate colour, whereas ODF does not. Microsoft had chosen the conservative approach and simply removed all colours from the bullets, but Patrick Durusau noted that he thought it was possible to use styling to do it in ODF - the only downside was that, to his knowledge, this particular way of doing it was not supported by any ODF-implementation. I guess sometimes you are screwed regardless of what you do.

After this followed various project managers that demonstrated how their part of Microsoft Office supported ODF and talked about the remaining work. They each had complex documents that they showed us their work on and the result of saving them (OOXML-files) to ODF. I remember working with conversion using the SourceForge translator in Fall 2007, and I was really impressed by the fidelity of the conversions we saw.

Key points from this part were:

Shapes are converted without much loss of fidelity

Metadata is converted (Dublin Core)

Fields are converted

Headers/footers are converted

TOC etc are converted

Shapes are converted

SmartArt is converted to shapes

Images are converted including cropping etc

New 3D-shapes are reduced to the closest possible OpenDocument Drawing shape

Old 3D-shapes are converted

Formulas in spreadsheets are implemented according to the ECMA-376 spec with Microsoft's namespace

Mathematical content is converted from/to OMML and MathML

When loading an ODF-spreadsheet

Tables in OOXML-presentations are converted to shapes thereby making a "virtual table" since ODF does not support tables in presentations

Conversion of embedded objects is not fully supported

CustomXML is converted to "flat Xml" with content controls being discarded.

When loading a spreadsheet from ODF with a formula-namespace other than Microsoft's, just the values are converted and the formulas are discarded

Animations are converted in presentations

The concept of "master pages" is not converted to ODF
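The spreadsheet behaviour in the list above (formulas in a foreign namespace are dropped and only the values are kept) can be sketched as follows. It is only an illustration of the idea: it assumes the formula attribute has the shape "prefix:=EXPRESSION", and the prefixes "vendora" and "vendorb" are hypothetical stand-ins, not the prefixes any real application writes.

```python
def split_formula(value):
    """Split 'prefix:=EXPR' into (prefix, '=EXPR'); prefix is None if absent."""
    head, separator, tail = value.partition(":=")
    if separator and head.isidentifier():
        return head, "=" + tail
    return None, value


def import_cell(formula, cached_value, understood_prefixes):
    """Keep the formula only if its namespace prefix is understood."""
    prefix, expression = split_formula(formula)
    if prefix in understood_prefixes:
        return {"formula": expression, "value": cached_value}
    # Unknown namespace: fall back to the cached value alone.
    return {"formula": None, "value": cached_value}


if __name__ == "__main__":
    # "vendora" and "vendorb" are hypothetical prefixes.
    print(import_cell("vendora:=CEILING(A1;1)", 3.0, {"vendora"}))
    print(import_cell("vendorb:=CEILING(A1;1)", 3.0, {"vendora"}))
```

The second call shows the lossy case: the cell still displays a number, but the calculation behind it is gone after a roundtrip.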

The rest of the day consisted of hands-on labs (stay tuned for the tech-stuff from this part of the day) and a round-table discussion in the afternoon. I will talk about these parts in a couple of posts in the beginning of next week.

So once again here I am – waiting for a connecting flight out of Frankfurt, Germany. There is about an hour and a half until my flight to Seattle, where I will attend the Microsoft ODF DII-workshop about Microsoft’s implementation of ODF in Microsoft Office 2007 SP2. I am looking forward to seeing what they have accomplished, and especially to the hands-on lab on Wednesday afternoon, where we will have the opportunity to test our own documents to see the quality of the implementation. I have therefore brought my own little “tool-box” of documents to test in the lab. Since we were not required to sign any NDAs, I will try to document my tests as well as I can and post them on my blog ASAP. I hope they will be able to contribute to the on-going discussions taking place.

Of course, there are tons of different parameters to test, and it will be impossible for us to test them all, but a few areas do indeed deserve some attention, because these have traditionally been the areas causing most trouble. A non-exhaustive list would be

ODF-files with embedded objects

MathML

Spreadsheets

Presentations

Binary objects

Inline embedding of MathML-fragments (just for the fun of it)

ODF Drawing (vector graphics)

Numbering

Formulas in spreadsheets

Handling of anchoring of graphics and other document parts

Conversion from OOXML to ODF

OMML

DrawingML

VML

Embedded objects

XForms

Custom XML

clip-board content

The list above (apart from the latter three) nicely summarizes the problems we encountered when I participated in the work for the Danish Government (National Agency of IT and Telecommunications) in Fall 2007 about application interoperability between ODF and OOXML. I hope these tests will be able to contribute to the on-going discussions here in Denmark as well.

Another interesting thing to see will be how Microsoft has handled the various application-specific parts of ODF. How have they handled formulas in spreadsheets? How have they handled document protection of document parts? How have they handled the application-specific content in the <config-item-set>-elements of the other implementations? These are not trivial questions and they directly impact interoperability with other implementations of ODF.

I think it will be an interesting day tomorrow – and I’ll keep you posted on the progress. If you have any last-minute ideas and suggestions for what I should test, please write me an email or simply post a comment to this post. If you have files you’d like me to test, send them to me as well. You can use the “Contact” form on this blog to do so.

They say that parting is such sweet sorrow, but I beg to differ. Things have really, really cooled down in the otherwise warm and cozy OOXML/ODF-blogosphere. Rob and Arnoud seem to have gone back to their day-jobs and Brian has somehow completely disappeared from the face of the Earth. Doug is mostly writing about what other people are writing about and Groklaw has gone back to their original angle - the SCO-Shenanigans. The only active blogger at the moment seems to be Rick, but even here, the normally so loyal Rick-bashers in the comment-threads seem to have gone AWOL.

Nothing seems to be happening here in Denmark either. The Danish NSB met about a week ago, and we decided to make public the working documents that formed the foundation of the arguments and decisions of the last year. We formed a small technical sub-committee that did the technical work, first on the responses to the Danish public hearing in Spring 2007 and later on the responses from ISO to the 168 Danish comments to DIS 29500. The group consisted of CIBER Denmark, Ementor, IBM, Microsoft, ORACLE and the County of Aarhus. The technical group was an advisory group to the Danish SC34 mirror-committee. The working documents were made to allow us to keep up momentum and to document the progress we made. In short, for each meeting we made a list of the ISO editor responses that we could accept and those that we could not accept - and they were sent back to the ISO editor for further processing. The documents are in Danish, but they still give a good idea (regardless of native tongue) of what we did in the technical group and how we dealt with each issue. The documents are available at the Danish NSB website (last 7 documents at the bottom of the page in the section "Arbejdsgruppe-notater").

I have also more or less gotten back to my day-job as an Engineer with CIBER. I am currently investigating how to generate documents (ODF and OOXML) using .Net, and it is actually kind of fun. With that in mind I was interviewed for a video-cast by Microsoft for a small discussion about ODF and OOXML (they conveniently cut out the part where I said that I prefer the markup of ODF over the markup of OOXML but still prefer the tools for OOXML over the tools for ODF (for generating documents on e.g. a webserver or ERP-system), but what can you do?). One of the points I made in the interview was that the tools are really important. If there are no good tools to create documents, it will slow down the adoption-rate of the particular file type. Regardless of SW-political view, the .Net-platform is rather large on a world-wide basis, and the install-base of .Net-technology makes it a platform that should not be ignored (by size alone, if nothing more). And this puzzles me. If you look at the developer-hub of OOXML, you will find libraries, scripts and tools for just about any operating system and programming language available. But if you want to generate an ODF-file using .Net technology - what do you do? Well, you will probably find that the only (OSS) library available is AODL, a project under the ODF Toolkit umbrella. Unfortunately, the project is not a priority of OpenOffice.org. I wrote an email to the lead of the project (Dieter Loeschky from Sun) and he suggested that I join the project as a contributor. I have thought a bit about it, and I just might do so. I find it really important for the adoption of ODF that there are tools available for it, so if no one else will, I just might do it myself. I wonder if that will help everyone realize that I am a true ODF supporter.

And finally - the SC34 Ad Hoc Group 1 will convene in London at the end of July. We will meet and talk about what to do with both ODF and OOXML in the future. I am really looking forward to the meeting. The initial mailing list reveals that there will be delegates from all over the world:

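| Country | # |
|---|---|
| Austria | 2 |
| Canada | 1 |
| Chile | 1 |
| Czech Republic | 1 |
| Germany | 2 |
| Denmark | 3 |
| Finland | 3 |
| India | 4 |
| Japan | 3 |
| Republic of Korea | 2 |
| Malaysia | 2 |
| Norway | 3 |
| New Zealand | 6 |
| United Kingdom | 3 |
| United States | 2 |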

I hope we will have a couple of productive days in London. As Alex Brown wrote about after the Oslo plenary in April 2008, transparency of the process is a key point and any input from you, dear reader, to how this could be achieved would be appreciated.

And finally-finally, I seem to have been struck by a bad case of "YABS" - Yet Another Blog Syndrome. Within the next few weeks I will begin blogging on the best IT-website in Denmark, Version2.

Some time ago I wrote a couple of articles about how to generate ODF-files as well as OOXML-files using .Net technology (both articles are in Danish). For generation of OOXML-files I used the - at that time - new .Net 3.0 System.IO.Packaging assembly and for generation of ODF-files I used AODL - a part of ODF Toolkit.

I thought it was time to refresh my skills - and share them with you guys - since the OOXML/ODF-debate has cooled down to a more relaxing level.

A few weeks back Microsoft released the first production-code edition of their OpenXml SDK - version 1.0. I will dig into this a bit later.

I thought I'd kick this series off with a couple of articles about ODF-file generation on the .Net platform, but I was unpleasantly surprised to realize that it might not be as easy as it sounded. First, I was told that AODL was a dead project. Sure enough, the latest addition of code was in April 2007 and it seems that nothing has happened since. It looks as if the resources of ODF Toolkit are focused on ODFDOM - currently a Java-project. The problem is - AODL seems to be the only .Net-project available. I have stumbled across the ODF .Net project by IndependentSoft, but they sell an ODF library as Closed Source Software ... for (brace yourself) €999 a pop! Seriously - selling CSS-libraries is just sooo 2006 ...

And then I come to you, dear reader ... what the hell do I do? Do you know of other .Net libraries that allow me to create and manipulate ODF-files?

The beer-drinking was a huge success. There were only four of us, but we had a great afternoon and start of the evening. It was a lot of fun to meet offline and have a decent conversation on ODF and OOXML ... much more fruitful than blogging.