Introduction

Microsoft is pushing new file formats based on Zip and XML. Are those new file formats any good for Office developers? In other
words, can anyone safely access file parts directly, and finally stop depending on running instances of Microsoft Office and its COM object model, usually driven through VBA?

Microsoft never runs out of teasing. There are tons of videos (see here
and here, for example), screencasts, articles, and blog posts (mostly self-serving Microsoft blog
posts) about how much they are opening up. It boils down to the following excerpt from the Microsoft Office 12 introduction
white paper:

(...) The use of XML offers the benefits of greater transparency and openness
than were possible with the previous binary file formats. The new formats allow Office documents to easily integrate with existing and future line-of-business
systems, as the contents are now open and accessible. The new formats are also designed with long-term robustness and accessibility in mind. (...).

The binary file formats in use currently were designed in 1994—before the advent of XML and before the widespread exchange of documents and data that is
common today. These file formats, .doc, .xls, and .ppt, were introduced with the release of Microsoft Office 97, at a time when it was important to optimize
the files for storage on slow hard drives and "floppy" disks; it was not as crucial to focus on easy access to data within the files for better content
reuse, document generation, and seamless integration of the documents into business processes.

(...) The new XML-based file formats in these programs enable broader integration and interoperability between Office documents and enterprise applications.
Additionally, "Office 12" files are all wrapped using Zip technologies, which allows for easy access to the content parts as well as standard compression,
reducing file sizes and improving reliability and data recovery.

(...) Because documents stored in the Open XML formats are machine-readable and
editable by any text editor or XML processor, solutions need not use Microsoft Office programs to view or edit content within the documents.
Enterprise business solutions can access document contents easily and efficiently. Technology providers can utilize the Microsoft Office System and
Office authoring applications within their solutions, reuse Microsoft Office documents as other Office documents, or open and act on Office documents on
other platforms and in other applications.

They insist on the fact that, provided you make valid use of the XML, changing the content of pretty much anything in an
existing document can be achieved by sequentially:

unzipping the content

making appropriate changes to one or more XML parts, compatible with the provided XML schemas and open packaging relationships

zipping the updated parts back into the package
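If that promise held, the unzip-edit-rezip loop would need no Office library at all. A minimal sketch in Python, where the part name and the cell markup are assumptions about a typical workbook (Zip archives cannot be updated in place, so the whole package is rebuilt):

```python
import os
import zipfile

def replace_part(package_path, part_name, transform):
    """Rewrite one part (Zip entry) by rebuilding the whole package.

    Zip archives cannot be updated in place, so every other entry is
    copied verbatim and only `part_name` goes through `transform`
    (a bytes -> bytes function).
    """
    tmp_path = package_path + ".tmp"
    with zipfile.ZipFile(package_path) as src, \
         zipfile.ZipFile(tmp_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == part_name:
                data = transform(data)
            dst.writestr(info, data)
    os.replace(tmp_path, package_path)

# Hypothetical usage: bump a literal cell value in the first worksheet part.
# replace_part("book.xlsx", "xl/worksheets/sheet1.xml",
#              lambda xml: xml.replace(b"<v>39</v>", b"<v>40</v>"))
```

The rest of this article is about why this pleasantly simple picture does not survive contact with Excel 2007.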

Now open the file in Excel 2007. You get a blocking error message which says:

Excel 2007 cannot open the file we have manually updated

Followed by another even more frightening message:

According to Excel 2007, the problem is that the calculation chain is now corrupt

Interestingly enough, we thought parts of a spreadsheet file were individually updatable as long as we did not touch elements that are,
according to the ECMA 376 documentation, indexes to other parts.

Now that's an interesting issue. The ECMA 376 documentation says that the calculation chain is the graph of formulas, sorted with respect to their dependencies.
Suddenly, the little change we would like to make looks far more expensive. Rebuilding the graph of formulas ourselves means replicating what Excel itself does, and sure
enough that involves parsing the entire spreadsheet, discovering formulas, and applying formula-parsing algorithms to induce a graph of the dependencies.
It certainly sounds like we are going to have to rewrite a portion of Excel itself. Not every employer can afford waiting that long... Or perhaps that's Microsoft's way
of telling us: you shall not touch this without our approval.

We can perhaps get rid of the calculation chain, if we carefully delete the part and associated relationship. But then, there are three problems:

Parts are intertwined through implicit and explicit relationships (relationships are separate Zip entries). The granularity gets even finer, and the
risk of corruption increases. Again, we have to take into account a number of details that are outside our scope, such as how the calculation chain is defined with respect to the workbook part.

If I were a programmer, the objection would be: Microsoft provides no library that I can use, at least not one usable regardless of the
execution environment. Microsoft provides an API that requires a recent .NET run-time and installs on Windows XP SP2 and Windows Vista. There goes platform independence.

If I delete the calculation chain, I am admitting that the resulting spreadsheet is degraded, something Excel 2007 does not suffer from
since its infrastructure takes care of rebuilding it. In other words, making changes outside of Excel 2007 is not as first-class, safe,
and robust as it first sounded. Either the application takes care of a lot of details, replicating what Excel itself does (there is no known
non-Microsoft Excel 2007 implementation available out there), or Excel 2007 will have to do it for me the next time the spreadsheet is opened. If the changes
are made on a server without Excel 2007 installed and the resulting spreadsheet is distributed to employees throughout the organization, then every single
employee will have to sit through a full spreadsheet recalculation (arbitrarily lengthy) the next time they open it. The application is a second-class
citizen compared to Excel 2007, and that's what everybody experiences: me and the employees.

What this shows pretty clearly is that either we lack the tools, or Microsoft does not think we should be doing this in the first place. That being
said, if making a simple change to a cell is too much to ask, then what is this new format good for? The prospect of our recipients facing those
dreaded message boxes is not exactly an improvement over the well-known ugly stories of corrupted binary file formats.

Let's play the devil's advocate now and see how far it will take us. Here is the contents of the calculation chain, xl/calcChain.xml:

The reference to the formula in cell E2 is what seems to be causing the problem. All right then: since it's just XML, let's remove that reference.
Edit the calculation chain xl/calcChain.xml so it looks like this:

After the modification, if we open the file in Excel 2007, it still complains:

Excel 2007 cannot open the file even if the calculation chain is cleaned up manually

I particularly like the wording catastrophic failure. Now it looks like a lot of blood is spilling. Well then, let's stop the hemorrhage
and remove the contents of this part altogether. Make the update and open the file in Excel 2007; it continues to complain:

Excel 2007 cannot open the file even if the calculation chain is made empty

I guess it's time to take a look at the ECMA 376 documentation. In Part 4, page 2087, it says (emphasis mine):

I guess there we have it: we can't have a calculation chain part with no cell reference in it. Excel's guts spilling through the specs, isn't it? Wait a minute,
is this supposed to make it into an international standard?

Let's do a quick summary: we had Excel 2007 complain that the calculation chain was left with a reference to a formula that no longer existed,
but in fact Excel 2007 complains even with this manually fixed. The solution is to delete the physical calculation chain part and its relationship with
the workbook (defined in the workbook relationship part xl/_rels/workbook.xml.rels), and to update some other parts as well.

Let's get a visual diff of what it takes to make a "proper" change in a cell:

Original versus updated file: 40% of the parts need a change!

All we wanted to do was to make a change in a cell and we end up butchering the package.

Remember, we don't want to deal with formulas or anything else even further away from our concern; we have no notion of formulas, nor should we.
The idea is to change a value in a cell. If the file format were properly designed, all it would take is an in-place replacement of the cell contents.

With VBA, to make a change in a cell, we do: Range("E2").Value = 40. Period. End of story.

40% of the parts have to change to accommodate our minor edit. That certainly looks like having to take care of the format's own dirty laundry,
and it is not acceptable by any stretch of the imagination for a modern format reportedly (per Microsoft) aimed at improving interoperability across platforms
and applications. This is unfortunately exactly the situation we are in with the binary formats, something the new formats were supposed to fix!

Deleting a physical part is not a minor operation. Most Zip libraries can't delete a physical part without uncompressing everything in a temp folder
and rebuilding the package (thereby simulating a delete function). If your spreadsheet is big, this will take some time and space.
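The simulated delete can be sketched concretely in Python; whether a library goes through a temp folder or streams entry by entry as below, the cost stays proportional to the whole package, not to the part being removed:

```python
import os
import zipfile

def delete_part(package_path, part_name):
    """Simulate deleting a Zip entry: rebuild the archive without it."""
    tmp_path = package_path + ".tmp"
    with zipfile.ZipFile(package_path) as src, \
         zipfile.ZipFile(tmp_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            if info.filename != part_name:
                # Every surviving entry is read and recompressed: the cost
                # is proportional to the package size, not to the deletion.
                dst.writestr(info, src.read(info.filename))
    os.replace(tmp_path, package_path)

# Hypothetical usage: drop the calculation chain part (its relationship in
# xl/_rels/workbook.xml.rels still has to be removed separately).
# delete_part("book.xlsx", "xl/calcChain.xml")
```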

A full recalculation is needed in case there were formulas elsewhere.

Can you imagine making more complex changes, like replacing a pivot table with another?
Pivot tables have hundreds of ties to everything else in the spreadsheet.

2. Entered versus stored values

We all take for granted that when we type a value such as 1234.1234 in a cell of a spreadsheet, that's what actually gets stored.
Excel has this auto-number format matching capability where it tries to make sense of what is manually entered in order to deduce if that's a string, a number, a boolean,
or a date, and applies a number format accordingly, but what's being stored as a value is what is entered. By the way, if you hit Alt+F11 in Excel, and enter something
like Range("C3").Value and run the macro, you'll get the entered value in cell C3; if you enter something like Range("C3").Text, you'll get the
formatted value in cell C3, where the number format has been applied to the value. Note that the return value takes advantage of the locale and number
formatting, which means you may get "1234,1234" (note the comma) instead of "1234.1234".

The problem is that Excel 2007 does not store what we entered. If we read the XML, we are going to grab numbers that have rounding errors
compared to the actual numbers we typed. Let's see how far the problem goes:

Entered value      Stored value            Rounding error
123456.123456      123456.123456           0
12345.12345        12345.123449999999      o(1e-5)
1234.1234          1234.1233999999999      o(1e-4)
123.123            123.123                 0
12.12              12.12                   0

Not only is there a rounding error, but its order of magnitude changes depending on the value. Ironically enough, if you entered 4321.4321, it would be stored as is, with no rounding error.
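The pattern is consistent with IEEE 754 binary64 artifacts surfacing through a serializer that emits more digits than a shortest round-trip requires; that reading of Excel's writer is an inference from the table above, not documented behavior. A quick way to see the exact binary values behind those decimals, sketched in Python:

```python
from decimal import Decimal

# Decimal(float) reveals the exact binary64 value behind a decimal literal.
# Most short decimals have no exact binary representation; a writer that
# emits "too many" digits exposes the noise, while a shortest-round-trip
# writer (like Python's repr) hides it without losing information.
for text in ["123456.123456", "12345.12345", "1234.1234", "12.12"]:
    value = float(text)
    print(f"{text:>15}  repr={value!r}  exact={Decimal(value)}")

# Shortest round-trip is lossless: parsing the printed form recovers the
# identical double every time.
assert all(float(repr(float(t))) == float(t)
           for t in ["123456.123456", "12345.12345", "1234.1234", "12.12"])
```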

It is absolutely lost on me how implementers are expected to deal with this mess. The spreadsheet does not reflect the values actually entered, and
you can easily see where this leads. Imagine non-Microsoft applications in healthcare or critical systems relying on the spreadsheet data.
Not only does the rounding error seem arbitrary (one would have to go back and study the artifacts of IEEE floating-point representations, several decades of work), but its magnitude changes from value to value.
There is no way we can possibly take advantage of this, with one notable exception: if we are in an execution environment where reading those floating-point values does not produce those
artifacts, and returns the proper entered values, then we are fine. Problem: Microsoft does not document the execution environment. We can fairly assume it
is Windows, but what else? And if I am using Linux, how do I work with this?

It's important to understand that if we open the spreadsheet in Excel 2007, we see the proper values. No loss (based on the values entered)
seems to have occurred, the problem is that the data in XML just cannot be used as is.

As an aside, the stored value does not use the locale (it always uses the dot as the decimal separator), therefore we have to assume this is
all US English. If we wrote software in Excel VBA that grabs the value in cells then processes it, there is no way we could migrate our VBA code to work
with this XML part without substantial rework. We are left with Excel's own international implementation artifacts, undocumented.

3. Optimization artifacts become a feature instead of an embarrassment

Historically, the BIFF file format used in Excel spreadsheets was designed to be small and fast. But this design decision goes
all the way back to the early '90s, before the Pentium CPU even existed. The regular desktop computers we use every day are several orders of magnitude
faster, with far more memory, than those early machines. Yet Microsoft chose to keep those optimization artifacts as is, with the side effect that
they are now exposed at the surface as part of the XML.

Among the interesting optimizations is Excel's insistence on factoring formulas as much as possible. This happens when you
create a formula in Excel, then drag the cell to replicate it in other cells. These are shared formulas. Excel declares the formula itself only once, and
then uses a mechanism to infer the formula in the other cells (the relative position acts as an offset).

Shared formulas are supposed to be transparent for developers. Is this true?

To reproduce the scenario:

Start Excel 2007 and create a new spreadsheet

Enter value 2 in cell C4

Click on cell C4, click on the bottom-right corner, and drag cell C4 down to C10 to replicate the content

Enter value 3 in cell D4

Click on cell D4, click on the bottom-right corner, and drag cell D4 down to D10 to replicate the content

Enter formula =C4-D4 in cell E4

Click on cell E4, click on the bottom-right corner, and drag cell E4 down to E10 to replicate the formula

In cell E5, we see a "ref" attribute where the shared formula range applies, a "si" attribute which identifies the shared formula
range (itself redundant with the "ref" attribute), and the actual formula for that cell. But in cell E6, the cell below E5 in the grid, we see a definition with
just a "si" attribute. In other words, cell E6 is linked to cell E5.

Here is the problem. Let's say we make a manual change to cell E5 and remove the formula. We've seen in the first section of this article
that the calculation chain is left out of sync, but an additional problem is that cell E6 is also left out of sync, because its "si" attribute now points nowhere.
Note that the situation isn't any better if we merely update the formula: cell E5 is then fine, but the linked cells will reference the new formula,
not the old one. So a simple change in cell E5 actually spreads unintentionally. It goes without saying that this makes a simple change very expensive.

That is a direct result of the optimization artifact. Anyone willing to make a change cannot proceed without taking care of the dependent cells whenever the cell defines a shared formula range.
The problem only gets bigger, since we have to either remove the shared formula altogether from all cells, or translate the shared formula definition accordingly,
which implies parsing the formula (the goal was only to change a cell!) and applying a number of offset changes,
many of which are left for the user to discover (formula tokens are complex). To remove the shared formula, the actual formula definitions in the linked cells have to be built,
and that's where an algorithm has to find a way to create them, essentially rolling back Excel's optimization as a preliminary step.
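Rolling the optimization back can be sketched for the trivial case only. The Python below handles nothing but plain relative A1-style references; real SpreadsheetML formulas, with absolute references, ranges, string literals, and sheet names, require a genuine formula tokenizer, which is exactly the cost being described:

```python
import re

# Relative A1-style cell references; absolute refs ($C$4), cross-sheet
# ranges, and string literals are deliberately out of scope for this sketch.
REF = re.compile(r"([A-Z]+)([0-9]+)")

def shift_rows(formula, delta):
    """Shift every row number in a formula by `delta` rows."""
    return REF.sub(lambda m: m.group(1) + str(int(m.group(2)) + delta),
                   formula)

def unshare(master_formula, first_row, last_row):
    """Expand a shared formula into one explicit formula per row."""
    return {row: shift_rows(master_formula, row - first_row)
            for row in range(first_row, last_row + 1)}

print(unshare("C4-D4", 4, 6))
# -> {4: 'C4-D4', 5: 'C5-D5', 6: 'C6-D6'}
```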

We have a case of an optimization artifact becoming an embarrassment, not a feature.

Let's see how the situation compares to the old Excel binary file format (BIFF internal format is stored inside an OLE container). Here are the corresponding BIFF records:

With the Excel binary format, the design is right. The shared formula is defined outside a cell, so you can very simply remove the formula in cell E5 without breaking other cells.
From a programming perspective, the new XML format qualifies as a regression compared to the binary file format.

4. VML isn't XML

Contrary to what the ECMA 376 documentation says in many places, VML drawing parts are not deprecated at all. VML is in fact
pervasive in Word, Excel, and PowerPoint documents, which makes the problem all the more blatant.

In the ECMA 376 documentation, part 4, page 4343, we learn (emphasis mine):

[Note: The VML format is a legacy format originally introduced with Office 2000 and is included and fully defined in this Standard for backwards compatibility
reasons. The DrawingML format is a newer and richer format created with the goal of eventually replacing any uses of VML in the Office Open XML formats.
VML should be considered a deprecated format included in Office Open XML for legacy reasons only and new applications that need a file format for drawings are strongly
encouraged to use preferentially DrawingML. End note]

Here is a way to create a VML part in a new document:

Start Excel 2007 and create a new spreadsheet

Right-click and choose Insert Comment

Enter a comment

Save the spreadsheet (xlsx file)

Close it, unzip it
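At this point the VML part can be located with nothing but a Zip library. A small sketch in Python; note that the part name xl/drawings/vmlDrawing1.vml is simply Excel 2007's default naming, and the authoritative route would be to follow the comment part's relationships:

```python
import zipfile

def vml_parts(package_path):
    """List the VML drawing parts in an Office Open XML package."""
    with zipfile.ZipFile(package_path) as z:
        return [name for name in z.namelist() if name.endswith(".vml")]

# Hypothetical usage on the spreadsheet saved above:
# vml_parts("book.xlsx")   # e.g. ['xl/drawings/vmlDrawing1.vml']
```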

The corresponding XML in the drawing part xl/drawings/vmlDrawing1.vml is:

From a pure markup perspective it is XML, but there are application-encoded values such as ="m,l,21600r21600,l21600,xe"
and 4, 15, 2, 10, 6, 15, 6, 4 which contradict proper XML design: the result is both poor XML and unusable by XSLT transforms.
VML allows far more complicated values in the general case, including conditionals expressed in their own mini-language. Let's take this example of VML:

This is XML, right? There are angle brackets, it conforms to the XML W3C recommendation, therefore it's XML.

VML also contains application-specific markup with no documentation associated with it; for instance, in the example above,
4, 15, 2, 10, 6, 15, 6, 4. Because only Microsoft used this markup, there was no need to define it in its own namespace, and so on. But now that VML is part
of ECMA 376, either ISO accepts vendor-specific markup, which defeats the point of ISO standards in the first place, or it simply contradicts ISO standards.

The implication for an implementer, or for anyone willing to make a change, is that there is no way to edit this thing
without a proper implementation of the VML library itself. The risk of corruption is extremely high. Obviously, Microsoft expects VML parts to be
replaced by other VML parts as a whole, with no finer granularity. The problem is that VML, too, can contain references to objects and other
parts, which rules out even a simple template scenario.

VML is an old, undocumented library that speaks volumes about Microsoft's past lock-in strategy. In 1998, Bill Gates himself sent
a memo to the Office product group (led by Steven Sinofsky at the time),
a memo disclosed to the public thanks to the Iowa consumer case:

From: Bill Gates
Sent: Saturday, December 5 1998
To: Bob Muglia, Jon DeVann, Steven Sinofsky
Subject : Office rendering
One thing we have got to change in our strategy - allowing Office documents
to be rendered very well by other peoples browsers is one of the most destructive
things we could do to the company.
We have to stop putting any effort into this and make sure that Office documents
very well depends on PROPRIETARY IE capabilities.
Anything else is suicide for our platform. This is a case where Office has
to avoid doing something to destroy Windows.
I would be glad to explain at a greater length.
Likewise this love of DAV in Office/Exchange is a huge problem.
I would also like to make sure people understand this as well.

The undocumented VML library shipped with Internet Explorer 5 in 1999, and has been part of Internet Explorer ever since. The DAV protocol
(Distributed Authoring and Versioning) is an international, cross-platform standard, open to everybody. God only knows why Bill Gates likes VML so much
and dislikes DAV so much...

For the record, the ECMA 376 documentation describes the VML markup, but it does not specify it. Many application-defined behaviors are left for one to guess.

5. Open packaging parts minefield

The underlying architecture of how Zip entries relate to each other is what Microsoft calls the "open packaging conventions". What it means is that Zip entries are neither
independent nor related through a single master Zip entry acting as a directory of all relevant entries. Instead, there is a logical tree of entries, which uses separate
Zip entries to define the relations between entries. The logical tree has nothing to do with the physical tree of Zip entries in a package, despite Microsoft continuously
using screenshots of Windows XP's built-in Zip folders to mimic a folder hierarchy.

The problem with such an architecture is that a part may or may not relate to another, and there is no standard way to know. Often there is an r:id attribute
right in the content of some XML part that tells the application there is a relation, but this is not standard. By the way,
Microsoft's PDF competitor for fixed formats, XPS, is based on the same underlying architecture, except that the team who developed XPS did not quite
want to play by the same rules as the Office team. For instance, the main XPS document entry relates to one or more XPS pages with an attribute
such as Source="Pages/1.fpage". In other words, they do not use the r:id attribute, relying instead on their own mechanism. This makes
it impossible for a generic library to know which part relates to which, and that has an unfortunate consequence.

The unfortunate consequence is that, being unable to tell whether a part relates to another, you cannot know, when you
delete a part, whether you are going to corrupt the document. The document becomes corrupt if it points or relates (implicitly or explicitly) to a
missing part. It's unclear why Microsoft chose this way of doing things, which obviously leads to internal chaos, instead of just copying the design
from the OpenOffice project, where a central directory is used (the OpenOffice Zip initiative predates Microsoft's by at least three years, despite Microsoft stealing the thunder).
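The explicit side of the convention is at least mechanical: every part's declared relationships live in a sibling _rels/*.rels entry, so a generic walker can recover that much of the graph. A sketch in Python; note that implicit, content-encoded references (like the XPS-style Source attribute above) stay invisible to it, which is precisely the problem:

```python
import posixpath
import xml.etree.ElementTree as ET
import zipfile

# OPC stores the relationships of part "xl/workbook.xml" in a sibling entry
# "xl/_rels/workbook.xml.rels" (package-level ones live in "_rels/.rels").
REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"

def relationship_graph(package_path):
    """Map each .rels entry to its (Id, resolved target part) pairs."""
    graph = {}
    with zipfile.ZipFile(package_path) as z:
        for name in z.namelist():
            if not name.endswith(".rels"):
                continue
            # The source part's folder is two levels up from the .rels entry.
            base = posixpath.dirname(posixpath.dirname(name))
            root = ET.fromstring(z.read(name))
            pairs = []
            for rel in root.findall(REL_NS + "Relationship"):
                if rel.get("TargetMode") == "External":
                    continue  # external URIs are not package parts
                target = posixpath.normpath(
                    posixpath.join(base, rel.get("Target")))
                pairs.append((rel.get("Id"), target))
            graph[name] = pairs
    return graph
```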

When you don't know the dependencies of a part, the consequence is obvious: you leave those parts alone. Do this enough
times and the package gets cluttered; soon enough you end up with a package containing any number of parts that God only knows the reason for. Add to this
that you can add a part of any content type (arbitrary MIME type), and you have a recipe for disaster. Among other things, viruses could proliferate.

Microsoft's DeletePart() method, available in their System.IO.Packaging library (itself part of .NET), does not solve this
problem. We have a case of poor engineering, creating unnecessary problems for others to worry about.

6. International, but US English first and foremost

An important ongoing tension with Office documents is support for locales. Microsoft historically used a number of mechanisms to address this
need, but these kept evolving, and Microsoft aggregated all of them to keep compatibility with older versions. What was hidden is now surfaced in the
XML. Anything that gets displayed, calculated, rendered, or stored depends, one way or another, on a complex and undocumented combination of locale settings,
including: the Office application language, the Office application language settings (per application), the Office document language settings (per document), and the
system locale of the operating system.

To save themselves time, Microsoft chose to store XML using the US English locale regardless of all the settings above.

This has an unfortunate consequence for implementers or anyone willing to make a manual change. Indeed, Microsoft is forcing everybody
else to adapt to US English locale conventions (separators, date formats, formula conventions, ...) even though, when using Office interactively, this
is hidden from the user. The Office application infrastructure manages to abstract it away from users, which is a good thing. Office developers using
VBA all over the world are used to working with localized functions; the complexity is hidden from them. But since the XML resurfaces this US English
locale, all that complexity is left for one to implement. We are talking about two decades' worth of internationalization issues, both Office-related
and Windows-related. To get an idea of how bad the situation is, suffice it to say that a Microsoft employee on the Windows internationalization team
has a blog where he posts daily horror stories.

Also, for Excel formulas, it means the formula names are US English formula names, which you will never see in Excel if you are using a
localized version such as French or Brazilian Portuguese. It's left for one to guess how to map function names from one locale to another, and of course the ECMA 376 documentation
does not provide those localized formula names. If you do not implement a mapping to a locale, ideally your customer's, you are implicitly committing
to working with US English function names (plus US English separators, ...). If your company has invested in libraries or developed libraries in-house,
they cannot be reused.
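The mapping an implementer must reconstruct can be sketched. The three pairs below are genuine French Excel names (SOMME/SUM, MOYENNE/AVERAGE, SI/IF), but a production table would need hundreds of entries per locale, none of which the specification provides, and the argument separator (';' in French, ',' in US English) compounds the problem:

```python
import re

# Illustrative French -> US English Excel function names; ECMA 376 documents
# only the US English side, so tables like this must be reverse-engineered
# locale by locale.
FR_TO_US = {"SOMME": "SUM", "MOYENNE": "AVERAGE", "SI": "IF"}

# Matches an uppercase (possibly accented) identifier followed by '('.
FUNC = re.compile(r"([A-ZÀ-Ö]+)(?=\()")

def to_us_english(formula_fr):
    """Translate function names in a French formula to US English names."""
    return FUNC.sub(lambda m: FR_TO_US.get(m.group(1), m.group(1)), formula_fr)

print(to_us_english("SOMME(A1:A10)"))   # -> SUM(A1:A10)
print(to_us_english("SI(MOYENNE(B1:B9)>2;1;0)"))  # separators still French!
```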

Can it get any worse than that?

Unfortunately, yes.

Despite Microsoft's insistence on storing everything using the US English locale, they still manage to store a number of contradicting country/encoding flags in the XML.
Examples are the DrawingML and VML languages: they store locale/encoding tags around text chunks, whereas text chunks elsewhere in the document carry no such
tags. It is entirely possible that DrawingML and VML are implementations which localize nothing themselves but store
localized tags in the document, while the other languages (WordML, SpreadsheetML, ...) are implemented the other way around: their implementations are chock-full
of locale settings, but they need not store anything in the document itself. In other words, everything gets localized at run-time with WordML,
SpreadsheetML, and so on, except DrawingML and VML.

It's clear at this point that the legacy shows... One would have expected Microsoft to fix this once and for all by providing a consistent framework. They chose not to, and as a result
the heavy lifting is left to any implementer or anyone willing to make a change. What we are talking about here are entire internationalization stacks,
which can represent years of work and stabilization. Ironically enough, you will not only have to implement this stuff (reverse engineering, since it is not addressed
by the ECMA 376 documentation), you will have to implement it in a way that reproduces current Office flaws. No matter how correct your implementation is, you have
to retrofit it to work just like Office does.

To get a flavor of non-US English within US-English (thereby violating ECMA 376's own rules), all you have to do is insert a chart:

Gras and Italique are French for Bold and Italic. Just because I am using a French version of Excel 2007,
the produced format contains French localized fragments; therefore anyone willing to read Excel 2007 files in the general case must be ready to parse non-US English.
The ECMA 376 documentation says, in section 3.3.1.36, Part 4, page 1965 (emphasis mine):

&"font name,font type" - code for "text font name" and "text font type", where
font name and font type are strings specifying the name and type of the font, separated by a comma. When a hyphen appears in font name, it means "none specified".
Both font name and font type can be localized values.

What is "can be" supposed to mean? Who reviewed this documentation? And where are the localized values to be expected?

Reading the documentation, I have the feeling that:

"can be" is short for "it's complicated, we couldn't get this written on paper in a small space"

The localized names are not provided, so even if I try to do the heavy lifting myself, despite the fact
that SpreadsheetML violates its own rules (it's supposed to be encoded using US English), where can I get the material to write such a library?

7. Many ways to get in trouble

The sheer size of the ECMA 376 documentation, over 6,000 pages, tells you how much legacy Microsoft is willing to carry into the
future. One example of such legacy clarifies what it takes to implement even a portion of the documentation: text formatting.
Each of the three applications, Word, Excel, and PowerPoint, uses its own text formatting markup. Worse, the shared libraries (VML, DrawingML, MathML, ...)
also use separate text formatting markup, each different. Even worse, if that's possible, Word has many ways to do text formatting. Excel has many ways
to do text formatting. PowerPoint has many ways to do text formatting.

By "many ways" I mean different markup, sometimes drastically different: one variant may carry no country/encoding information at all,
while another is cluttered with country/encoding markup.

If Microsoft were to design a general-purpose Office document model (note: ECMA 376 is the description of one specific Office document model: Microsoft's),
they would have factored all of this into a single text formatting markup. God only knows why they chose instead to keep all the
legacy and try to get away with the mess by giving it as little publicity as possible.

Now enter the implementer, or someone willing to make a change to a document. There are three scenarios:

write a document

read a document

read and write the document

The third scenario is just a combination of the others, so there is nothing interesting to say about it.

The first scenario is the simplest. To write a document of a given type, including a given set of objects (from shared languages or
not), you only need to write the document in a way that is compatible with the expected XML. In other words, you can stick to a single text formatting markup
model. You decide which one, whether you implement one or more, and so on. So from a writer's perspective, you don't suffer much from the problem.

Now consider the second scenario. To read a document, you cannot assume what's in it, so you have to implement all
possible combinations of objects that may appear in the document. In particular, you have to implement all the text formatting markup
models, because any of them may well be the XML you face. This is a horrible scenario. To support it, either you are Microsoft, or you have years
of work ahead of you on the subject, with plenty of implementation already done. There is no way around it; the barrier to entry is sky high.

Of course, if you read a document, read the markup, and do nothing with it, or nothing of substance with it, it's not quite the same
problem. But then, remember that even reading a small chunk of markup can be complicated because of the implicit semantics. You don't need a lot of XML
markup to find yourself unable to process it in any meaningful way.

To give you an example of how bad the situation is, here are four different Excel text formatting markup chunks, all meant to do the same thing (not entirely accurate here,
but you get the idea):

The beauty of a file format that is impossibly hard to read and update, yet reasonably easy to write from scratch, is that it fits perfectly into the read-only model
which is exactly how Microsoft maintains its monopoly on Office documents. As a side effect of its proprietary nature (Office 2007 documents are extensions of ECMA 376 documents),
it provides zero interoperability with anything else that exists.

From a technical marketing perspective, you can always try to start a project that will support ECMA 376, but your chances of completing that project are exactly zero.
You can hear about projects starting here and there, and about programs that write documents that can be opened in Office 2007 (note that the only test that
can possibly be made is whether or not Office 2007 opens the file, regardless of the flaws of said program, and regardless of the impedance mismatch between Office 2007
documents and ECMA 376 documents), but that is not evidence of progress in interoperability across applications and platforms, contrary to the claims.

For instance, to date, there is no computer program that perfectly mimics any of the old Microsoft Office versions (97, 2000, XP, 2003). It just does not exist.
In addition to the impossibility of perfectly mimicking a complex and proprietary piece of software, third parties that went too far in supporting the binary file formats would
risk violating Microsoft's intellectual property. Example: VBA macros. With Office 2007, Microsoft is bringing all of this forward for backwards compatibility
reasons, therefore "opening up" changes nothing. What was proprietary is still proprietary, and barely referenced in ECMA 376.

8. Windows dates

Microsoft uses all kinds of date types; there are differences not just from one Office application to the next, but even at the library level.
Despite storing document metadata properties of type date in an ISO-compatible format (metadata properties are, for instance, the creation date
of the document), Microsoft insists on using its legacy date types elsewhere. Unfortunately, this has consequences.

It's well documented elsewhere that the date type Microsoft uses for spreadsheets is fundamentally flawed for a number of reasons. But what
isn't often said is that this date type is actually the OLE date type. Here is what the documentation of the corresponding Windows OLE API functions says:

A variant time is stored as an 8-byte real value (double), representing a date between January 1, 100 and December 31, 9999, inclusive.
The value 2.0 represents January 1, 1900; 3.0 represents January 2, 1900, and so on. Adding 1 to the value increments the date by a day. The fractional part
of the value represents the time of day. Therefore, 2.5 represents noon on January 1, 1900; 3.25 represents 6:00 A.M. on January 2, 1900, and so on.
Negative numbers represent the dates prior to December 30, 1899.
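The serial-number scheme just quoted can be sketched in a few lines. This is an illustrative conversion only (not Microsoft's code), and it deliberately ignores the historical quirk that Excel treats 1900 as a leap year, as well as the special fractional semantics of negative OLE serials:

```python
from datetime import datetime, timedelta

# Day 0.0 of OLE "variant time" is 30 December 1899, so 2.0 is
# 1 January 1900, exactly as the quoted documentation states.
OLE_EPOCH = datetime(1899, 12, 30)

def ole_to_datetime(serial: float) -> datetime:
    """Convert a positive OLE variant-time serial to a datetime.

    Naive sketch: negative serials (dates before 30 December 1899)
    have special fractional semantics in OLE and are not handled.
    """
    return OLE_EPOCH + timedelta(days=serial)

def datetime_to_ole(dt: datetime) -> float:
    delta = dt - OLE_EPOCH
    return delta.days + delta.seconds / 86400.0

print(ole_to_datetime(2.5))   # 1900-01-01 12:00:00 (noon, 1 Jan 1900)
print(datetime_to_ole(datetime(1900, 1, 2, 6)))  # 3.25
```

Both outputs match the quoted documentation: 2.5 is noon on January 1, 1900, and 6:00 A.M. on January 2, 1900 round-trips to 3.25.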

Using the SYSTEMTIME structure is useful because:

It spans all time/date periods. MS-DOS date/time is limited to representing only those dates between 1/1/1980 and 12/31/2107.

The date/time elements are all easily accessible without needing to do any bit decoding.

The National Language Support data and time formatting functions GetDateFormat and GetTimeFormat take a SYSTEMTIME value as input.

It is the default Win32 time and date data format supported by Windows NT and Windows 95.

The VariantTimeToSystemTime function will accept invalid dates and try to fix them when resolving to a VARIANT time.
For example, an invalid date such as 2/29/2001 will resolve to 3/1/2001. Only days are fixed, so invalid month values result in an error being returned. Days are
checked to be between 1 and 31. Negative days and days greater than 31 result in an error. A day less than 31 but greater than the maximum day in that month
has the day promoted to the appropriate day of the next month. A day equal to zero resolves as the last day of the previous month. For example, an invalid
date such as 2/0/2001 will resolve to 1/31/2001.
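The day fix-up rules just quoted can be mimicked with simple date arithmetic. This is a hypothetical re-implementation for illustration, not the actual Windows code:

```python
from datetime import date, timedelta

def fix_up_day(year: int, month: int, day: int) -> date:
    """Mimic VariantTimeToSystemTime's documented day fix-ups: only
    days are fixed (0..31); out-of-range months and days are errors."""
    if not 1 <= month <= 12:
        raise ValueError("invalid month values result in an error")
    if not 0 <= day <= 31:
        raise ValueError("negative days and days greater than 31 are errors")
    # Start from the 1st, then shift: day 0 lands on the last day of
    # the previous month; a day past the month's end is promoted into
    # the next month.
    return date(year, month, 1) + timedelta(days=day - 1)

print(fix_up_day(2001, 2, 29))  # 2001-03-01
print(fix_up_day(2001, 2, 0))   # 2001-01-31
```

Both examples reproduce the documented behavior: 2/29/2001 resolves to 3/1/2001 and 2/0/2001 resolves to 1/31/2001.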

As explained, the function makes all kinds of fix-ups internally, and that's exactly the problem. The date type used in Excel, which
is exactly this OLE date type, is incompatible with everything else out there, platform-dependent, and undocumented (the documentation above describes behaviors, but it does not specify the date type).

If Windows dates were replaced by ISO dates, a dependency on Windows would be gone.

So, reading a cell containing a date in a spreadsheet and making sense of it implies using one of these two OLE API function
calls. Even better, there is no cell date type. A cell is either a string, a number, and so on, but a date is just a number with no associated type. In fact, the only
way to know that a cell contains a date is to look up and parse the number format that may be associated with that cell. And that's where you enter another black hole.

The second one is particularly interesting. It says: use the French locale (40C); if the number is greater than 120, apply red; and do the
formatting as follows, with plenty of special padding/layout characters (described in the ECMA 376 documentation, but not specified). Implementing those undocumented patterns
requires north of 10,000 lines of code, all subject to wild guesses and platform interoperability problems.

An even more interesting bit is that the XML used in ECMA 376 departs in a number of non-standard ways from the XML produced by Excel 2003 (data-only XML). Let's make a comparison.

To reproduce the scenario:

Start Excel 2007 and create a new spreadsheet

Enter 11 Jan 2001 in a cell (note: translate this date according to your own regional settings)

If you are puzzled by what the value 36902 might be, it's Excel's OLE date encoding! It is not possible to infer that the cell contains a date.
You have to explore the number format attached to style s="1", which in turn is described in another part of the package, xl/styles.xml:
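You can verify the 36902 encoding yourself. For dates after 1 March 1900, the serial is simply the day count from 30 December 1899 (a sketch; the 1900 leap-year quirk below that threshold is ignored):

```python
from datetime import date

EXCEL_EPOCH = date(1899, 12, 30)  # serial 0 for post-1900-03-01 dates

def excel_serial(d: date) -> int:
    """Day-count serial that Excel stores for a date cell (no date type!)."""
    return (d - EXCEL_EPOCH).days

print(excel_serial(date(2001, 1, 11)))  # 36902
```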

This old Excel 2003 XML is not nearly as bad as the one in ECMA 376. We can infer it's a date thanks to the ss:Type attribute. And, goodness, the date is encoded
using ISO 8601, which is definitely a good thing for interoperability purposes.

The obvious question: why is the new XML a proprietary encoding, when Microsoft managed to ship Excel 2003 (this Excel 2003 XML can also be generated with Excel 2007)
with dates encoded using an international standard, with no need for proprietary parsing? Why isn't SpreadsheetML an extension of the old Excel 2003 XML, following
the same good principles?

9. All roads lead to Office 2007

In part 2 of ECMA 376, page 96, we learn:

Table H–1. Package model conformance requirements

M1.30 The package implementer shall name relationship parts according to the special relationships part naming convention and require that
parts with names that conform to this naming convention have the content type for a Relationships part.

In theory, this requirement allows unattended part renaming. In practice, it means that some process in the processing chain may reshuffle an entire package on its own,
which may break assumptions made by other processes in the chain.

Interestingly enough, Excel 2007 takes this reshuffling to heart. If you create a package with part names and relationships that perfectly conform
to the open packaging conventions as defined by ECMA 376 part 2, this by no means guarantees neutrality throughout the entire processing chain, especially if you
open the package in Excel 2007 and save it without making changes.

To reproduce the scenario:

Start Excel 2007 and create a new spreadsheet

Save the spreadsheet (xlsx file)

Close it

Unzip it

Edit the [Content_Types].xml part manually, replacing /xl/worksheets/sheet1.xml with /xl/custom/sheet1.xml (renaming the part and its relationship accordingly), then re-zip the package, open it in Excel 2007, save it, and unzip it again

What you should see is that our custom structure has been replaced with an Excel 2007 hardcoded structure. If a business process assumes the existence of a custom structure,
it won't work. Why is Excel 2007 reshuffling the package? It's ours!
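The manual experiment above can be automated, since an OPC package is a plain Zip. Here is a sketch using Python's zipfile module; the file names in the usage comment are hypothetical, and a real tool would also patch every .rels part that points at the renamed part, not just [Content_Types].xml:

```python
import zipfile

def rename_part(src: str, dst: str, old: str, new: str) -> None:
    """Copy an OPC package, renaming one part and patching references
    to it in [Content_Types].xml. Sketch only: relationship parts that
    target the old name by relative path would also need patching."""
    with zipfile.ZipFile(src) as zin, \
         zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for info in zin.infolist():
            name, data = info.filename, zin.read(info.filename)
            if name == old.lstrip("/"):
                name = new.lstrip("/")          # move the part itself
            elif name == "[Content_Types].xml":
                data = data.replace(old.encode(), new.encode())
            zout.writestr(name, data)

# Hypothetical usage:
# rename_part("book.xlsx", "book-custom.xlsx",
#             "/xl/worksheets/sheet1.xml", "/xl/custom/sheet1.xml")
```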

10. A world of Zip+OLE files

To reproduce the scenario:

Start Excel 2007 and create a new spreadsheet

Click on the Office button, select the Prepare menu option, and then Properties

Enter metadata such as author, title, ...

Click on the Office button, select the Prepare menu option, and then Encrypt Document

Enter a password, re-enter the password

Save the spreadsheet (xlsx file)

Close it

If you try to unzip it, you'll get an error. When you password-protect any Office 2007 document, it becomes an OLE document. Wait a minute, isn't Microsoft moving to Zip files?
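You can tell the two containers apart from the first bytes of the file. A small sniffing sketch (the magic numbers are public knowledge: Zip entries start with PK\x03\x04, OLE compound files with D0 CF 11 E0 A1 B1 1A E1):

```python
OLE_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"
ZIP_MAGIC = b"PK\x03\x04"

def container_kind(path: str) -> str:
    """Report whether an Office file on disk is a Zip package (ECMA 376
    OPC) or an OLE compound document (structured storage)."""
    with open(path, "rb") as f:
        head = f.read(8)
    if head.startswith(ZIP_MAGIC):
        return "zip"
    if head == OLE_MAGIC:
        return "ole"
    return "unknown"
```

A password-protected .xlsx will report "ole", not "zip" — which is exactly the regression described above.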

Here is a screenshot of what you should see in an OLE document viewer:

If you hover your mouse over the file in Windows Explorer (note: behavior not tested on Windows Vista), you should get a tooltip with only the minimal information
provided by the disk file system, but not the document metadata (author, subject, title, keywords, last modification date, ...).

The OLE container holds a number of OLE streams. The document metadata is not available for consumption, meaning that you cannot retrieve this information anymore.
This is a regression compared to the binary file formats.

In fact, there are two regressions: the document type changes; the metadata is not accessible. And the password-protection mechanism is undocumented as a whole.

Let's see how exactly it compares to an old binary password-protected Excel file.

To reproduce the scenario:

Start Excel 97 or 2000 or XP or 2003 and create a new spreadsheet

Go to File / Properties

Enter metadata such as author, title, ...

Go to Tools / Options, and select the Security tab

Enter a password in the Password to open field and click OK, then re-enter the password

Save the spreadsheet (xls file)

Close it

If you hover your mouse over the file in Windows Explorer, you should get a tooltip with all the information provided by the disk
file system plus all the document metadata (author, subject, title, keywords, last modification date, ...)

Indeed, if we open the xls file in an OLE document viewer, the file structure is kept intact; only the OLE document stream itself is encrypted (it contains the encrypted BIFF content).
The stream holding the metadata, \005SummaryInformation, is left unencrypted. Here is a screenshot of a password-protected binary Excel file in an OLE document viewer:

A legacy password-protected MS Office document is still an OLE file

This has unfortunate consequences for content management systems (CMS). By making the metadata of password-protected Office 2007 documents unavailable,
existing CMS workflows are arbitrarily broken. By changing the content type of the document, only because it is password-protected, existing CMS workflows are given another reason to break.

Note: if Microsoft SharePoint Server, or Windows Vista, manages to show the metadata of a password-protected Microsoft Office 2007 document, perhaps it knows something that we don't...

11. Document security is a (bad) joke

To reproduce the scenario:

Start Excel 2007 and create a new spreadsheet (or open an existing spreadsheet).

Can you think of worse document security? We have a format that encrypts buffers into OLE containers when it shouldn't, and that leaves XML password hashes
right in the Zip file, in such a way that anyone can manually edit the file and remove them.
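To see where the hash sits, it is enough to scan the worksheet parts inside the Zip. A read-only inspection sketch (the sheetProtection element and its password attribute are per ECMA 376 SpreadsheetML; the file name is whatever spreadsheet you saved):

```python
import re
import zipfile

def sheet_protection_hashes(xlsx_path: str):
    """Return (part name, hash) pairs for every worksheet part that
    carries a <sheetProtection> element with a password attribute."""
    hits = []
    with zipfile.ZipFile(xlsx_path) as z:
        for name in z.namelist():
            if name.startswith("xl/worksheets/") and name.endswith(".xml"):
                xml = z.read(name).decode("utf-8", "replace")
                m = re.search(r'<sheetProtection[^>]*\bpassword="([0-9A-Fa-f]+)"', xml)
                if m:
                    hits.append((name, m.group(1)))
    return hits
```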

When you consider the number of professional spreadsheet
workers out there relying on sheet protection to ensure the integrity of their spreadsheets, it is hard to overstate the impact of this flaw once they learn about it...

There goes another regression with these formats. Note: the binary Excel format had a similar PASSWORD record, stored on a per-worksheet basis, but the big
difference is that none of the target audience is able to edit a binary Excel file. It's binary, so it is a form of security by obscurity. And since binary
files contain indexes, removing a record with a hexadecimal editor without taking care of the dirty laundry would automatically result in a corrupt spreadsheet.

I guess it's fair to say this is a case where document security could have used something like the calculation chain to make such manual changes less easy (see the first section of this article). Oh, the irony!

12. BIFF is gone...not!

BIFF (Binary Interchange File Format) is the name of the Excel binary file format. A BIFF buffer is contained within an OLE container and consists of a sequence
of BIFF records. A BIFF record consists of an identifier, the length of the data to follow, and the corresponding buffer of the record. BIFF itself is platform-neutral,
but of course OLE isn't. Anything stored within BIFF records that references Windows-specific structures (such as the DEVMODE structure for printer settings) voids
any platform-neutrality claims. Every Excel release has its own BIFF-specific records. Those collective sets of records are called BIFF8 for Excel 97, BIFF9 for Excel 2000,
BIFF10 for Excel XP, and BIFF11 for Excel 2003.
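The record layout just described is easy to sketch: each record is a 2-byte little-endian identifier, a 2-byte length, then the payload. A minimal walker over a synthetic stream (the record ids shown, 0x0809 for BOF and 0x000A for EOF, are BIFF8 values):

```python
import struct

def iter_biff_records(stream: bytes):
    """Yield (record id, payload) pairs from a BIFF stream: 2-byte
    little-endian id, 2-byte length, then that many payload bytes."""
    offset = 0
    while offset + 4 <= len(stream):
        rec_id, length = struct.unpack_from("<HH", stream, offset)
        offset += 4
        yield rec_id, stream[offset:offset + length]
        offset += length

# A tiny synthetic stream: a BOF record with 2 bytes of payload,
# followed by an empty EOF record.
demo = struct.pack("<HH", 0x0809, 2) + b"\x00\x06" + struct.pack("<HH", 0x000A, 0)
print(list(iter_biff_records(demo)))  # [(2057, b'\x00\x06'), (10, b'')]
```

Note that walking records this way says nothing about what each record means; that is exactly where the undocumented parts begin.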

One of the claims from the introduction is that ECMA 376 documents are moving to Zip and XML exclusively. This implies that BIFF is gone. Let's see if that's true (emphasis mine).

From: Doug Mahugh (Microsoft)
Sent: August 21, 2006
To: Stephane Rodriguez
Stephane,
I shared the link with some of the people over in the BIFF12 group today,
and they liked the article too. They had some additional info for you which
may be of interest. These sorts of details will all be documented in
the BIFF12 documentation that will come out when Office ships, but for now
there's no way you could know what's going on for sure (as you explain
in the article)
(...)
Doug Mahugh
Office 2007 Technical Evangelist
425-707-1182 | MSDN Blog | OpenXmlDeveloper.org

Ironically enough, to date, the only article shedding light on BIFF12 available on the internet is this article. Microsoft made no such document available.

There is more than meets the eye.

Besides the fact that BIFF is actually not gone at all, an even bigger shocker is that BIFF12 is a departure from BIFF11. It's not an extension of BIFF11 with
new BIFF records; it's a complete rewrite of all BIFF records, with a new BIFF record model and new content encoding. So anyone who has invested
in a BIFF library (or grabbed one elsewhere) will have to redo that work from scratch. Where has backwards compatibility gone?

The Excel binary format is the second full fidelity format for Excel 2007. It is similar to the Office Open XML format in structure - a set of related parts,
in a Zip container - except that instead of each part containing XML, each part contains binary data.

Even though we've done a lot of work to make sure that our XML formats open quickly and efficiently, this binary format is still more efficient for Excel to open and
save, and can lead to some performance improvements for workbooks that contain a lot of data, or that would require a lot of XML parsing during the Open process.

It's hard enough to understand that Microsoft Office is moving to XML and away from XML at the same time, but the official justification is performance-related.
Unfortunately, this justification is just a lie, since performance is already the justification for the extreme awkwardness of the SpreadsheetML file format itself.
Someone from the Microsoft Office team had this to say about it:

There are a number of things we looked into doing to help improve the performance of these files (SpreadsheetML) both using current technology as well as thinking about
future approaches. Some of the approaches we took that are easiest to understand are:

Reduce the need to fully parse repeating data - In order to cut back on parsing large strings repeated multiple times, or formulas that are repeated down an entire
column, SpreadsheetML does a lot of sharing. I already talked about the huge benefits we can get from the shared formulas work in a post last spring. The shared string
table can give similar results.

Tag size - The size of XML elements actually does directly affect performance times. The more text you have to parse, the longer it will take.

Splitting the XML into multiple parts - In SpreadsheetML, we break the XML content out into multiple parts in the Zip (each worksheet, the string table, pivot table data,
etc.). This can help a ton with on-demand loading, or even loading multiple pieces at once. For example, it means that if you had a multi-proc machine, you could load each
worksheet on a separate thread. It also means that you could decide to only load one sheet and you wouldn't have to parse through all the XML from the other sheets.

Relationships stored outside of content - By storing all the relationships from one part to another in separate (and much smaller) files, it makes it really easy to see
what other parts you should load when you are loading a particular part of the file. If it weren't for the relationships, you'd actually have to parse through the content
of the XML to determine what other resources that part used. For example, if you wanted to just load one slide in a presentation, you can also see what images are used on that
slide before you start parsing the XML for the slide.
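The part-per-worksheet layout and the externalized relationships the quote describes are easy to observe with stock tools. A sketch that lists all part names plus the targets of the package-level relationships part (the relationships namespace is the standard OPC one; the file name is whatever package you have on hand):

```python
import zipfile
import xml.etree.ElementTree as ET

REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"

def parts_and_root_rels(path: str):
    """Return every part name in an OPC package, plus the targets of
    the package-level relationships part (_rels/.rels)."""
    with zipfile.ZipFile(path) as z:
        parts = z.namelist()
        targets = []
        if "_rels/.rels" in parts:
            root = ET.fromstring(z.read("_rels/.rels"))
            targets = [r.get("Target")
                       for r in root.findall(REL_NS + "Relationship")]
    return parts, targets
```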

At this point, we are left with the obvious question: if SpreadsheetML was made much more complex than we would have expected only to cope with performance problems,
what is the rationale for the binary workbook (.xlsb)?

Microsoft won't tell. There are, however, two reasons:

the embarrassing reality of XML

the embarrassing reality of ECMA 376

To illustrate the first untold reality, it suffices to create a new spreadsheet and then query external data. As you do this, Excel 2007 creates a part called the connection
data source part, where it stores the connection strings in plain text, among other things. It should be clear by now that connection strings contain sensitive
information such as server names, logins, and passwords. Oops!

The quick Microsoft solution to this? Security by obscurity: just turn this stuff into binary records (BIFF12), and in theory the problem goes away.

A similar problem with XML parts is that password hashes are stored in plain text, as we've seen in the previous section, meaning that armed with a simple text editor,
both password hashes and connection string passwords can be edited and/or removed. Vulnerabilities by design?

The embarrassing fact for ECMA 376 is that since the format is supposed to be XML only, the new binary workbook, despite being the official answer to the plain-text problem,
cannot be part of that documentation without automatically violating the claim that those documents are made of XML.

Last but not least, a pseudo-rebuttal was posted to address the lack of availability of the BIFF12 documentation (a minor problem compared to the XML rule violations,
by the way). This pseudo-rebuttal explained that the documentation of the binary formats can be freely obtained. Here is the corresponding
article, in which we learn:

Microsoft makes its .doc, .xls, and .ppt binary file format specifications available under a royalty-free covenant not to sue to anyone who wishes to
implement all or part of these specifications in their products. Implementation includes the ability to use the specification documentation
for analysis and forensic reference purposes. Microsoft Office Drawing File Format for 2007 and Visual Basic for Applications (VBA) File Format for 2007 are
also available under this program.

No right to create a competing product.

Whether Microsoft responds at all to any such documentation request remains to be seen. Likewise, whether the documentation contains the material
required even for just analysis or forensic purposes remains to be seen. Microsoft used to distribute this documentation as part of the MSDN Library,
until February 1998 (right when Mr. Gates took an interest in Office documents viewed in Internet Explorer). This documentation is incomplete, often merely descriptive,
and full of typos. In fact, just like the ECMA 376 documentation itself.

Now enter BIFF11+. Yes, you read that right! Microsoft created not just one new BIFF file format; it also took the time to extend
BIFF11 (i.e., Excel 2003's BIFF file format) to include new BIFF records. What for?

The reason is round-tripping of spreadsheets. Imagine you create a new Excel 2007 spreadsheet using some of the new features, such as the formatting
databar (a bar chart drawn inside cells). Then you save this file as an Excel 97-2003 compatible file in order to facilitate collaboration with others (not using Excel 2007).
What you would still expect is that this file preserves those features. That's exactly what BIFF11+ does, except that Microsoft hasn't bothered to document those new BIFF records,
or even to mention them.

To reproduce the scenario:

Start Excel 2007 and create a new spreadsheet (or open an existing spreadsheet).

Enter values 10,20,30,40,50 in separate cells, then click on the ribbon on the Style button, Conditional Formatting, Data bars, and then pick a color gradient.

Save the spreadsheet as an Excel 97-2003 Workbook (*.xls).

A prompt warns that there is going to be a significant loss. Click Continue.

Open the .xls file with any older version of Excel (97/2000/XP/2003).

The file opens with no prompt, and the databar does not show up (which was expected).

Enter value 60 next to the other cells, and save the file as a regular .xls file, just like it is.

Open this file in Excel 2007: the databar magically shows up. The databar, however, does not include the value we just entered, meaning that we have created a discrepancy.

How can the databar show up at all in Excel 2007 since we went through two independent .xls writing phases (one with Excel 2007, one with Excel 97/2000/XP/2003)?

You guessed it: new BIFF11+ records are created and preserved. None of this is documented, meaning that this round-trip scenario only works with Microsoft Office. It works as follows:

When Excel 2007 creates an Excel 97-2003 workbook (*.xls), it actually creates a BIFF11+ file. Because BIFF11+ is an extension of BIFF11, it can be opened
in Excel 97, 2000, XP, and 2003.

And because of how BIFF works (any record in the header and footer of internal sections is preserved as is), a save
in Excel 97, 2000, XP, or 2003 does not kill those new records.

Since Microsoft created these new BIFF records to represent the new
features (themes, databars, new visual candy in charts, ...), and none of it is documented, it follows that interoperability across those file formats is exclusive to Microsoft.

BIFF11+ records are not a subset of BIFF12 records. Implementers will have to implement/preserve BIFF11+ records and BIFF12 records.

We have a discrepancy in the databar that is left for actual spreadsheet users to worry about. The value 60 is left disconnected from the formatting databar, even in Excel 2007.

To close the discussion about BIFF, let's take a look at an example of BIFF11+, actually just an excerpt from a file (in bold, the new BIFF11+ records):

13. Document backwards compatibility

Here is a simple chart created with Excel 2003. Open it with any of Excel 97/2000/XP/2003,
then open it in Excel 2007. Microsoft said they could not change the internal structure of Excel spreadsheets (and other Office document types) because they had to provide
100% full fidelity; otherwise their customers would not want it. Is this true?

A simple chart in Excel 2003

Same file opened in Excel 2007

The differences are:

Vertical axis all set to automatic scale/min/max.

Also impacts the number of horizontal gridlines in the background.

Chart title font not the same weight.

Chart title incorrectly positioned vertically.

Legend border incorrect.

Legend entries incorrectly positioned.

Spacing between the plot area and the legend.

<sarcasm>Programs used by hundreds of millions people need no special attention, that goes without saying....</sarcasm>

Good luck programming charts. Microsoft provides no documented mapping between the Escher library (pre-Office 2007 era) and the DrawingML+VML libraries (Office 2007 era,
before they get dropped in favor of something else in the future). Could there possibly be a reason why?

Answer: Microsoft has dropped the existing chart drawing engine in favor of a new library. It is impossible in practice to rewrite a library with new source code
and a new agenda (visual candy) and still achieve 100% full fidelity with the past. This results in improper drawing of existing charts. Which means, in other words:

Microsoft is lying: the part of the ECMA 376 documentation describing charts does not map 100% to the old chart drawing behavior

The implementation itself is buggy, an automatic consequence of the first point

14. ECMA 376 documents just do not exist

Last but not least, ECMA 376 documents just do not exist. The reasons are manifold:

Office 2007 documents are incompatible with theoretical ECMA 376 documents, since the ECMA 376 documentation says among other things
that VML is deprecated, and Office 2007 documents are still using plenty of it when creating new documents.

Microsoft's own strategy is to provide as little information as possible about the huge impedance mismatch between ECMA 376 and
the actual implementation. Expect ECMA 376 to evolve only marginally, while the next version of Office will come with plenty more proprietary
layers, especially those integrating the Microsoft Office suite with Windows (on the client), more undocumented integration points between Microsoft Office,
Windows, and SharePoint (on the server), and more undocumented client-server protocols (between the Microsoft Office client and servers running Microsoft server software).

15. How does the ISO OpenDocument format (ODF) compare?

When I wrote this article, I wanted it to focus exclusively on the Microsoft Office XML formats, but Microsoft apologists out there, running out of arguments
to justify why the Microsoft Office XML formats are that bad, turned to the typical rhetoric and started taking jabs at the ISO OpenDocument format (ODF). I thought, well,
let's do a quick comparison on all the points above. The results are as follows:

Self-exploding spreadsheets: I have created a visual diff
of two ODF spreadsheets, before and after the modification of a cell containing a formula. And sure enough, there is no calculation chain, which means
it's easy to make manual changes to existing ODF files, no hassles, you don't have to care about the format's own dirty laundry. And to remove the formula,
you just remove the corresponding XML formula attribute in that cell. ODF : 1, ECMA 376 : 0.

Entered versus stored values: The entered values are stored as is. There is no rounding error. ODF goes as far as storing both
the entered value and the displayed value (after the number format is applied), which is a wonderful thing for those wanting to extract data.
ODF : 2, ECMA 376 : 0.

Optimization artefacts: There is no such thing as a shared formula concept in ODF. I guess ODF need not be encumbered with optimization techniques that were valid 20 years ago,
when computers were a thousand times less powerful than they are now. ODF : 3, ECMA 376 : 0.

VML isn't XML: There is no VML or anything fundamentally non-XML in ODF. If you create a note in an ODF file, the corresponding XML fragment is:

This is proper XML, one value per attribute, and the fragment appears fully in context in the stream. Plus, you can grab an existing SVG library to make sense of the XML;
no need to implement a single vendor's soup. ODF : 4, ECMA 376 : 0.

Open packaging parts minefield: There is no such thing as a tree of parts. Instead, ODF has a central directory whose management is deterministic
and in fact trivial. In retrospect, why did Microsoft go with such a problematic concept of a part tree with non-deterministic relations? Was it because strictly
copying ODF would have been embarrassing (Not-Invented-Here syndrome)? ODF : 5, ECMA 376 : 0.

International, but US English: Everything is stored uniformly in US English, there is no discrepancy. The locale is applied as per
the style definitions, separate from the data, which is a good design in decoupling that implementers can take advantage of (smaller implementations,
less prone to errors, especially in a global market context). ODF : 6, ECMA 376 : 0.

Many ways to get in trouble: ODF defines a single style concept across the entire format. So not only does each word processing, spreadsheet, or presentation document
have a single style concept instead of many, but all document types share the same style concept. This translates into interoperable formatting across applications
and platforms, plus small implementations. Contrary to ECMA 376, implementers don't have to implement more than 15 years
of formatting legacy. ODF : 7, ECMA 376 : 0.

All roads lead to Office 2007: ODF also destroys any custom part name. ODF : 8, ECMA 376 : 1.

A world of Zip+OLE files: After encryption, an ODF file is still an ODF file (i.e., Zip, XML). ODF encrypts the content part (using a standard algorithm
described in the manifest part). The metadata part is left unencrypted. ODF : 9, ECMA 376 : 1.

Document security: ODF does the same: it saves a password hash that can be manually removed. ODF : 9, ECMA 376 : 2.

BIFF is gone: ODF is Zip and XML. If you insert or import an OLE object, it's saved as a binary part, just like ECMA 376 does, but the ODF streams
themselves are still XML. ODF : 10, ECMA 376 : 2.

Document backwards compatibility: I installed the oldest OpenOffice release I could get (1.0.3, released in 2003), loaded the same chart
as in the example, saved it as an .SXC file (this was the file extension before it became .ODS later), then loaded the .SXC file in OpenOffice 2.2.
The chart is exactly the same in both versions. ODF : 11, ECMA 376 : 2.

Final score: ODF: 12, ECMA 376: 2

To cope with these problems, ECMA 376 would have to be redesigned into something that is pretty much what ODF already does. Why isn't Microsoft extending ODF instead
of reinventing a bad wheel? Customers would benefit a great deal from a single standard.

Comments

First of all, congratulations Stephane for a very good article. But what happens when Microsoft creates a new Excel, better than Excel 2007?

I would like to point out that new technologies come out every day, so an application that works now may not work in the future due to technology change. Because of that, you need a way to build your application so it doesn't depend on any version of Excel or other spreadsheet program. To achieve that, try using the GemBox .NET Excel component, which makes you independent of the version of Excel you are using. Even better, clients that use your application don't even need to have Microsoft Excel installed.

Spreadsheets have grown bigger and bigger. I guess the reason is that users want more functions. Still, why package them in this way? Just because users want to access many functions in a uniform way, there is no reason to bundle them up in an incomprehensible way. Why not follow the Internet Explorer container approach? I suspect that it is the bundling that has created all the problems so well exemplified by Stephane Rodriguez.

Myself, I would prefer a modular structure behind the scenes where Microsoft could open up the various interfaces for independent developers. I thought they were actually doing this with XML and <%Eval statements for inserting from business databases. Check out those great new videos from the Visual Basic Team at Microsoft on www.asp.net/learn/linq-videos! Video number 7 tries to explain how to create Excel spreadsheets from database data using LINQ to XML. Seems like a great idea to me!

The developer wants open, easy-to-use interfaces so he or she can quickly combine and process data from different sources. The user doesn't care how the numbers get there. Some like to get them into web pages, others into spreadsheets. I never trusted any date/time functions in Excel. I never trusted any rescaled graph produced by this software. Matlab and other engineering products are needed for accurate displays. OK, the sales people may be happy with some nice-looking graphs produced by Excel. But these should not be taken seriously. I believe the time has now come to break up the spreadsheet into a well-known user-interface container and different plug-in functions with public interfaces for developers. I believe this could solve the problems pointed to by Stephane. If Microsoft does not rise to this challenge, Sun and others surely will!

Congratulations on this very precise analysis. I have been fighting with the XML of Excel 2007 for a few days, trying to port VB.NET code that produced template-based Excel 2003 workbooks in XML to the new 2007 version. In the 2003 code I was parsing XML to populate NamedRange values. I found a trick to reuse that code to target the new definedName tag of 2007 (stored in workbook.xml). When I zip the 2007 XML and open the XLSX, I find the values in the Name Manager, but all cells where I display defined-name constants show a desperate 0 value. After a few hours of intensive research, I realised that the tags in SheetX.xml were not updated when the XLSX was opened (even if the calculation method is set to automatic). If I also have to programmatically modify all the sheet XML documents, I will definitely have a less effective process than what I had in Excel 2003. If someone has already played with the definedName tag, I would appreciate any help. Little documentation seems to be available on the Net.
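For what it's worth, one workaround I've seen for stale cached values is the `fullCalcOnLoad` attribute of the `<calcPr>` element in `xl/workbook.xml`, which asks Excel to recompute every cached result when it opens the file, instead of trusting the values serialized in the sheet parts. A minimal sketch in Python (the helper name is mine; adapt the idea to your VB.NET pipeline):

```python
import re

def force_full_recalc(workbook_xml: str) -> str:
    """Set fullCalcOnLoad="1" on <calcPr> in xl/workbook.xml so Excel
    rebuilds all cached cell values when the workbook is opened."""
    # drop any existing fullCalcOnLoad attribute first
    workbook_xml = re.sub(r'\s*fullCalcOnLoad="[^"]*"', "", workbook_xml)
    if "<calcPr" in workbook_xml:
        return workbook_xml.replace("<calcPr", '<calcPr fullCalcOnLoad="1"', 1)
    # no <calcPr> element yet: add one just before the closing tag
    return workbook_xml.replace(
        "</workbook>", '<calcPr fullCalcOnLoad="1"/></workbook>')
```

This sidesteps having to patch every SheetX.xml by hand, at the cost of a full recalculation on open.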

This article is worth every single line.
I'm very pissed off by some people at work embedding OLE objects inside Word documents, but now that I've read this article I'm thinking I was only seeing the tip of the iceberg.
I'll probably forward this article to all my co-workers because it's worth reading.
As a developer I never really liked XML anyway, but since most of the reporting applications running on our servers tend to export some kind of XML document, I thought about writing some sort of porting tool; now that is out of the question for the enterprise.
Thanks.

Stephane's article is excellent and correct, but I think it misses the point entirely.

I think the point is...
Question 1: Why does any company do anything?
Answer 1: They want to increase sales/penetration of their products.

Question 2: What strategy would make sense to MS to increase sales/penetration of their products?
Answer 2a: Open their proprietary file formats to the world and allow anyone to easily read and write them. This is absolute nonsense as the first thing that would happen to MS if they did that would be a rush of free PowerPoint and Excel clones that would manipulate Office 2007 files without the need for Office 2007.

Answer 2b: Open their proprietary file formats to the world and allow anyone to easily WRITE them, but make READING them effectively close to impossible? GOOD ANSWER. Now there are simply more Office 2007 files around but you still need Office 2007 to properly read them.

So I believe that the problems I and other people are having effectively reading an Office 2007 file are part of a built-in and deliberate strategy. I also think we're going to have to learn to live with it. Essentially, what is needed is a layer between the .???x file and the XML node functionality that (aside from respecting the published schemas) understands how to maintain the (unpublished details of the) structural integrity of the .???x file. Any takers??

1. I made three experiments - with Word, Excel and Powerpoint;
2. I opened one of the Office applications listed above;
3. Then I selected Save or Save as and saved the new file (notice - no changes to the file at all; we have an empty file);
4. I navigated to the saved file, renamed it to .zip, unzipped it, deleted the old zip, zipped the file's folder back up (notice - no changes to any of the XML files or folders) and renamed it to the proper extension (.docx, .xlsx or .pptx respectively);
5. I opened the file - I got an error message warning about unreadable content.
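The round trip in steps 3-4 can be scripted, and doing so points at one plausible cause of the error in step 5: zipping an extracted folder with the Windows shell typically adds a top-level directory (and may use backslashes) in the entry names, whereas the package expects `[Content_Types].xml` and the other parts at the archive root with forward-slash names. A hedged sketch that repacks entry-for-entry without touching the stored names:

```python
import io
import zipfile

def repack(package_bytes: bytes) -> bytes:
    """Rebuild an Office package entry-for-entry, keeping each part's
    stored name (archive root, forward slashes) exactly as it was."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as zin, \
         zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as zout:
        for name in zin.namelist():
            zout.writestr(name, zin.read(name))
    return out.getvalue()
```

Repacking this way, rather than zipping an extracted folder, avoids introducing a spurious leading directory into every entry name; whether that explains the "unreadable content" warning in this particular experiment is an assumption on my part.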

"By changing the input value without changing the corresponding output value in the saved file, the file was corrupted."

The example does no such thing. What we do is make an arbitrary change to a cell. The change does not even try to know whether there is a formula, nor should it (in VBA, what we do is Range("E2").Value = 40).

As an aside, if you keep the formula and only change the cached formula value (40 instead of 30), Excel will not necessarily complain, contradicting the patronizing behavior of the calculation chain. In this trivial example it won't complain, but I'm not sure the same holds for more complex formulas (external references, table cell projection, ...).
This scenario in fact exposes another flaw in the file format. If we assume the format is stricter and requires the calculation chain to be kept perfectly in sync, then Excel has the responsibility to complain whenever someone makes arbitrary changes to values associated with formulas. The blatant problem is that if someone makes such a change as part of a processing chain, and the next processing block assumes the cached formula value is correct, this will obviously lead to flawed value extraction/computation/rendering/printing/...
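To make the inconsistent state concrete: in a sheet part, a formula cell is a `<c>` element carrying both the formula `<f>` and its cached result `<v>`. A sketch (Python, helper name mine) of the kind of third-party edit being discussed, which rewrites only the cached `<v>` and leaves the `<f>` untouched:

```python
from xml.etree import ElementTree as ET

# SpreadsheetML main namespace used in xl/worksheets/sheetN.xml
NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
ET.register_namespace("", NS)

def set_cached_value(sheet_xml: str, ref: str, value: str) -> str:
    """Overwrite only the cached <v> of a formula cell, leaving the
    <f> untouched -- the inconsistent state discussed above."""
    root = ET.fromstring(sheet_xml)
    for c in root.iter(f"{{{NS}}}c"):
        if c.get("r") == ref:
            v = c.find(f"{{{NS}}}v")
            if v is not None:
                v.text = value
    return ET.tostring(root, encoding="unicode")
```

Nothing in the package stops this edit, yet any consumer that trusts the cached value downstream will now read a number the formula never produced.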

There are more scenarios exposing flaws. In fact, the new file format brings ALL flaws of the binary file format (many ways to corrupt the file if an application other than Excel makes a change), plus new flaws.

I don't know if the article will be updated with more flaws, just on the calculation chain subject there is more to say. If my goal was to be exhaustive, basically ending up redesigning/rewriting/reformulating most content of the 6,000 pages of the documentation, I would be in it for years. I have a job, so I won't do that.

I heartily agree with almost everything you have said, particularly with regard to the giant leap backwards MS seem to have taken from the Office 2003 XML format, on which one could at least safely use XSLT transforms.

Just one small niggle, though: the precision of the stored values is variable, but in the examples you give the errors are O(1e-12) and O(1e-13), not O(1e-5) and O(1e-4). It is still a burden to have to convert that to some "nearest number within 0.5e-10", say, and what if the number stored was correct and the conversion/import dropped important precision?
-Greg

o(x) is the well-known notation in mathematics. I'll stand by the fact that the rounding errors are o(1e-5) and o(1e-4) respectively. The main point is that they are non-zero despite the fact that XML is just text, so an entered value should be preserved as is. What it actually demonstrates is that the Excel team could not be bothered with XML for what it really is. This refusal to use XML features ramifies through the entire spec. If I were cynical I would say that the XML effort started in 1998 leaves a bit to be desired. (And it's quite odd when you see how good Excel 2003's XML was. The interesting question: who at Microsoft forced the reversal?)
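For readers who want to reproduce the effect being argued over: a binary IEEE-754 double cannot represent most decimal inputs exactly, so serializing the stored double at full precision leaks trailing digits into the text, even though XML itself could have carried the entered decimal verbatim. A small illustration (the value 8.26 is my own example, not one from the article):

```python
# A decimal like 8.26 has no exact IEEE-754 double representation.
# Printing the stored double at full precision exposes the noise
# that ends up in a serialization based on the binary value.
entered = "8.26"
stored = float(entered)        # what a binary format actually keeps
full = format(stored, ".17g")  # full-precision rendering of the double
print(full)                    # trailing digits, not "8.26"
print(full == entered)         # False: the text round trip is lossy
```

A format that stored the user's decimal string directly (as the reply below notes ODF does) would not exhibit this.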

I show in Section 15 of the article that ODF actually does just that: preserves values as is...