Defective by Design

If you are a regular reader of this blog, or almost any other blog covering the file format battle, then you have certainly come across comments from Stephane Rodriguez. He brings a unique perspective, knowing more about the legacy binary Excel formats than any sane person I know. So it is a treat to have from Stephane a fuller exposition of some problems he ran into working with the Microsoft OOXML formats, in a technical article called Microsoft Office XML Formats? Defective by Design.

Enjoy!

27 August Update: Slashdot has further coverage of Stephane’s article. Some good comments and perspectives can be read there.

First, he complains that Office stores 12345.12345 as 12345.123449999999, and thinks that the difference between these two numbers is “1e-5”. For one thing, the difference is 1e-12, not 1e-5, which comes down to a relative error of 8e-17. Now, for some applications this may be significant, but it does not mean that Microsoft has some hidden information they are not revealing in the file format — it just means that they are using binary floating point internally to store numbers for their spreadsheet (as opposed to fixed point, or at some blessed future date, decimal floating point). This is a property of the implementation, not the file format.
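This is easy to check in any environment that uses IEEE 754 doubles. Here is a minimal sketch in Python, whose floats are 64-bit IEEE 754 doubles, the same representation Excel uses internally (the variable names are mine; the exact digits printed assume a correctly rounded string-to-double conversion):

```python
from decimal import Decimal

# Python floats are IEEE 754 binary64, so we can inspect the exact
# value the stored double actually holds for the entered number.
entered = "12345.12345"
stored = float(entered)            # nearest representable double

exact = Decimal(stored)            # exact decimal expansion of that double
error = abs(Decimal(entered) - exact)

print(exact)   # begins 12345.12344999..., slightly below the entered value
print(error)   # on the order of 1e-12 -- a relative error near 8e-17
```

The error comes from the binary representation itself, not from anything the file format adds or hides.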

Second, he complains that the format internally uses US English, regardless of what language the software is localized for, and in particular that “formula names are US English formula names”. Think about what he is suggesting for a moment: if a formula like “SUM” were stored in the local language, for example, *every* implementation would need to support *every* language in order to open files created from other languages.

Anonymous said “…It is quite logical for MS Office to have certain flaws and/or failures in implementing a new format.”

Are you simply naive or just pretending? If MS cannot implement their own spec, how can you expect others to do so? Further, in this case the OOXML spec came from MS after Office 2007. So it is clear that the spec is faulty.

I wonder what security implications the problems mentioned have. I mean, can I create a document which looks different under some obscure condition, like the viewing date? The point being that there would not be any visible macros or functions.

This raises another question: how hard would it be to verify how robust a document format is?

Rob, thanks for the link. Note that I’ve added a couple more sections.

To the first commenter: no, these are spec flaws. In the first example, the calculation chain should be required to be governed by the cells, not the opposite. This translates into a set of rules that are not in the spec.

To Steven,

– Rounding error: point taken. My point, though, is simply to show that entered and stored values are not the same. Anything stored with some execution environment bias is faulty to me, especially when the bias is undocumented.

– US English bias: 1) This is not how VBA works. Office developers use VBA. Try VBA and you see everything is localized for you, including formulas, contexts and so on. So when Microsoft markets direct access as a good alternative to VBA development, I want to remind people out there that perhaps it’s not exactly the case. 2) As I mentioned in the article, there are plenty of contradictory country/encoding flags anyway; for instance the shared libraries (VML, …) store those flags, meaning that the intention is different than in SpreadsheetML/WordML/PowerpointML. But all of that is within the same package, in parts that relate to each other!

What struck me the most with the multitudinous text formatting details, was that this is likely to be the first port of entry for the malware creator. If MS Office 2k7 is juggling all those text formatting details at one and the same time – and they are converted to and from XML – then there is ample space for some luckless programmer to have missed something out, and allowed a humungous great big ginormous hole in MS Office 2k7.

All the more reason for people not to bother with implementing it.

After all, if we take the point of view that all this kerfuffle over File Format Standards is actually about providing normative definitions of “Office Suite” and its component parts, then by containing such a heterogeneous mixture, Microsoft has practically admitted they have no idea what such a normative definition should look like.

And as such, should be ignored until they come up with something intelligible.

There is more going on in the numeric storage model than meets the eye, and certainly more than is in the standard.

For example, as Stephane pointed out, enter 1234.1234 into a cell and save it in Excel 2007. The value stored in the XML’s “v” element is “1234.1233999999999”.

OK. I can live with the idea that a particular floating point model cannot represent all rational numbers exactly. 1/3 can’t be represented exactly in base 10. Evidently, 1234.1234 cannot be represented exactly in whatever system Excel is using.

However, let’s read what the OOXML specification actually says. Part 4, Section 3.17.6.6 says of the v element, that for numbers it shall store “The unformatted text form of the number, as accurately as possible.” I guess I would have thought that getting the 4th decimal place correct would be part of “as accurately as possible.” Evidently not.

When I load the document again, it displays and edits correctly, as “1234.1234”. So, what I entered is what I got back. So my first inclination is that 1234.1234 is the same as 1234.1233999999999, at least to the limits of machine precision. But I was wrong. If I enter 1234.1233999999999 into a cell and subtract it from 1234.1234 I get 1.00044E-11.

So just for fun I saved that file too. You would have thought that 1234.1233999999999 could be represented exactly. We just stored it before, right? But no, that would be too simple. What is actually stored is 1234.1233999999899!
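Under a correctly rounded string-to-double conversion, the three decimal strings in this story collapse onto just two distinct doubles. A sketch in Python (whose floats are IEEE 754 binary64, like Excel’s internal numbers; the variable names are mine):

```python
# Three decimal strings from the example above, parsed as 64-bit doubles:
a = float("1234.1234")             # what the user typed
b = float("1234.1233999999999")    # what Excel wrote on the first save
c = float("1234.1233999999899")    # what Excel wrote on the second save

print(a == b)    # True: the same double, spelled two different ways
print(a == c)    # False: a genuinely different double
print(a - c)     # about 1.0e-11 -- the same order as Excel's 1.00044E-11
```

So the first save changed the spelling but not the value, while the second save stored a genuinely different number.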

Now Excel seems to get the right answer every time, so we can’t complain there. But how sure are we that other OOXML applications will do this right? There is nothing in the standard that suggests that “1234.1233999999899” should be converted to “1234.1233999999999” when reading, but that “1234.1233999999999” should be converted to “1234.1234”.

Just for more fun, I open the OOXML file and edit the XML directly, adding a cell with the value “1234.1234”. Any guess for how this is displayed in Excel? Believe it or not, it displays correctly. So Excel doesn’t have any problem reading the right values, it just seems to have problems writing them. This may cause some confusion for other applications that read OOXML files generated by Excel.

In any case, the rule for storage, as stated in the OOXML specification (“as accurately as possible”) seems insufficiently defined, especially for a standard where numbers are important.

Argh. I think that OOXML is bad too, but all this hoopla about floating point is totally stupid, because ODF has the same “flaw”. Don’t believe me? Check it!

1. Create a spreadsheet in OOo with the value 123.123 in a cell. Save it.
2. Open content.xml and replace 123.123 with 1234.1233999999899.
3. Open the .ods file. Watch how 1234.1233999999899 was magically converted to 1234.12339999999.
4. Open content.xml and replace 123.123 with 1234.1233999999999.
5. Open the .ods file. Watch how 1234.1233999999999 was magically converted to 1234.1234.

So either ODF and OOXML are both acceptable here or both are bogus. If you read the IEEE 754 standard you’ll find why these numbers are read this way, and why the question “is 1234.1234 the same as 1234.1233999999999, at least to the limits of machine precision?” which you’ve raised is totally bogus. The answer is “maybe, maybe not”. WTF: it’s a simple question, right? Answer: nope, it’s not a simple question. If you take a look at IEEE 754 (both OOo and Excel use IEEE 754, since the x86 FPU uses IEEE 754) you’ll find that 1234.1233999999999 lies between 0x4093487e5c91d14d and 0x4093487e5c91d14e while 1234.1234 lies between 0x4093487e5c91d14e and 0x4093487e5c91d14f. So they can be the same, they can be different. And yes, every read/write cycle can introduce drift – such is the life of floating point numbers.
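The commenter’s hex values can be reproduced with a couple of lines of Python, since Python’s float is exactly the IEEE 754 binary64 type under discussion (a sketch of the argument, not of anything Excel-specific):

```python
import struct

# Show the raw IEEE 754 binary64 bit pattern of a parsed decimal string.
def bits(s: str) -> str:
    return struct.pack(">d", float(s)).hex()

print(bits("1234.1234"))             # 4093487e5c91d14e
print(bits("1234.1233999999999"))    # also 4093487e5c91d14e: same double
```

Both strings land on the nearest representable double, 0x4093487e5c91d14e.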

You cannot do anything beyond saying “as accurately as possible” in a standard if you plan to use it on different CPUs (some without an FPU, some with a non-IEEE-754-compliant FPU). The ODF specification does not even say this – it just says that value type “float” is encoded as a numeric value – that’s all.

Now, Excel 2007 does have a problem: it uses one routine to convert numbers for the .xlsx file and another for on-screen editing. But that’s not a problem with OOXML.

I understand that in the one-way process the string in the XML will need to be translated into the nearest floating point representation on the machine. No problems there. That’s the limits of finite precision.

However, it is important to point out what ODF does differently here. If I enter 1234.1234 into OpenOffice, it will save that string into the file. So it preserves the user input to full precision. This means I can save a document and load it on another machine with an application that perhaps does extended-precision calculations. But the OOXML file has already lost precision when it was saved. The way it is implemented in Excel will cause problems, as you point out, when exchanging documents between machines with different FPUs. A few roundtrips and you may see a significant drift.
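Whether round trips drift depends entirely on how many decimal digits the writer emits. A small Python illustration of how a 15-digit rendering loses information while a 17-digit rendering round-trips a double exactly (the example value is mine):

```python
# A double whose shortest exact rendering needs 17 significant digits:
x = 0.1 + 0.2                 # stored as 0.30000000000000004

short = "%.15g" % x           # "0.3" -- too few digits
full = "%.17g" % x            # "0.30000000000000004"

print(float(short) == x)      # False: the round trip lost information
print(float(full) == x)       # True: 17 digits round-trip this double
```

So a writer that truncates to 15 or 16 digits can drift; one that emits 17 digits (or the shortest round-trippable form) cannot.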

But this may be more a limitation of Excel than OOXML. OOXML says to store ‘as accurately as possible’. So an implementation could preserve the user input perfectly if they wanted to.

OpenOffice.org also “lost the 4th decimal place”. The data was also changed after it was entered. The same thing will happen if you try to process an ODF file with “1234.123399999999” as office:value – it’ll not survive the roundtrip and will be converted to “1234.1234”. And even the difference is of the same order! The only thing it did differently is timing: while OpenOffice.org did this immediately, Excel did this only at the time the file was saved. Yes, this is a problem with Excel. Yes, it’s bad. But is it a problem with OOXML? Nope.

But this may be more a limitation of Excel than OOXML. OOXML says to store ‘as accurately as possible’. So an implementation could preserve the user input perfectly if they wanted to.

Possible (with some BigNum library), but infeasible. OpenOffice.org, Excel, Gnumeric – they are all doing the same thing: they only preserve numbers up to what a 64-bit double supports.

Why the hell is Excel using 16 digits on screen and 17 in OOXML files? Easy: compatibility. The number on screen is not the “real” number (at least it was so in the old times when XLS was used). The “real” number is the double stored in the field – what you see in the “edit” field is just an approximation… Take a look at this file: http://vorfeed.canal.googlepages.com/Compatibility.xls

If you open it in Excel you’ll see 1.9429E-16, but OpenOffice.org shows 0 instead. Oops? Oops indeed…

But! If you edit the file in OpenOffice.org (add new cells, change the format of a cell, etc. – just don’t touch the numbers in cells A1 and A2) and save it as an .XLS file, you’ll be able to see 1.9429E-16 in Excel later! But if you export this file to ODF and then open it and save it as XLS – of course in this case the information will be lost, because OpenOffice.org uses 16 digits of precision in the ODF file as well as on the screen.

Excel keeps this information around in OOXML (which means no loss of precision after lots of open/save cycles, BTW) – but this same approach leads to the strange 1234.1234 => 1234.1233999999999 conversion!

NOW can you still say that what Excel does is wrong and what OpenOffice.org does is correct? I’m not so sure…

I suppose it would be different if the standard were written to say “All numeric values shall be stored according to the closest IEEE 754 value” or something like that. But the spec doesn’t say that. So when a user enters 1234.1234 there is no way that you can convince me that storing 1234.1233999999999 is more accurate than storing 1234.1234.

But maybe we’re just speaking past each other when using the word “accuracy”. I’m thinking that what the user enters is the value that must be matched – the strings they entered, not some internal representation that is neither chosen by the user nor defined by the standard.

In any case, I wouldn’t mind seeing OOXML defined more completely in terms of IEEE 754, with correct treatment of NaN, etc.

I think Victor has a technical point, but one that only serves as an excuse for the flaws.

By “opening up”, Microsoft was expected to provide a neutral XML representation that reflected what they had made available so far through the VBA environment. In VBA, it’s neutral: you enter a value, then get the same back.

The XML representation should do the same, or the specs should make it very explicit that the XML representation is different from what VBA provides. Since none of that is done, either in the documentation or in the implementation, the Microsoft Office XML formats are defective by design.

I have to add that having to worry about IEEE artefacts contradicts platform independence. What if you are using Java, for instance? This is a show-stopper in and of itself. The mess exists only because Microsoft “felt it this way” (exact wording from a Microsoft employee during the US INCITS 5-month review of ECMA 376), just like the Windows dates.

So when a user enters 1234.1234 there is no way that you can convince me that storing 1234.1233999999999 is more accurate than storing 1234.1234.

I will not even try. Because both 1234.1233999999999 and 1234.1234 are equally acceptable representations.

The fact is: there are no 1234.1234 number in computer. If we are talking about doubles, we have a choice between 0x4093487e5c91d14e and 0x4093487e5c91d14f. 0x4093487e5c91d14e is 1234.12339999999994688550941646099090576171… while 0x4093487e5c91d14f is 1234.12340000000017425918485969305038452148…

Both OpenOffice.org and Excel choose 0x4093487e5c91d14e to represent 1234.1234 – and that’s OK: it’s closer to 1234.1234 than 0x4093487e5c91d14f. Now the next step: we want to create a “text form of the number”. We have a choice between 1234.1234, 1234.1233999999999, 1234.1234000000001 (and an infinite number of other possibilities). Excel chooses to use 1234.1233999999999, OpenOffice.org chooses 1234.1234. Both choices are correct, and 1234.1234 is more aesthetically pleasing.
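The exact decimal expansion Victor quotes can be inspected with Python’s decimal module, which converts a binary double into its mathematically exact decimal value (a sketch; the digits shown are for the double 0x4093487e5c91d14e):

```python
from decimal import Decimal

# The exact decimal value of the double both applications pick for 1234.1234:
exact = Decimal(1234.1234)
print(exact)          # 1234.1233999999999468855094164...

# Both decimal spellings select that same double when parsed back:
print(float("1234.1233999999999") == 1234.1234)   # True

# Python's repr picks the shortest spelling that round-trips:
print(repr(1234.1234))        # '1234.1234'
```

This is exactly the “many text forms, one double” situation described above.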

So far, so good. But let’s talk about Compatibility.xls, shall we? Here we have different numbers: 3fb99999999999e2 and 3fb99999999999f0. 3fb99999999999e2 is 0.100000000000001004751837285766669083386… where 3fb99999999999f0 is 0.100000000000001199040866595169063657522…

Excel converts the first one to 0.100000000000001 and the second one to 0.1000000000000012 (correctly), but OpenOffice.org converts the second one to 0.100000000000001 (thus clearly losing precision).
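The two bit patterns from Compatibility.xls can be decoded and reformatted to show why the digit count matters. A sketch in Python; the hex strings are the ones quoted above:

```python
import struct

# Decode the two doubles quoted from Compatibility.xls:
a = struct.unpack(">d", bytes.fromhex("3fb99999999999e2"))[0]
b = struct.unpack(">d", bytes.fromhex("3fb99999999999f0"))[0]

print(a != b)                        # True: two genuinely distinct doubles
print("%.15g" % a, "%.15g" % b)      # identical at 15 significant digits
print("%.17g" % a, "%.17g" % b)      # 17 digits keeps them distinct
```

A writer that rounds to too few significant digits collapses the two values into one string, which is the precision loss being described.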

So here we have a situation where Excel always produces correct results, though sometimes they are not very aesthetically pleasing, while OpenOffice.org produces just plain wrong results from time to time. I don’t know about you, but I choose correctness over aesthetics 10 times out of 10.

But maybe we’re just speaking past each other when using the word “accuracy”. I’m thinking that what the user enters is the value that must be matched – the strings they entered, not some internal representation that is neither chosen by the user nor defined by the standard.

How about something like 0.10101010101010101010101010101010? All formats and all popular spreadsheets fail this test – thus this question is not even on the table, because all popular spreadsheets use doubles internally to store numbers and do not keep the data entered by the user for future reference. Maybe it would be a good idea to introduce some decimal bignum-based arithmetic in spreadsheets to cope – but that goes waaay beyond the ODF vs OOXML discussion… The last thing a storage format should do is introduce new concepts not used by existing programs… Sometimes it’s feasible, yes, but it’s still a huge change, not some bugfix…
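That test case is easy to run. A sketch in Python showing that a 64-bit double cannot hold the 32-digit input, while a decimal bignum type (here Python’s decimal module, standing in for the “decimal bignum-based arithmetic” idea) keeps every digit:

```python
from decimal import Decimal

s = "0.10101010101010101010101010101010"   # 32 significant digits

# A double keeps at most ~17 meaningful decimal digits, so the
# round trip through float() cannot reproduce the input string:
print(repr(float(s)))        # a shorter, rounded approximation
print(repr(float(s)) == s)   # False

# A decimal type preserves the entered digits exactly:
print(str(Decimal(s)) == s)  # True
```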

In any case, I wouldn’t mind seeing OOXML defined more completely in terms of IEEE 754, with correct treatment of NaN, etc.

A good idea for ODF too – last time I checked, it had text so vague as to permit storage of 1234.1234 as “1234”! There are no precision requirements at all, and OpenOffice.org does not even store enough information to correctly keep around some numbers representable as 64-bit doubles!

P.S. The really funny thing is that OpenOffice.org can read a number like 0.1000000000000012 from an .ODS file but can only store it in an .XLS file, not in an .ODS file. That’s how the Compatibility.xls file was created, actually! Weird, huh?

Here’s an example of what Excel’s treatment prevents. Imagine if the spreadsheet truly saved the numbers “as accurately as possible”. This would mean saving the user input exactly as it was entered. The application would store the literal input string. This is 100% accurate and makes no assumptions about floating point models.

As we know, doing arbitrary precision arithmetic in a spreadsheet would be too slow for most purposes. But you could imagine an application, or a web service, or even a mode within a spreadsheet that would recalculate a sheet with bignum or other high precision package. The results would be compared to the normal double precision results and a report made of which calculation cells had the greatest absolute or percentage error. For many applications this would be an invaluable service, to be able to check the numeric stability of the calculations, confirm significance of results, etc. But you can’t do this calculation correctly if you do not store the user input exactly. If you already convert higher precision user input into 64-bit floating point for storage, then this information is lost.
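The “stability report” idea can be illustrated with exact rational arithmetic. This is a hypothetical sketch (the names and inputs are mine), and it only works because the inputs are kept as the strings the user entered:

```python
from fractions import Fraction

# User-entered cell values, preserved as strings:
inputs = ["0.1"] * 10

# What a double-based spreadsheet computes:
double_sum = sum(float(s) for s in inputs)     # not exactly 1.0

# Exact recalculation from the original input strings:
exact_sum = sum(Fraction(s) for s in inputs)   # exactly 1

# Report the accumulated double-precision error:
error = abs(Fraction(double_sum) - exact_sum)
print(float(error))                            # roughly 1e-16
```

Had only the doubles been stored, the exact recalculation would no longer be possible, which is precisely the information loss described above.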

Horribly wrong. There are numerous bignum etc. packages which can handle exact calculation (or calculation to a predefined number of decimal places).

Sorry. You’ve misunderstood me. When I wrote “there are no 1234.1234 number in computer” I meant “there are no 1234.1234 number in computer at the time the file is saved by OOo or Excel”. Of course a computer can store 1234.1234 exactly (IBM’s computers did it 40 years ago; it’s not such a big problem). The fact is: neither Excel nor OpenOffice.org does so today.

But as a compromise I am willing to agree with you if you are willing to agree that OOXML is not suitable for financial calculations. Deal?

OOXML cannot be used for financial calculations even if you wanted to: it’s just a storage format. Whether you can use an OOXML-based package for financial calculations depends on the data model of that package (and yes, both Excel’s and OpenOffice.org’s data models are undocumented – unless you count the OpenOffice.org sources as documentation). If the data model used by the package is suitable for financial calculations, OOXML will not inhibit you; if it’s not (and I’m pretty sure Excel’s/OpenOffice.org’s data model is not: they store CURRENCY data as doubles and work with it as doubles!), then OOXML will not help you.

Imagine if the spreadsheet truly saved the numbers “as accurately as possible”.

Hmm… No need to imagine it. Excel does it today. OpenOffice.org does not.

This would mean saving the user input exactly as it was entered.

No. It does not mean that. What it does mean is “store what information about numbers you have available in this field as accurately as possible”.

You are mixing two questions:

1. How to store information about numbers in memory.
2. How to store information about numbers in a file.

The answer to the first question among the major players (Excel, Gnumeric, OpenOffice, SmartSuite, WordPerfect Office) is unanimous: numbers are stored in memory as 64-bit doubles. On some small devices 32-bit floats will be used instead, and some obscure spreadsheets will use BigNums. OOXML (and ODF) are not involved in this decision. What’s done is done. Neither ODF nor OOXML mandates anything here. But after this decision is made (and again: both Excel and OpenOffice.org chose 64-bit doubles), we have a requirement: to store information about this double in OOXML “as accurately as possible”.
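“As accurately as possible” for an in-memory double does have a precise best case: any decimal string that parses back to the identical bit pattern. A sketch in Python, whose repr emits the shortest such string (the value range is an arbitrary choice of mine):

```python
import random

# For any 64-bit double there exists a decimal string (at most 17
# significant digits) that parses back to the identical double, so
# repeated save/load cycles need never drift.
random.seed(42)
for _ in range(10_000):
    x = random.uniform(-1e9, 1e9)
    assert float(repr(x)) == x          # shortest round-trip rendering
    assert float("%.17g" % x) == x      # fixed 17-digit rendering
```

An OOXML (or ODF) writer that emitted such strings would satisfy the requirement exactly, for every double it holds.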

I have to start with a disclaimer: my knowledge of XML is tangential at best, but I know a thing or two about working with floating- or fixed-point (rational) numbers (I’m a computational physicist by trade). If my understanding of the situation is wrong and/or stupid, please feel free to correct me. Thank you in advance!

If I understand the content of the example XML file in Stephane’s 2nd example correctly, then the entered decimal numbers have been stored using their character/string representation (or, to speak in FORTRAN/IDL/etc. terms: “formatted”).

This implies for me that an XML parser has to process the stored number character by character (e.g. digit by digit) and then convert it anyway in order to arrive at the binary representation of the number (possibly using the already-mentioned IEEE 754 64-bit double-precision model as the target). Am I interpreting this correctly?

If so, then it would seem natural to me (aka the layman) to reuse the initial string representation whenever possible when it comes to storing native input, especially if the “as accurately as possible” part of the specification is followed literally (otherwise I would have expected something like “as accurately as permitted by the application and computing platform” or similar in the standard).

As a layman, I would furthermore have assumed that it should be possible for spreadsheet programs to keep track of which values are “invariant” to the internal representation of numbers because they are native input (i.e. are known initially by their string representation), and which content is the result of a calculation or other data manipulation and is thus affected by the precision/range/representation model of the application and computing platform.

If the string representation of “input” were preserved/reused with 100% fidelity, then it should be more or less trivial (albeit not free in terms of CPU utilisation and/or memory consumption, I admit) to recalculate the “spoiled” content of the document if the spreadsheet is reopened on a platform with different accuracy limitations (and thus a different “as accurately as permitted” threshold).

Background:

I only rarely have to actually work with Office applications (I’m mainly a TeX/LaTeX user :-) ), but from time to time I export data (typically originating from calculations with 80-bit long-double precision as the base) and perform some simple data analysis tasks or generate some fancy plots using spreadsheet applications. It would have been nice if OOXML honored my choice of input (e.g. how many digits I choose to export to an XML file). I admit that I naively assumed this would indeed be the case because of the “as accurately as possible” phrasing.

It is pretty clear that what is stored should not in any way depend upon the internal data model of the particular application. This means that all of this talk of IEEE formats is completely irrelevant.

Always, and let me make it clear for you, a-l-w-a-y-s, the stored value should be the exact (string value) representation of what the user entered. The application can do whatever its developers wish with the value internally, but for storage purposes, it should keep the user-entered values.

I think there is something missing, and I’ve not read the relevant sections of the specifications.

If I type “1234.1234” into a cell, “1234.1234” should be what ends up in the file if I don’t do any operation on the cell (e.g. “paste value”).

If the format is changed to “string”, which version will it store?

One of those problems you worry about when you do engineering is that errors accumulate, so you want to preserve precision as far as possible through a calculation chain (in embedded, I’m often using fixed point integers).

What is the wording of OOXML vs. ODF? Does OOXML (vs. ODF) specifically say it will convert user input that looks like a number into an IEEE 754 floating point and then convert it back into a string when storing it, or not? And if you change formats, precision, etc., does it say if and how the original user input is changed?

And it seems to be missed that OOXML isn’t “what Excel happens to do”; it is what OOXML states. If OOXML has truly specified all this interdependence and chaos (even without error or ambiguity), it should be rejected on that basis alone. If OOXML has not done so, then Excel isn’t compliant and needs to be fixed. If OOXML is ambiguous, it should be rejected until the ambiguity is resolved, and the resolution should be on technical merit, not what Excel happens to do.

Of course Mr. Rob is right in his explanation of what is happening – and also about the fact that it doesn’t make OOXML or Office any better or worse. I think it is obvious that a tabular data-manipulation and computation editor such as MS Excel would use simple doubles to store numbers during processing. For spreadsheets, say, 5000 rows long, the memory and processor overhead of storing a string representation of a number as well as the double would make it unusable. Or imagine calculating sums by parsing the original strings to total up your spreadsheet. Of course, it is logical that Excel could, for instance, have some special column format which would allow for more precise (from a financial standpoint) decimal-point calculations, where pennies wouldn’t be lost.