The Formula for Failure

It has been a boast for around around 6 months now. Microsoft’s OOXML fully defines spreadsheet formulas, and ODF doesn’t. The Microsoft boosters have been parroting the party line for quite some time.

OOXML devotes 324 pages of the standard to document the formulas and functions.

The original submission to the ECMA TC45 working group did not have any of this information. Jody Goldberg and Michael Meeks that represented Novell at the TC45 requested the information and it eventually made it into the standards. I consider this a win, and I consider those 324 extra pages a win for everyone (almost half the size of the ODF standard).

As far as those 6,000 pages of specs is concerned, there are 350 pages in the OpenXML spec alone — half of the entire ODF spec — just to describe spreadsheet capabilities, which ODF doesn’t have, Paoli says. For example, ODF can’t describe or calculate a formula in a spreadsheet.

“It may sound amazing. They are working on it now. But the current standard doesn’t have it,” Paoli tells me.

There are many other examples, if you care to seek them out. But what you will not find is an examination of what OOXML actually specifies for spreadsheet formulas, or confirmation that it was done sufficiently. Maybe the assumption is that this would be a trivial task, documenting Excel’s behavior? What could possibly go wrong?

Let’s find out.

First, let’s take the trigonometric functions, SIN (Part 4, Section 3.17.7.287), COS (Part 4, Section 3.17.7.50) and TAN (Part 4, Section 3.17.7.313). Hard to mess these up right? Well, what if you fail to state whether their arguments are angle expressed as radians or degrees? Whoops. Same problem for the return value of the inverse functions, ASIN (Part 4, Section 3.17.7.12), ACOS (Part 4, Section 3.17.7.4), ATAN (Part 4, Section 3.17.7.14), and ATAN2 (Part 4, Section 3.17.7.15). It is hard to have interoperable versions of these functions if the units are not specified. What kind of review in Ecma would miss something so simple?

The AVEDEV function (Part 4, Section 3.17.7.17) should return the average deviation of a list of values. However, the formula given for this function is actually for calculating the number of combinations of n things taken k at a time. Nice formula, though. Jakob Bernoulli would be proud. But anyone using an OOXML spreadsheet application that follows this standard will be perplexed at the values returned by their AVEDEV function. Did these formulas get any expert review in Ecma?

It is hard to have confidence in the CONFIDENCE function (Part 4,Section 3.17.7.47). It is said to return the confidence interval around a sample mean given an alpha value, a standard deviation and a sample size. The problem is that this problem is under-defined. One must make an assumption, not stated here, as to the shape of the data distribution. Is it normally distributed data? Exponentially distributed? Weibull distribution? The standard does not define the meaning of this function sufficiently for one to implement it.

The CONVERT function (Part 4, Section 3.17.7.48) converts from one unit to another. Some conversions explicitly allowed include liquid measure conversions such as from liters to cups or tablespoons. But whose cup and whose tablespoon? Traditional liquid measures vary from country to country. In the US, a cup is 8oz, except for FDA labeling purposes when a cup is 240ml. But in Australia a cup is 250ml and in the UK it is 285ml. Similarly a tablespoon has various definitions. OOXML is silent on what assumptions an application should make. I guess I won’t be using OOXML to store my recipes, and certainly not to calculate medical doses!

Almost every one of the financial functions in OOXML depends on a “day count basis” flag, such as US (NASD) 30/360, Actual/Actual, Actual/360, Actual/365, European 30/360. These represent various conventions for how days and months are counted. The problem is that the OOXML standard does not define these conventions, nor does it point to an authority for their definition. There are subtle behaviors here, especially when dealing with leap years and Excel’s deviant treatment of dates in the year 1900. So lack of detailed definitions in this area make it impossible for anyone to rely on identical financial calculations from different OOXML implementations. This, in a field where being off by a penny can cause problems.

Almost 30 spreadsheet functions are broken in this way.
(What do you call a scientist whose calculations are off by 50%? A cosmologist. What do you call an accountant whose calculations are off by 1%? A crook.)

The NETWORKDAYS function (Part 4, Section 3.17.7.344) seems simple enough. It returns the number of workdays (non weekend days) between two dates. Simple enough. Unless you live in the Middle East. The problem is that this function doesn’t provide a facility for distinguishing the different weekend conventions. I may have a weekend on Saturday & Sunday, but a colleague in Tel-Aviv might have off Friday and Saturday, while in Cairo it might be Thursday and Friday. This function lacks the adaptability to deal with this important cultural difference. Saying that the definition of the weekend is implementation- or locale-dependent won’t work either. I may be a French company in Paris dealing with contractors in Algeria. I need to have a French spreadsheet calculate schedules for workers at various locations and be able to exchange it with others offices using other OOXML applications and expect that they will get the same answer. Lacking cultural adaptability, OOXML fails approximately a billion people here.

Another example. Several of the statistical functions in OOXML are defined incorrectly. Take for example, the ZTEST function (Part 4, Section 3.17.7.352). The key error is following the formula where it says, “where x is the sample mean.” The problem is that x-bar is the sample mean, not x. Someone who implements according to the text will give their users the wrong answer. A similar error is repeated in 8 other statistical functions. Certainly this is a typographical error, but this error changes the answer. Remember, this is an approved Ecma Standard and a proposed ISO Standard, not a 4th grade school essay. Denmark and Massachusetts have already said they will adopt OOXML for official business. Spelling counts. Providing the right formula and the right description counts. Copy and paste errors should have been taken care of back during the Ecma review.

I’ve submitted these spreadsheet formula issues, and many others, to INCITS V1, for consideration in determining the US position on the OOXML ISO ballot, but we never got to them during our two-day meeting in DC a couple of weeks ago, and may not get to them at all. There are simply too many other issues to read through and discuss. But I thought it was important to bring up these formula issues in particular, since Microsoft seems especially proud of their work in this area, delusions of adequacy which on reflection must now seem unwarranted. I’m especially concerned with the financial functions, since they are outside my area of expertise and may have additional errors that I missed.

So what is ODF doing about formulas? We’re continuing to work on them. Rather than rush, we’re doing careful, methodical work. We’re documenting the functions in great detail. Where we have the choice between the common naive formula for a function and one that is numerically stable, we’re documenting the stable function. For the NETWORKDAYS function, we created an optional extra parameter, so a user can pass in a flag that tells what their weekend conventions are. We have a professor of statistics reviewing our statistics functions for completeness and accuracy. We’re verifying our assumptions about financial functions by referring to core specifications from groups like the ISDA and the NASD. We’re creating a huge number of test cases and checking them with Excel and other applications.

Under Sarbanes-Oxley, a CEO or CFO puts himself at personal risk if he signs off on financial numbers derived from processes and tools that he knows to give erroneous results. So we utterly reject a rushed process that has lead to an Ecma Standard which incompletely and incorrectly defines spreadsheet functions. Some things are worth taking the time to do right.

As I’ve shown, in the rush to write a 6,000 page standard in less than a year, Ecma dropped the ball. OOXML’s spreadsheet formula is worse than missing. It has incorrect formulas that, if implemented according to this standard, would raise important health, safety and environmental concerns, aside from the obvious financial risks of a spreadsheet that calculates incorrect results. This standard is seriously messed up. Shame on all those who praised and continue to praise the OOXML formula specification without actually reading it.

At least with regard to trig functions, at some point customary practice should be taken into account. I work with a lot of numeric software packages, and all of them take radians as arguments to SIN, COS, TAN.

Still, explicitness wouldn’t have hurt.

Re hubert’s comment: what documentation of the Lotus SmartSuite file formats? There’s stuff floating around the web on .WK1, but I’ve yet to see anything on .123.

In a numeric analysis package, I’d certainly expect angles to be in radians. But in a consumer-oriented spreadsheet? I’m not sure I have strong expectations there. Certainly the most commonly used angles have integral representations in degrees and so can be stored without loss of precision in XML that way. In any case, this needs to be explicit.

As far as the 1-2-3 file format, I’m not sure that is relevant. We’re not trying to make an ISO standard of the the 1-2-3 format specification.

Regarding the trig functions, what happened to the ‘OO’ part? Shouldn’t sin() take an object of type Angle and asin() return an object of type Angle… then, if you choose to view the object in radians or degrees that’s your choice. It is poor design to have sin() take a scalar argument and rely on documentation to tell the user the convention that is used to interpret the scalar when compared to a stronger typing.

“Wow, OOXML spreadsheet forumlas may cause loss of life if implemented as described by the spec?

I’m not an expert on these things, but thats a very, very serious allegation.”

The OOXML formula functions give incorrect answers to simple math functions. What happens if you’re a doctor, prescribing medication for an illness? A small error could mean the difference between getting well and being poisoned.

“The OOXML formula functions give incorrect answers to simple math functions. What happens if you’re a doctor, prescribing medication for an illness? A small error could mean the difference between getting well and being poisoned”

Hmm, surely that implies that this is what is already happening today r.e. older versions of Excel (assuming OOXML is born out of the older binary formats). But some sort of proof would be nice dont you think?

Note that I am not pointing out instances where OOXML documents legacy Excel bugs. That certainly happens in other places, such as with leap year calculations, but that is not what this post is about. What I’m talking about here is where the OOXML specification is clearly incorrect, where due to hasty preparation and insufficient review the text of OOXML is riddled with errors.

The Therac-25 problem is a notorious machine that yes, killed people, because of a simple software bug (where backspacing didn’t clear all the underlying data so the number for the dose of x-rays was wrong).

Given the large number of errors or ambiguities in the OOXML spec, would you let your life depend on the result in a cell calculated by any implementation of OOXML?

But Kudos to the ODF team – I remember a comment when it was adopted that it didn’t specify formulae, but had some working documents. And it apparently has become a more difficult job, but when ODF finally submits its definition to the standards bodies, it will be an air-tight, unambiguous document with all the references so anyone will know exactly what should come out.

I remember when the IEEE floating point specs were out and every device specified the accuracy of the calculations, the exact definitions and behaviors of NaNs and when you divided by infinity or zero. And this was just the 4-function calculator stuff.

On one project – it was a simple trip computer – I went and documented every routine, variable, and transform function to the point of using TeX to enter the formulae for the things which weren’t linear. It only took a few days, but later I was told that every software design document should be that good. It wasn’t even that difficult or tedious.

Microsoft, as usual, has chosen to boast about something that they should have been to embarrassed about and should have hoped no one would notice. Maybe they had Clippy go through and dump the help files for formulae right into the OOXML spec.

Microsoft has billions of dollars and a lot of smart people. I find it contemptible that they wouldn’t even take the time to polish, much less develop a shining example of what even a critical part of a specification should be. We’re talking fundamental mathematics here, not font aliasing. They could do what ODF is doing and maybe more as they lack no resource except the will to do things – even important public things – with quality and the right way.

Their attitude in everything seems to be “we don’t care, we don’t have to”.

Need some consistency in the criticisms here. If one person wants to claim angles in degrees make more sense in consumer-targeted software, then so does the colloquial equivalence of weight and mass. Either spreadsheets are specialized mathematical software (in which case RADIANS RULE!) or it’s consumer software (in which case expect colloquial terminology at odds with technically correct usage).

I’m surprised no one has yet bitched, whined or moaned about Excel’s, er, OOXML’s nonstandard arithmetic operator precedence in which =-3^2 returns 9 rather than -9, which is good for a long newsgroup thread every calendar quarter or so. Maybe no one here’s complaining because OOo Calc does the SAME STUPID, NONSTANDARD THING. And why would they do that? Perhaps compliance with the DE FACTO STANDARD of how Excel ACTUALLY WORKS has so much OBVIOUS, PRACTICAL VALUE that the OOo developers can’t work up the courage to ignore it? Perhaps because StarOffice (back in the old days) made similarity to Excel a primary design goal, and simply screwed up their CEILING function’s implementation. But why not make a virtue out of past screw-ups?!

Forgive me if this has been mentioned here or elsewhere already, too many issues to make sure I’ve read them all.

Also checkout the explicit definitions for cell references. Both A1 style and R1C1 style allow for “decimal numbers” with varying restrictions like positive or optional sign, etc. (sections 3.17.2.3.1 and .2). The definition of a decimal number is found in the last paragraph of 3.17.5.3 Lexical Representation. It is:

A Decimal Number has a lexical representation consisting of a finite-length sequence of decimal digits (#x30–#x39) separated by a period as a decimal indicator. An optional leading sign is allowed. If the sign is omitted, “+” is assumed. Leading and trailing zeroes are optional. If the fractional part is zero, the period and following zero(s) can be omitted. [Example: -1.23, 12678967.543233, +100000.00, 210. end example]

So the column in an A1 reference or either of the numbers in a R1C1 reference can be fractional by definition. I haven’t found (after a very brief search) any mention of how to deal with decimal numbers in references, such as round (up|down) or throw an error like #NA.

I know they probably mean a decimal integer, but if that’s what they mean then that’s what should have been specified.

On the other hand, ODF doesn’t seem to define the number portion of a cell reference, but at least it doesn’t specify that fractional portions are allowed.

@Harlan:Last time I checked a negative number times a negative number results in a positive number. So -3^2 means -3*-3 which equals 9. Where’s the problem?

“I’m surprised no one has yet bitched, whined or moaned about Excel’s, er, OOXML’s nonstandard arithmetic operator precedence in which =-3^2 returns 9 rather than -9, which is good for a long newsgroup thread every calendar quarter or so.”

That would be a unary minus which is correctly identified by the parser as -3 being squared. The order of precedence here is valid AND correct.

To get -9 you would enclose the power computation in parentheses: -(3^2)

Rob, it is not a matter of pushing the Smart Suite file format to standards, it is a matter of documenting it so that legacy can be convert to standard file format. This is all straight down what you have been advocating for a while.

To the people who don’t understand the problem with =-3^2 returning 9:

Yes, if unary minus is considered part of the number, then it’s natural to interpret it as (-3)^2 = 9. The problem comes when x = 3, and the formula is =-x^2. Now what should the answer be?

And let’s go back to CUSTOMARY USAGE. I’ve gone through this in a few newsgroup threads, but Excel shares the semantics that =-x^2 returns 9 when x = 3 (or x = -3) with COBOL and maybe one SQL dialect. Nearly all other programming languages, non-Excel-clone spreadsheets, stats packages and DBMSs would return -9 since they adhere to BODMAS (i.e., textbook and academic standard) arithmetic operator precedence in which exponentiation happens before anything other than parentheses grouping.

Excel got this wrong on its own. StarOffice got it wrong because it copied Excel. OOo Calc gets it wrong because it copied StarOffice. No doubt now y’all are going to try to make this a virtue rather than admit the possibilities that (1) you could ever do anything wrong, (2) you were just trying to provide backwards compatibility, or (3) there’s overwhelming practical value to doing the same thing as Excel does.

You want credibility, admit your own shortcomings. Try being HONEST ABOUT YOURSELVES for a change.

Specific functions should never have been specified. Function representation, to the extent it differred between built-in, user-defined via built-in programming/scripting facilities (e.g., VBA udfs in Excel), or add-on via DLLs (COM or otherwise), should have been in the FILE FORMAT spec. This leaves the question whether they should have specified arithmetic operator precedence. Myself, I think it was a mistake to include even that in the specification.

Which presumably means that a calculation done with formulas vs. VBA probably doesn’t yield the same results.

Also, Hubert, all IBM need do is provide a mechanism to convert SmartSuite docs to ODF format consistently and cleanly. If they take responsibility for providing a bridge from a proprietary legacy to a current standard, that is VASTLY more useful than simply documenting a proprietary standard and leaving it up to the customer.

Re Eddie Edward’s comment: ‘any floating-point number can be stored without loss of precision in XML.’ Definitions are essential. If he means a 64-bit IEEE double precision floating point number could be stored exactly as a string of 64 1s and 0s, he’s right. But if he means arbitrary REAL numeric values, such as rational fractions of PI (as all integer degrees would be when converted into radians), he’s wrong. PI, being a transcendental number, has no finite digit representation in any radix, so unless ODF handles infinitely long XML tags in infinitely long ODF files, there’d be potential loss of precision due to practical necessity.

Do I understand correctly that ISO is now asked to redefine mathematics?

The chosen standard form also makes it impossible to implement a spreadsheet with the mathematically correct version of these functions in the future.

If my Spreadsheet (or other) program wants to be mathematically correct, it will be unable to store it’s results in an OOXML file as this doen not allow to flag the mathematically correct function. Any standard should at least allow to store the correct functions.

OOXML also redefines many definitions in other ISO standards (eg, ISO C).

I also do not understand people who claim that the ISO STANDARD should contain errors because this is easier for the APPLICATIONS.

Whether the stored SIN() is in degrees, 400degrees, or radians, or if CEIL() works the wrong way round is completely irrelevant for the application. It can present the USER with every representation it wants by just converting.

But this comes down to the same stupid discussion about MS Office being competely unable to convert date formats and just simply was “forced” to define 1900 as the beginning of all time and a leap year into an ISO standard.

I just think Rob was right, as confirmed by Bill Hilf, when he wrote that OOXML is nothing but a detailed description of one version of the inards (DNA sequence) of Office 2007. These Excel functions are just one proof in very many.

What i found in the little that ODf writes on spreadsheet cell is that it has seperated the presentation content and the value of the cell. So each cell with a value in an ODF spreadsheet now has two values. That is weird. It is a source of confusion and could easily lead to the creation of spreadsheets that have non-matching values. ODF does not state how to deal with non-matching values in spreadsheet cell. It would have been better if a cell had only one value and that style elements would decide on the presentation of that value.

The general idea is that a spreadsheet document stores the cell contents (a literal value or a formula) as well as the last-calculated value. Most spreadsheets do it this way (Excel, 1-2-3, OpenOffice, etc.). The benefit is that this allows a lightweight viewer to read and display a spreadsheet without having the overhead of a full spreadsheet calculation engine. The lightweight viewers can ignore the formulas and just display the results that were saved with the file. Without this, you wouldn’t have the ability to easily convert an ODF spreadsheet into an HTML table.

Another inconsistency (product of the rush to get the standard out) can be found in the formula for HYPGEOMDIST (3.17.17.145), where the formula is given, but its terms are not explained. It even contains a variable called lambda, which corresponds to the mean parameter of the function.

Are you sure about the ZTEST functions? What I’ve found in Wikipedia seems to show that OOXML is correct on this one…(I’m building a collection of OOXML shortcomings in spanish, and wanted to double check this)

Anders, the error in ZTEST is the text that says, “where x is the sample mean…”. It should say. “where x-bar is the sample mean”.

As you illustrte with you HYPGEOMDIST comment, it is important that where mathematical notation is used that the variables of the equation are correctly and unambiguously mapped to the spreadsheet function’s arguments.

I am kind of surprised to not see this mentioned, but with these broken formula in the specification (and other obvious flaws), it will actually be quite easily provable that MSOffice does not faithfully follow the specification. This will mean there is not even ONE implementation of this ‘standard’ in the wild.