How many defects remain in OOXML?

DIS 29500, Office Open XML, was submitted for Fast Track review by Ecma as a 6,045-page specification. (After the BRM, it is now longer, maybe 7,500 pages or so. We don’t know for sure, since the post-BRM text is not yet available for inspection.) Based on the original 6,045-page length, a 5-month review by JTC1 NB’s led to 48 defect reports, reporting a total of 3,522 defects. Ecma responded to these defect reports with 1,027 proposals, which the recent BRM, mainly through the actions of one big overnight ballot, approved.

So what was the initial quality of OOXML, coming into JTC1? One measure is the defect density, which we can say is at least one defect for every 6045/1027 ≈ 5.9 pages. I say “at least” because this is a lower bound. If we believed that the 5-month review represented a complete review of the text of DIS 29500, by those with relevant subject matter expertise, then we would have some confidence that all, or at least most, defects were detected, reported and repaired. But I don’t know anyone who really thinks the 5-month review was sufficient for a technical review of 6,045 pages. Further, we know that Microsoft worked actively to suppress the reporting of defects by NB’s. So the actual defect density is potentially quite a bit higher than the reported defect density.

But how much higher? This is the important question. It doesn’t matter how many defects were fixed. What matters is how many remain.

There are several approaches to answering this question. One approach is to look at defect “find rates”, the number of defects found per unit of time spent reviewing, fit that to a model, typically an S-curve (sigmoid), and use that model to predict the number of defects remaining. However, we have no time/effort data for the DIS 29500 review, so we don’t have enough data to create that model. Another approach is to randomly sample the post-BRM text and statistically estimate the defect density from this sample.

Are there any other good approaches?

Here is the plan. I will use the second approach. Since I do not actually have the post-BRM text, I need to make some adjustments. I’ll start with the original text, in particular Part 4, the XML reference section, at 5,220 pages, where the meat of the standard is. I’ll then create a spreadsheet and generate 200 random page numbers between 1 and 5,220. For each random page I will review the clause associated with that page and note the technical and editorial errors I find. I will then check these errors to see if any of them were addressed by BRM resolutions.
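For anyone who would rather script the sampling step than use a spreadsheet, here is a minimal sketch. It assumes sampling without replacement (no page drawn twice) and a fixed seed so the list can be reproduced; neither choice is stated above, and the function name is made up for illustration.

```python
import random

PART4_PAGES = 5220   # page count of Part 4, as stated above
SAMPLE_SIZE = 200    # target number of pages to review

def draw_sample(seed=0):
    """Return SAMPLE_SIZE distinct page numbers in 1..PART4_PAGES, in draw order."""
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility
    return rng.sample(range(1, PART4_PAGES + 1), SAMPLE_SIZE)

pages = draw_sample()
print(pages[:25])  # the first 25 pages to review
```

Keeping the pages in draw order, rather than sorting them, means that any prefix of the list (the first 25, the first 50) is itself a random sample.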

Based on the above, I will be able to estimate two numbers:

The defect density of the text, both pre and post BRM

The fraction of defects which were detected by the Fast Track review.

So if I find N defects, and 0.9N of those issues were already found during the Fast Track review and were addressed by the BRM, then we can say that the Fast Track procedure was 90% effective in finding and removing errors. Some practitioners would call that the defect removal “yield” of the process. But if we find that only 0.1N of the errors were reported and addressed by the BRM, then we’ll have a different opinion on the sufficiency of the Fast Track review.

Clear enough? Microsoft is claiming something like 99% of all issues were resolved at the BRM. So let’s see if we get anything close.
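The yield arithmetic described above can be sketched in a few lines. The function name is hypothetical, and the counts in the usage lines are just the 0.9N and 0.1N cases from the text with N = 100.

```python
def removal_yield(sampled_defects, addressed_by_brm):
    """Fraction of the sampled defects that the BRM had already addressed."""
    if sampled_defects <= 0:
        raise ValueError("need at least one sampled defect")
    if not 0 <= addressed_by_brm <= sampled_defects:
        raise ValueError("addressed count must lie between 0 and the sample count")
    return addressed_by_brm / sampled_defects

print(removal_yield(100, 90))  # the 0.9N case
print(removal_yield(100, 10))  # the 0.1N case
```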

I’m not done with this study yet. I’m finding so many defects that recording them is taking more time than finding them. But since this is topical, I will report what I have found so far, based on the first 25 random pages, or 1/8th completion of my target 200. I’ve found 64 technical flaws. None of the 64 flaws were addressed by the BRM. Among the defects are some rather serious ones such as:

Storage of plain text passwords in database connection strings

Undefined mappings between CSS and DrawingML

Errors in XML Schema definitions

Dependencies on proprietary Microsoft Internet Explorer features

Spreadsheet functions that break with non-Latin characters

Dependencies on Microsoft OLE method calls

Numerous undefined terms and features

As I said, this study is still underway. I’ll list the defects I’ve found so far, and add to it as I complete the task over the next few days.

Page 692, Section 2.7.3.13 — no errors found

Page 1457, Section 2.15.3.45 — This is a compatibility setting which creates needless complexity for implementers who now must deal with two different ways of handling a page break, one in which a page break ends the current paragraph, and another where it does not. This is not a general need and expresses only a single vendor’s legacy setting.

Page 490, Section 2.4.72 — This defines the ST_TblWidth type, used to express the width of a table column, cell spacing, margins, etc. The allowed values of this type express the measurement units to be used: Auto, Twentieths of a point, Nil (no width), Fiftieths of a percent. I find these choices to be capricious and not based on any sound engineering principle. It also mixes units with width values (Nil) and modes (auto). This should be changed to allow measurements in natural units, as allowed in XSL-FO or CSS2: mm, inches, points, picas. Also, do not mix units, values and modes in the same attribute. Nil is best represented by the value 0 and Auto should be its own Boolean attribute.

Page 328, Section 2.4.17 — The frame attribute description says it “Specifies whether the specified border should be modified to create a frame effect by reversing the border’s appearance from the edge nearest the text to the edge furthest from the text.” This is not clear. What does it mean to reverse a border’s appearance? Are we doing color inversions? Flipping along the Y-axis? What exactly? Also a typographical error: “For the right and top borders, this is accomplished by moving the order down and to the right of its original location.” Should be “moving the border down…” Also, it is not stated how far the border should be moved.

Page 1073, Section 2.14.8 — This feature is described as: “This element specifies the connection string used to reconnect to an external data source. The string within this element’s val attribute shall contain the connection string that the hosting application shall pass to a external data source access application to enable the WordprocessingML document to be reconnected to the specified external data source.” Since connection to external data typically requires a user ID and a password, the lack of any security mechanism on this feature is alarming. The example given in the text itself hardcodes a plain-text password in the connection string.

Page 4387, Section 6.1.2.3 — For the “class” attribute it says “Specifies a reference to the definition of a CSS style.” The example implies that some sort of mapping will occur between CSS attributes and DrawingML. But no such mapping is defined in OOXML. The “doubleclicknotify” attribute implies some sort of event model that is undefined in OOXML. How do you send a message for doubleclicknotify? Why do we describe organization chart layouts here when it is not applicable to a bezier curve? What happens if this shape is declared to be a horizontal rule or bullet or ole object? The text allows you to label it as one of these, but assigns no meaning or behavior to this. Why do we have an spid as well as an id attribute? The “target” attribute refers to Microsoft-specific I.E. features such as “_media”. Although the text says that control points have default values, the schema fragment does not show this.

Page 3164, Section 4.6.88 — This and the following two elements are all called “To” but this seems to be a naming error. 4.6.89 is essentially undefined. What does “The element specifies the certain attribute of a time node after an animation effect” mean? It doesn’t seem to really signify anything. Ditto for 4.6.90.

Page 5098, Section 7.1.2.124 — The example does not illustrate what the text claims it does. The example doesn’t even use the element defined by this clause.

Page 4492, Section 6.1.2.11 — The “althref” attribute is described as “Defines an alternate reference for an image in Macintosh PICT format”. Why is this necessary for only Mac PICT files? Why would “bilevel” necessarily lead to 8 colors? We’re well beyond 8-bit color these days. “blacklevel” attribute is defined as “Specifies the image brightness. Default is 0.” What is the scale here? This needs to be defined. Is it 0-1.0, 0-255 or what? And what is “image brightness” in terms of the art? Is this luminosity? Opacity? Is this setting the level of the black point? For “cropleft”, etc. — what units are allowed? (implies %) How does “detectmouseclick” work when no event model is defined? “emboss effect” is not defined. “gain” has the same problem as “blacklevel” — no scale is defined. This element has two different id attributes in two different namespaces, with two different types. “movie” attribute is described as “Specifies a pointer to a movie image. This is a data block that contains a pointer to a pointer to movie data”. Excuse me? “A pointer to a pointer to movie data”? This is useless. The “recolortarget” example appears to contradict the description. It shows blue recolored to red, not black. The “src” attribute is said to be a URL, yet is typed to xsd:string. This should be xsd:anyURI.

Page 1431, Section 2.15.3.30 — no errors noted

Page 3405, Section 5.1.5.2.7 — The conflict resolution algorithm should be normative, not merely in a note.

Page 875, Section 2.11.21 — Instead of saying that the footnote “pos” element should be ignored if present at the section level, the schema should be defined so as to not allow it at the section level. In other words, this should be expressed as a syntax constraint.

Page 1955, Section 3.3.1.20 — This facility for adding “arbitrary” binary data to spreadsheets is said to be for “legacy third-party document components”. No documentation or mapping for such legacy components has been provided, so interoperability with this legacy data cannot be achieved. Why isn’t this expressed using the extension mechanisms of Part 5 of the DIS?

Page 4526, Section 6.1.2.13 — The “allowoverlap” attribute is not sufficiently defined. In particular, what determines whether the object shifts to right or left? ST_BWMode is not adequately defined. For example, one option is “Use light shades of gray only”. How light? And what is the difference between “hide” and “undrawn”? Also, the concept of “wrapping polygon” is not sufficiently defined. For example, what is the wrapping polygon for an oval? The purpose of “dgmlayoutmru” is obscure. Wouldn’t the most-recently-used layout option be the one which is actually in use, “dgmlayout”? The “dgmnodekind” attribute is undefined, said to be “application-specific”. Is interoperability not allowed? The text seems to imply that applications must use application-specific values. The “href” attribute is given a string schema type. Shouldn’t this be xsd:anyURI? The “id” attribute is said to be a “unique identifier”. Unique in what domain? Among shapes of this type? Among all shapes? All shapes on this page? Among all ID’s in the document? The “preferrelative” attribute is not sufficiently defined. Where is the original size stored? After what reformatting? This appears to be a specification for runtime behavior, not a storage artifact. But it is not clear what is required. For the “regroupid”, where is the list of these possible id’s? The Hyperlink targets _media and _search are Internet Explorer proprietary features.

Page 1193, Section 2.15.1.39 — no errors noted

Page 1459, Section 2.15.3.46 — no errors noted

Page 2671, Section 3.17.7.150 — no errors noted

Page 2347, Section 3.10.1.69 — An “AutoShow” filter is not defined in this standard, though it is called for in several places of this section. “Average” aggregation function is not defined. In fact, none of these aggregation functions are defined. Although some have common mathematical definitions, in a spreadsheet context it is critical to make an explicit statement on treatment of strings, blanks, empty cells, etc. For dataSourceSort, what type of sort is required? Lexical or locale-sensitive? This element seems to mix field-specific settings, like dragToCol with pivotTable-wide settings like hiddenLevel. This will result in large data redundancy as settings like hiddenLevel are stored multiple times, once for each pivotField. “Inclusive Mode” is not defined. “Measure based filter” is not defined. “AutoSort” mode is not defined. The resolution of pivot table versus cell styles is ambiguous. “If the two formats differ, the cell-level formatting takes precedence.” Is this negotiation done at the level of the entire text style? Style ID? Or at the attribute level? “Outline form” is not defined. “server-based page field” is not defined. (what is a page field?) “member caption” is undefined.

Page 2885, Section 3.18.51 — The values of the given type (ST_OleUpdate) are explicitly tied to the Microsoft Windows OLE2 technology via the two method calls IOleObject::Update or IOleLink::Update

Page 3951, Section 5.5.3.4 — The base values “margin” and “edge” are ambiguous. Is it specifying positioning from the left or right page edge?

Page 2710, Section 3.17.7.200 — The description of “lookup-vector” is insufficient. It seems to be saying that the range should be sorted. Is this really correct? Spreadsheet functions typically do not have side effects. Also, the sorting procedure is explicitly defined only for the Latin alphabet. What about the rest of the allowed Unicode characters, including the C0 control characters which are allowed in SpreadsheetML cell contents? Where are they sorted?

Page 934, Section 2.13.5.5 — The “id” attribute is required to be unique, but it is not specified over what domain it must be unique.

Page 607, Section 2.6.2 — What does “reversing the border’s appearance” mean? How much offset is required for a shadow?

Page 201, Section 2.3.2.19 — This feature allows the suppressing of both spell and grammar checking for a text run. These should be two different settings, one for spelling and one for grammar proofing. There are many cases where it is important to check one, but not the other, such as content comprised of sentence fragments, which are not grammatically complete, but where correct spelling is desired.

Page 1240, Section 2.15.1.74 — This setting specifies that the document should be saved into an undefined invalid XML format. But it is not stated how an XSLT transform can be applied to an OOXML document, since OOXML is a Zip file containing many XML documents. So what exactly is the specified XSLT applied to?

That’s as far as I’ve gone. But this doesn’t look good, does it? Not only am I finding numerous errors, these errors appear to be new ones, ones not detected by the NB 5-month review, and as such were not addressed in Geneva. Since I have not come across any error that actually was fixed at the BRM, the current estimate of the defect removal effectiveness of the Fast Track process is < 1/64, or about 1.6%. That is an upper bound. (Confidence interval? I’ll need to check on this, but I’m thinking this would be based on the standard error of a proportion, where SE = sqrt(p*(1-p)/N), making our confidence interval 1.6% ± 3%.) Of course, this value will need to be adjusted as my study continues. However, it is starting to look like the Fast Track review was very shallow and detected only a small percentage of the errors in the DIS.
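A quick check of that standard-error arithmetic, as a sketch only. It plugs in p = 1/64 and N = 64 as above; the 1.96 multiplier (a 95% normal-approximation interval) is my assumption, since no confidence level was stated, and the normal approximation is rough for a proportion this close to zero.

```python
import math

N = 64       # defects found in the sample so far
p = 1 / N    # upper-bound proportion used in the text (none were actually fixed)
Z_95 = 1.96  # assumed multiplier for a 95% interval

se = math.sqrt(p * (1 - p) / N)  # standard error of a proportion
half_width = Z_95 * se

print(f"p = {p:.4f}, SE = {se:.4f}, interval = p +/- {half_width:.4f}")
```

The half-width comes out at roughly three percentage points, which matches the ± 3% quoted above.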

[20 March Update]

As one commenter noted, the page numbers I’m using above are PDF page numbers, not the page numbers at the bottom of each page. If I used the printed pages then I would need to deal with all the Roman numeral front matter pages as an exception. Simpler to just use the one large domain of PDF page numbers.

PDF Page Number = Printed Page Number + 7

I will continue to report new defects, according to the original random number list I generated. I’ll update the statistics every 25.

Here’s some more for today:

Page 4192, Section 5.8.2.20 — “fPublished” attribute is defined as “Specifies whether the shape shall be published with the worksheet when sent to the spreadsheet server. This is for use when interfacing with a document server.” What worksheet? This section is in the DrawingML reference material. Charts could appear in presentations as well. This should not be limited to worksheets. Also what is a “spreadsheet server”? No such technology has been defined in this standard. Also no protocol has been defined for publishing to a spreadsheet server. Is this some proprietary hook for SharePoint? The “macro” attribute allows the storage of application-defined scripts. We are told that the macro “should be ignored if not understood.” However there is no mechanism for determining what language the script is in. How do we know if we understand the macro? Content sniffing? Attempt to execute it and see if we get a runtime error? But by that time, once we find out that we do not understand it, it is too late to ignore the macro. We may have already triggered runtime side effects. What we really need here is some way to declare what scripting language is being used, via a namespace or an additional attribute like “lang”.

Page 3526, Section 5.1.5.4.21 — The “algn” attribute specifies the text alignment. Allowed values include left, right, center, justified, etc. However, what is lacking is “start” and “end” alignment, which are sensitive to writing direction and are part of internationalization best practices, as in, for example, XSL-FO. When translating a document between RTL and LTR systems, the approach used by OOXML will be harder to deal with and more expensive to translate, since the translator will need to manually adjust styles and cannot just perform a semi-automated translation.

[End Update]

I’ll continue to review the remaining 173 pages of my random sample and update the numbers and the defect list as I go. If you want to play along at home, the upcoming random page numbers will be:

Those OLE references are of concern. Ecma’s response to DK-0031 was to include faulty Bonobo & KParts examples which were approved at the BRM.

I’m told now though that the editor is at liberty to remove all OLE references because of the expressed notion (in some responses) that they’ll remove it.

1) Are they at liberty to?

2) Considering that they were not able to successfully remove it during the regular fast-track process (before Jan 14th), how do we know that they’ll be able to for the final text?

OK, this is what I found ( in a couple of hours of work, because I have better things to do than unpaid Microsoft/ECMA/ISO homework )

note 1: error not fixed in BRM

note 2: this is the legacy clause numbering, the definitive clause number ( and the whole DIS 29500 beast ) is still unknown and was not published for review

————

Part 4: 2.2.1 background (Document Background) reads:

“themeTint (Border Theme Color Tint)

Specifies the tint value applied to the supplied theme color (if any) for this background.

If the themeTint is supplied, then it is applied to the RGB value of the theme color (from the theme part) to determine the final color applied to the document’s background.

The themeTint value is stored as a hex encoding of the tint value (from 0–255) applied to the current border.

[Example: Consider a tint of 60% applied to a border in a document. This tint is calculated as follows:

Sxml = 0.4 * 255 = 102 = 66(h)

The resulting themeTint value in the file format would be 66. end example]”

Error: the example says “a tint of 60%”, but the formula shows 0.4. All formulas in OOXML should be reviewed and should provide correct numbers.

————
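To make the mismatch concrete, here is a small sketch of the hex encoding in the quoted example. The helper name is made up, and rounding to the nearest integer is an assumption; the quoted text only shows the multiplication by 255 and the hex result.

```python
def tint_to_hex(fraction):
    """Encode a 0..1 fraction as a two-digit hex byte, per the quoted formula."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return format(round(fraction * 255), "02X")

print(tint_to_hex(0.4))  # the factor used in the formula
print(tint_to_hex(0.6))  # the 60% named in the example's wording
```

The first call gives 66, the value printed in the example, and the second gives 99. Which of the two the specification intends is exactly what the text leaves unclear.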

This is one of many found ( in a couple of hours of work, because I have better things to do than unpaid Microsoft/ECMA/ISO homework )

note 1: error not fixed in BRM

note 2: this is the legacy clause numbering, the definitive clause number ( and the whole DIS 29500 beast ) is still unknown and was not published for review:

Part 4: 2.7.3.17 style (Style Definition) reads:

“…General style properties refers to the set of properties which can be used regardless of the type of style; for example, the style name, additional aliases for the style, a style ID (used by the document content to refer to the style), if style is hidden, if style is locked, etc

Above the formatting information specific to this style type are a set of general style properties which define information shared by all style types. end example]”

So, a reference could be given in 2.7.3.17 to 2.7.3 to avoid unnecessary duplication of text.

Given the size of DIS 29500, and for the sake of clarity and ease of reading and understanding of the specification, all examples of this kind ( duplicated, or of poor value to the reader ) should be reviewed, re-evaluated and eventually removed from DIS 29500.

A final text incorporating these edits should be submitted to NBs for review.

This is one of many found ( in a couple of hours of work, because I have better things to do than unpaid Microsoft/ECMA/ISO homework )

note 1: error not fixed in BRM

note 2: this is the legacy clause numbering, the definitive clause number ( and the whole DIS 29500 beast ) is still unknown and was not published for review

Part 4, Section 4.2.3 ext (Extension) reads:

“This element specifies an extension that is used for future extensions to the current version of DrawingML. “

The reference to DrawingML seems extraneous to this section ( Section 4 is about PresentationML ).

If this is a typographical error it should be corrected, otherwise the definition of the “ext” element should be clarified.

Errors like this seem to show that this DIS 29500 has been subject to a furious and rushed abuse of “copy and paste”, improper for a text that expects to be awarded the ISO brand. All the text should be carefully reviewed and errors of this kind corrected.

ISO should warn standards organizations that submit text with so many editorial and technical errors, because this shows a lack of respect for the ISO national body members who must review such gobbledegooked text.

“This element specifies an extension that is used for future extensions to the current version of DrawingML. This allows for the specifying of currently unknown elements in the future that will be used for later versions of generating applications...

Attributes Description

uri (Uniform Resource Identifier): Specifies the URI, or uniform resource identifier that represents the data stored under this tag. The URI is used to identify the correct ‘server’ that can process the contents of this tag.

The possible values for this attribute are defined by the XML Schema token datatype.

The following XML Schema fragment defines the contents of this element:

“The third issue is that, while writing my proposal, I and my reviewers found 13 additional errors in the original specification. However, national bodies were not allowed to submit new comments (and rightly so, otherwise there would have been total chaos). Therefore, there was no way to submit and correct them.”

This is one of many found ( in a couple of hours of work, because I have better things to do than unpaid Microsoft/ECMA/ISO homework )

note 1 : error not fixed in BRM nor ECMA +2200 pages fixes document.

note 2: this is the legacy clause numbering, the definitive clause number ( and the whole DIS 29500 beast ) is still unknown and was not published for review.

Part 4, Section 2.11.17 numFmt (Footnote Numbering Format) reads:

“This element specifies the numbering format which shall be used to determine the footnote or endnote reference mark value for all automatically numbered footnote and endnote reference marks (those without the suppressRef attribute set).”

The definition of the suppressRef attribute was not found in Part 4 nor in any schema ( a full search of the +6000 pages DIS + schemas + +2200 pages proposed dispositions + +500 pages of BRM pages was performed )

Same problem at Part 4, 2.11.18 numFmt ( Endnote Numbering Format ).

DIS 29500 final text ( still not provided by ECMA ) should be reviewed to find all the undefined terms.
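A search like the one described for suppressRef can be partly automated for the schema files. The sketch below is not the search that was actually performed. It assumes the schemas have been unpacked as .xsd files under some directory, and it uses a regular expression rather than a real XML parser, so treat an empty result as a hint to look closer, not as proof.

```python
import re
from pathlib import Path

def attribute_declarations(root, name):
    """List the .xsd files under root that declare an attribute called name."""
    # Matches e.g. <xsd:attribute name="suppressRef" ...> with any namespace prefix.
    pattern = re.compile(
        r"<(?:\w+:)?attribute\b[^>]*\bname=[\"']" + re.escape(name) + r"[\"']"
    )
    hits = []
    for path in sorted(Path(root).rglob("*.xsd")):
        text = path.read_text(encoding="utf-8", errors="replace")
        if pattern.search(text):
            hits.append(path)
    return hits
```

For example, attribute_declarations("schemas", "suppressRef") would return an empty list if no schema declares that attribute, where "schemas" is a placeholder directory name.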

“…This element specifies the direction of the text flow for this paragraph.

If this element is omitted on a given paragraph, its value is determined by the setting previously set at any level of the style hierarchy (i.e. that previous setting remains unchanged). If this setting is never specified in the style hierarchy, then the paragraph shall inherit the text flow settings from the parent section.

[Example: Consider a document with a paragraph in which text should flow bottom to top vertically, and left to right horizontally. This setting would be specified with the following WordprocessingML:

<w:pPr><w:textFlow w:val="btLr"/></w:pPr>

The textFlow element specifies via the btLr value in the val attribute that the text flow should go bottom to top, and left to right. end example]”

The text and examples mention a textFlow element, but the clause is about a textDirection element.

The text and/or examples and/or schema should be corrected.

ECMA should be banned for one year from submitting fast-track DISs, and should be warned not to submit such poor, copied-and-pasted specifications derived from internal Microsoft product documentation.

Joining in the “find a flaw” game with just one of your listed pages (from ECMA 376 part 4). (The following has not been checked against BRM resolutions.)

p. 3997, sec. 5.7.2.13-14:

5.7.2.13 bandFmt (Band Format) This element specifies the formatting band of a surface chart.

5.7.2.14 bandFmts (Band Formats) This element contains a collection of formatting bands for a surface chart indexed from low to high.

A “bandFmt” consists of an index (idx) and a shape property (spPr, which defines things like fill pattern and outline). However, nowhere is it clearly defined how an implementation is to display a “surface chart” from a collection of “formatting bands”.

In particular, the “bandFmts” are used in both 5.7.2.204 (surface3Dchart) and 5.7.2.205 (surfacechart). In the latter (2d) case, I suppose one might guess that the “band formats” are a sequence of shapes to be drawn, one on top of the other, to form a 2d contour chart, but this is not specified explicitly. In the former (3d) case, I have a much harder time trying to infer how the “formatting bands” are to be drawn. From what perspective is the 3d chart drawn, for example…is a scene3d child element (5.1.4.1.26) required to be present in the spPr in this case, and if not what is the default? Apparently not specified. And what in the world should one do if different “formatting bands” of the same chart have different scene3d children!? Even if a scene3d child is present, what 3d perspective (e.g. orthogonal projection?) is used? This is apparently set by 5.1.12.47 (preset camera type), but a cursory inspection of that section reveals perspectives that are grossly underspecified, and seem to be each “defined” largely by a single example image. (For example, “legacy oblique top” and “perspective below” are hardly sufficient to define the precise viewing angles, vanishing points, etcetera. Furthermore, wouldn’t it be better to just define those quantities numerically rather than have a finite number of underspecified presets, much less “legacy” ones? Compare, e.g., how OpenGL perspective cameras are defined by a position and 6 numbers.) Or, for example, what does the “wireframe” boolean attribute (which “specifies the surface chart is drawn as wireframe”, 5.7.2.231), really mean in terms of the visual appearance of the chart? Not explained (and there are multiple reasonable ways to draw 3d wireframes, e.g. as a set of disjoint 3d contour lines, or tessellated into rectangles, or triangles, or…).

One could go on and on… It doesn’t seem possible for two implementations to display surface charts from the same file, especially 3d surface charts, in the same way based only on this specification, without referring to one another’s implementation.

In searching for information on “band formats”, I found another apparent goof:

5.7.2.144 pivotFmts (Pivot Formats) This element contains a collection of formatting bands for a surface chart indexed from low to high.

Except that “pivotFmts” aren’t used in surface charts, they are a child of “chart” (5.7.2.27), and contain a list of “pivotFmt” elements which are a “set of formatting to be applied to the chart that is based on a pivotTable” (5.7.2.143). So 5.7.2.144 seems to be misdescribed (copy-and-pasted from 5.7.2.14?).

It’s a horrifying exercise to go through the ECMA 376 document while pretending it will be your job to decipher and actually implement these features. Even starting at a random page (3997, from Rob’s list), one finds a neverending chain of vagueness and slapdash engineering.

In practice, it seems practically impossible for an implementer to proceed without continually checking how MS Office interprets these Mycenaean scratchings.

This is one of many found ( in a couple of hours of work, because I have better things to do than unpaid Microsoft/ECMA/ISO homework )

note 1 : error not fixed in BRM nor ECMA +2200 pages fixes document.

note 2: this is the legacy clause numbering, the definitive clause number ( and the whole DIS 29500 beast ) is still unknown and was not published for review

Part 4: 2.7 Styles reads:

“Each style defined within a WordprocessingML document requires a style definition. The style definition contains all of the information needed by a consumer to store and display that style within a WordprocessingML document, and is defined using the style element.”

This text is confusing: the consumer doesn’t need to “store” the style in a WordprocessingML document, because he is “consuming” the style from the WordprocessingML document.

The text should be corrected, either mentioning the “producer” or, if the paragraph’s text is only applicable to consumers, deleting the words “store within a WordprocessingML document”.

“Within a WordprocessingML file, styles are predefined sets of table, numbering, paragraph, and/or character properties which can be applied to text within the document.”

This definition is incomplete: according to WordprocessingML, styles could be applied not only to text, but to other WordprocessingML objects, e.g. a table style could be applied to a table with only graphics and no text in each cell. In 2.7.3.17 a different and more appropriate definition is given:

“2.7.3.17 style (Style Definition)

A style is a predefined set of table, numbering, paragraph, and/or character properties which can be applied to regions within a document.”

( first definition: “can be applied to text”, second definition: “can be applied to regions within a document” )

A complete, unified and coherent “style” concept should be given all throughout the text of DIS 29500, Parts 1, Part 2, Part 3, Part 4, or whatever parts remain after the multipart proposal is developed and applied to this document ( don’t rush please ! ).

“Within the context of DrawingML, it must be possible (for considerations to legacy compatibility) to be able to include explicit references to specific shapes within VML Drawing parts.

5.3.2 Basics

Legacy Compatibility is part of the shape definitions and properties of the DrawingML framework.

5.3.2.1 1 legacyDrawing (Legacy Drawing Object)

This element specifies the shape ID for a legacy drawing object. These legacy drawing objects all have a shape ID associated with them that is unique across the entire document. In order to store these legacy shape IDs as well as new shape IDs this legacyDrawing element should be used.

Attributes: spid (Shape ID): Legacy Shape ID that is unique throughout the entire document. Legacy shape IDs should be assigned based on which portion of the document the drawing resides on. The assignment of these ids is broken down into clusters of 1024 values. The first cluster is 1-1024, the second 1025-2048 and so on.”

There are two problems with this text:

i) The first paragraph says “This element specifies the shape ID for a legacy drawing object.” but the same paragraph later says “In order to store these legacy shape IDs as well as new shape IDs this legacyDrawing element should be used.” So it is not clear whether this element specifies shape IDs for legacy drawing objects only, or for both legacy and new (non-legacy) objects.

ii) The 2nd paragraph says “Legacy shape IDs should be assigned based on which portion of the document the drawing resides on” but gives no indication of how to perform this assignment. There are three examples that mention one assignment criterion, but these examples are informative.

The text of 5.3.2 should be reviewed to clarify the definition of the “Legacy Drawing Object”, and to provide precise normative text on how to assign shape IDs.
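To make the cluster arithmetic in the quoted attribute description concrete, here is a minimal Python sketch. It assumes that clusters are numbered from 1 and that the ranges continue in steps of 1024, as "1-1024, 1025-2048 and so on" suggests. The function names `cluster_of` and `cluster_range` are made up for illustration, and nothing here answers the open question of which cluster belongs to which portion of the document.

```python
def cluster_of(spid):
    """Return the 1-based cluster number for a shape ID (assumed numbering)."""
    if spid < 1:
        raise ValueError("shape IDs are assumed to start at 1")
    return (spid - 1) // 1024 + 1


def cluster_range(cluster):
    """Return the (first, last) shape IDs covered by a 1-based cluster number."""
    first = (cluster - 1) * 1024 + 1
    return first, first + 1023


print(cluster_of(1024), cluster_of(1025))  # boundary between clusters 1 and 2
print(cluster_range(2))
```

Even with the arithmetic pinned down, an implementer still has no rule for choosing a cluster, which is the gap described in point ii).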

Errors like this show that DIS 29500 has been subject to a furious and rushed abuse of “copy and paste”, improper for a text that expects to be awarded the ISO brand. All the text should be carefully reviewed and errors of this kind corrected.

ISO should warn standards organizations that submit text with so many editorial and technical errors, because this shows a lack of respect for the ISO national body members who must review such gobbledegooked text.

“…id (East Asian Typography Run ID) Specifies a unique ID which shall b used to link multiple runs containing eastAsianLayout element to each other to ensure that their contents are correctly displayed in the document.

This means that multiple runs which are broken apart due to differences in formatting can be identified as belonging to the same grouping in terms of eastAsianLayout properties, although they are separated into multiple runs of text.

Although there are three runs of content, all three regions shall be combined into a single two lines in one region based on the identical value used in the id attribute for all three runs. end example]”

[ dario:

This is a definition of an attribute of the eastAsianLayout element, but the example references another element, w:asianLayout, which is extraneous to this subclause and to the whole of Part 4.

Errors of this kind are found throughout Part 4. Examples:

end dario ]

“2.4.73 top (Table Cell Top Margin Exception):

This top cell border is specified using the following WordprocessingML:

The top element specifies a three point border of type thinThinThickSmallGap. end example]”

[ dario:

The example says “thinThickThinSmallGap”, but the text says “thinThinThickSmallGap”.

end dario ]

“2.3.3.18 noBreakHyphen (Non Breaking Hyphen Character): This element specifies that a non breaking hyphen character shall be placed at the current location in the run content. A non breaking hyphen is the equivalent of Unicode character 002D (the hyphen-minus), however it shall not be used as a valid line breaking character for the current line of text when displaying this WordprocessingML content.…If this was not desired, the non breaking hyphen character could be specified as follows:

<w:r><w:t>This makes a very very very wordy and deliberately overcomplicated s</w:t><w:nonBreakHyphen/><w:t>entence.</w:t></w:r>

This would display a hyphen character, but would not allow the text to break at that location:

This makes a very very very wordy and deliberately overcomplicated s-entence. end example]”

[ dario:

The definition is about the noBreakHyphen element, but the example contains a <w:nonBreakHyphen/>, which is a different element. Is it a typo?

[Example: Consider a document in which endnotes shall be positioned at the end of the section. The section properties for this section shall be declared as follows:

<w:settings><w:endnotePr><w:pos w:val="endSect" /></w:endnotePr>…</w:settings>

The val attribute is endSect, therefore the position of endnotes is specified to be at the end the section. end example]

Enumeration Value:

sectEnd (Endnotes Positioned at End of Section)”

[ dario:

The text and XML fragment read “endSect” but the enumeration names it “sectEnd”.

Problem detected:

Examples of OOXML markup with invalid XML or with typographical errors in element and attribute names are of little value to the reader and could cause confusion rather than help in understanding this specification. Microsoft first, ECMA second, and then NBs should review all the examples of Part 4 (approx. 5,500) to catch and correct errors of this kind.

Some XML parsing and validating tools are available on the internet that could help in this task ( example: saxon, libxml, Microsoft MSXML, etc. ), some of them at no cost to the user.
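As a rough illustration of how cheap such a check is, here is a minimal sketch using only Python's standard library rather than the tools named above. It tests well-formedness only. The wrapper element and the `urn:example:w` namespace URI are placeholders invented for this sketch so that the `w:` prefix resolves; they are not taken from the specification.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace so the "w:" prefix resolves (an assumption for this sketch).
W_NS = "urn:example:w"


def is_well_formed(fragment):
    """Return True if the fragment parses once wrapped in a root element."""
    wrapped = '<root xmlns:w="%s">%s</root>' % (W_NS, fragment)
    try:
        ET.fromstring(wrapped)
        return True
    except ET.ParseError:
        return False


good = "<w:r><w:t>text</w:t></w:r>"
bad = "<w:r><w:t>text</w:r>"  # missing </w:t>
print(is_well_formed(good), is_well_formed(bad))
```

A check like this would catch broken markup but not a misnamed element such as nonBreakHyphen, which is well-formed XML. Catching that needs validation against the schema, which the tools mentioned above can do.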

ECMA should be banned for one year from submitting fast-track DISs, and should be warned not to submit such poor, copied and pasted specifications derived from internal Microsoft product documentation.

Rob, good work and thank you. Microsoft is like a beautiful gleaming mansion. Unfortunately, a group of homeless squatters have taken up residence and are trashing the place as they bicker among themselves. BTW there are two page 1415s.

@Dario, one way to understand the huge number of spelling errors is that OOXML is too large to spell check. If you load the Word version of Part 4 into Word, it will give you a warning message, telling you that too many spelling errors have been detected and that it must disable spell checking.

And thanks for all the additional examples! I think this gives an important perspective on Microsoft’s BRM claims. Does it really matter if the BRM “resolved” 98.44% of the NB ballot comments, if those comments covered less than 2% of the defects in the text?

All the occurrences of elements and attribute names in normative and informative text of Part 4 should be reviewed, and it must be assured that they match the corresponding schema submitted as DIS 29500’s annexes.

There exist open source validation tools ( saxon, libxml ) that can be used by Microsoft ( free of charge ).

Errors in elements and attribute names are found all throughout normative text of Part 4.

Examples:

Part 4, Section 2.9.10 lvlPicBulletId (Picture Numbering Symbol Definition Reference) reads: “This element specifies a picture which shall be used as a numbering symbol for a given numbering level by referring to a picture numbering symbol definition’s numPictBullet element”

But the numPictBullet element is named “numPicBullet” in clause 2.9.21.

Part 4, Section 2.15.1.18 characterSpacingControl (Character-Level Whitespace Compression) reads: “The characterSpacingControl element has a val attribute value of ‘dontCompress’, which specifies that no character compression shall be applied “

But the dontCompress value is named “doNotCompress” in the annexed schema.

The section 2.15.1.10 names in six contiguous lines of the same page ( page 1118 ) the “Automatically Hyphenate Document Contents When Displayed” element with three different names:

“autoHypehenation” in line 18

“autoHypehnation” in line 20

“autoHyphenation” in line 14

The entire Part 4 should be reviewed to find and correct all the occurrences of element and attribute names, and it must be assured that each is given a unique name throughout Part 4. The name must match the corresponding schema submitted as DIS 29500’s annexes.
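A cross-check of this kind can be sketched in a few lines of Python. The sketch below flags names that do not appear in the schema and suggests the closest schema name. The `schema` list is a hand-typed stand-in for names that would be extracted from the real schema files, the `found` list stands in for names pulled from the prose, and the function name `suspicious_names` is hypothetical.

```python
import difflib


def suspicious_names(names_in_text, schema_names, cutoff=0.8):
    """Map each name missing from the schema to its closest schema name, if any."""
    known = set(schema_names)
    report = {}
    for name in sorted(set(names_in_text) - known):
        report[name] = difflib.get_close_matches(name, schema_names, n=1, cutoff=cutoff)
    return report


# Stand-in data based on the examples quoted in this thread.
schema = ["autoHyphenation", "numPicBullet", "doNotCompress", "noBreakHyphen"]
found = ["autoHypehenation", "autoHypehnation", "autoHyphenation",
         "numPictBullet", "dontCompress", "nonBreakHyphen"]

for wrong, suggestion in suspicious_names(found, schema).items():
    print(wrong, "->", suggestion)
```

The comparison is deliberately exact and case-sensitive for the membership test, since a name that differs by one letter or one capital is a different XML name.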

ECMA should be banned for one year from submitting fast-track DISs, and should be warned not to submit such poor, copied and pasted specifications derived from internal Microsoft product documentation.

Thanks, anonymous, for those naming errors. Readers who are not XML practitioners should note that XML names are case-sensitive, so “datastoreItem” and “dataStoreItem” are in fact two different and incompatible names.
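For readers who want to see the case sensitivity in action, here is a small sketch using Python's standard XML parser. The surrounding `root` element is invented for the demonstration.

```python
import xml.etree.ElementTree as ET

# Two sibling elements whose names differ only in the case of one letter.
doc = ET.fromstring("<root><dataStoreItem/><datastoreItem/></root>")

tags = [child.tag for child in doc]
matches = doc.findall("dataStoreItem")  # exact, case-sensitive lookup

print(tags)
print(len(matches))


def parses(text):
    """Return True if the text is well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# An end tag that differs in case does not close the start tag.
print(parses("<dataStoreItem></datastoreItem>"))
```

A consumer looking for one spelling simply never sees the element written with the other.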

Specifies that the anchor location for this object shall not be modified at runtime when an application edits the contents of this document. [Guidance: An application might have automatic behaviors which reposition the anchor for a DrawingML object based on user interaction – for example, moving it from one page to another as needed. This element shall tell applications not to perform any such behaviors. end guidance]

As I understand it, this means that, once you set the “locked” attribute for an inline graphic, the application shall not provide any way to unset it or even to delete the graphic, which doesn’t make much sense. This clause should be written more narrowly to make it clear that the application can change the locked setting as a result of an explicit user indication, at least.

In the same anchor element, there is also a “relativeHeight” attribute:

Specifies the relative Z-ordering of all DrawingML objects in this document. Each floating DrawingML object shall have a Z-ordering value, which determines which object is displayed when any two objects intersect. Higher values shall indicate higher Z-order; lower values shall indicate lower Z-order.

Problem: It doesn’t specify what should be done if two objects have the same Z-ordering value, nor does it state that the Z-ordering values must be distinct.
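A tiny sketch shows why the tie matters. It assumes a consumer that paints objects sorted only by relativeHeight, with ties falling back to whatever order the objects were read in. The object names and heights are invented, and real applications may break ties in some other way, which is exactly the problem.

```python
def paint_order(objects):
    """Return object names from back to front, sorted by relativeHeight only."""
    return [name for name, height in sorted(objects, key=lambda o: o[1])]


# The same three objects, read in two different orders; A and B share a height.
a_first = [("A", 5), ("B", 5), ("C", 1)]
b_first = [("B", 5), ("A", 5), ("C", 1)]

print(paint_order(a_first))
print(paint_order(b_first))
```

Where A and B overlap, the two runs put a different object on top, and both are consistent with the quoted text.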

I searched the proposed BRM resolutions, and neither of these seems to be addressed.

This simple type specifies that its values shall be a 128-bit globally unique identifier (GUID) value.

It further states that:

This simple type’s contents must match the following regular expression pattern: \{[0-9A-F]{8}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{12}\}.

and they give the example {A67AC88A-A164-4ADE-8889-8826CE44DE6E}”
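The quoted pattern can be tried directly. The sketch below uses Python's `re` module with `fullmatch`, since schema patterns must match the whole value; the function name `matches_st_guid` is made up for illustration.

```python
import re

# The pattern exactly as quoted from the ST_Guid definition.
ST_GUID = re.compile(
    r"\{[0-9A-F]{8}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{12}\}"
)


def matches_st_guid(value):
    """Return True if the whole string matches the quoted ST_Guid pattern."""
    return ST_GUID.fullmatch(value) is not None


print(matches_st_guid("{A67AC88A-A164-4ADE-8889-8826CE44DE6E}"))  # the spec's example
print(matches_st_guid("{a67ac88a-a164-4ade-8889-8826ce44de6e}"))  # lowercase hex
print(matches_st_guid("A67AC88A-A164-4ADE-8889-8826CE44DE6E"))    # no braces
```

Note that the pattern only constrains the spelling of the string: uppercase hex, braces required. It says nothing about what number the string denotes.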

First problem: they don’t give any guidance regarding how to interpret this ST_Guid string as a “128-bit” integer value. Presumably it is 32 hexadecimal digits, but the implementor should not need to guess, nor is the byte order indicated. [And elsewhere in the standard, GUID values are manipulated using integer arithmetic, e.g. in section 2.8.1 (Font Embedding) it requires you to “reverse the order of the bytes” of a GUID and “XOR the value” with something.]
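The byte-order ambiguity is easy to demonstrate with Python's standard `uuid` module, which offers two different 16-byte layouts for the same string. Which of these, if either, the specification intends is an open question; the sketch only shows that the choice changes the bytes that an operation like "reverse the order of the bytes" would act on.

```python
import uuid

text = "{A67AC88A-A164-4ADE-8889-8826CE44DE6E}"
guid = uuid.UUID(text)  # the uuid module accepts the braced form

big_endian = guid.bytes       # all fields in big-endian order
mixed_endian = guid.bytes_le  # first three fields little-endian

print(big_endian.hex())
print(mixed_endian.hex())
print(big_endian == mixed_endian)
```

Two implementations that each pick a different layout would compute different results from the same ST_Guid string.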

Second problem: the GUID “shall” be a “globally unique identifier,” but what does it mean to be “globally unique?” Does the implementation have to require that the identifier is unique within the document, within all OOXML documents on the user’s hard disk, or all OOXML documents in the globe? No explanation is given anywhere, nor is any algorithm to generate GUIDs described.

According to Wikipedia, “GUID” is a Microsoft terminology for a specific space of 128-bit identifiers (and specific bit patterns are reserved for use in various Microsoft protocols), and is generated by specific algorithms (some of which apparently create serious privacy problems). Presumably, some version of Microsoft’s GUID “standard” is what is intended for ST_Guid, but as far as I can tell this is never explicitly indicated.

In short, OOXML needs to define what precisely is meant by a “GUID” and how they are to be generated.

Furthermore, ST_Guid is not used consistently throughout Ecma 376. In section 7.6.2.30 (Guid), it defines a b:Guid element that “specifies the GUID of a source” and uses the same format as ST_Guid in the example, but is stored as the “ST_String255 simple type”. OOXML also defines an ST_Clsid (Class ID simple type, sec. 7.4.3.2), which stores a GUID with exactly the same format definition as ST_Guid. Why not use ST_Guid for class IDs, then?

I searched the proposed BRM resolutions and couldn’t find anything on this issue. (I think. Rob, can you provide a link to a PDF of all the BRM resolutions? I’m not sure I’m searching the right thing for the BRM.)

Just for reference, I think I’ve been using different page numbers from you, Rob—I’ve been using the document page number as listed at the bottom of each page, but you are apparently using the PDF page number (= document page number + 7).

Not that this changes the sampling statistics, but it messes up coordination a bit, sorry.

I can’t believe you are obsessing about these issues. ECMA’s got it covered.

1. They’ll either be fixed in maintenance;

2. The features will be deprecated to an appendix (as a Word97 .doc);

3. The binary API documentation will be posted on the web, someplace MS Live Search can find them — eventually.

I believe every complaint you raise can be answered by some combination of 1, 2 or 3.

– – – – – – – – – – – – –

It’s essential you understand the economics of this issue. Failure to approve MS OOXML as proposed will waste all the resources devoted to its acceptance so far. It will literally cause the entire work to be rebuilt from scratch.

Given the economics, MS’s ECMA division will probably have to break up this grand opus into many smaller works, addressing specific facets and subsume ODF specs as a partial fix.

Regarding your point #1, Maintenance. My advice is: if they can’t fix it today, what makes you think they’ll have the will and time to fix it tomorrow?

Regarding your point #3, posting on the web. The fast track would have been faster if they had submitted a 10 page proposal and then just posted the rest someday on MS Live Search.

Regarding your comment on economics. Please take into consideration all those people and companies that will use the proposal if it is approved as a standard. Does their time cost nothing? If the proposal is broken so will the implementations based on it. This will cost others money to implement it and then MS some more to fix it and then again some more money to fix the implementations based on the no longer valid “standard”. As a user I prefer it be only MS ECMA that loses. After all it is MS’s fault it is so poorly written. Others have had proposals approved. Clearly it wasn’t cheap, but they got it right.

Rob and Dario – I am a technical writer, and what you are seeing is the result of a “word processor” … like a food processor, but for text. This is a chop-and-drop document that most companies would have been embarrassed to release as a preliminary draft.

This could have been spell-checked, even in Word, by breaking it into chapters. But they didn’t bother.

The mismatch between functions and definitions could have been checked with very little effort. I have cross-checked technical documents of several hundred pages myself, in under a week. Was ECMA too stingy to hire a few experienced technical writers and editors? Or was Microsoft spending all the petty cash on stuffing the committees with “members” that appeared, voted and were never seen again?

All of my points were just restatements of classic MS replies to valid issues. I just got them out of the way before someone serious rolled them out.

The one valid belief that I have about the process is that the only way to implement any MS specific formats HAS to be a superset of ODF and other ISO or internationally recognized standards. The only parts they (the MS superset) should include will be application specific tags, mapping, decoding and schema, unless it is clearly new and previously undocumented functionality.

The stupidity of not extending ODF for common functions, or contributing to a BETTER vector graphic ML, or using a defined date standard is appalling.

A valid MS friendly format really should have been under 200 pages, simply by incorporating accepted, recognized methods and standards.

Pardon me for asking, but why do we have to bring up the remaining defects in OOXML? The standardization process should be based on the pledge of the initiator of that standard that it is good. That is, by default any proposed standard should be considered bad and not standard-worthy, and the initiator should be required to come up with evidence that proves its worthiness. And should that evidence be unconvincing, the standard should be turned down by ISO.

I had little success finding any Microsoft document bringing up any such evidence. Why is this out of the norm difference? Why aren’t we requiring such a document from the initiator ECMA/Microsoft?

@ TomS, sorry about that. My apologies I was too quick to pass judgment. Your post sounded too much like the “intellectual brilliance” and “Vulcan logic” emerging from Redmond these days. It actually looked like a true Microsoft sponsored post.

I totally agree with your comment. It is something I never actually considered as an option. Maybe because by default I believe Microsoft will want to go its own way rather than share the path.

Heck, it would be so much easier if they did use ODF. On top of that they already have a pretty good MS Office to ODF converter. It’s called Open Office. So even that would be covered.

As a side note the ODF (as published May 1 2005) is 706 pages long (minus 30 for index). That makes OOXML about 9 times bigger. Now given Rob’s comment on three different ways to show font color we can think there are three ways to do everything. That still makes OOXML 3 times bigger (9/3=3). Now unless DIS 29500 is published with font size 36, there has got to be something really really wrong about it don’t you think?
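For anyone who wants to check the ratio, here is the arithmetic as a short Python sketch, using the 6,045-page figure from the post and the ODF page counts given in this comment. Whether to subtract the 30 index pages is a judgment call, so both versions are shown.

```python
ooxml_pages = 6045     # original DIS 29500 length, from the post
odf_pages = 706        # ODF page count as given in the comment
odf_index_pages = 30   # index pages as given in the comment

ratio_with_index = ooxml_pages / odf_pages
ratio_without_index = ooxml_pages / (odf_pages - odf_index_pages)

print(round(ratio_with_index, 2))
print(round(ratio_without_index, 2))
```

Either way the result rounds to roughly 9, so the "about 9 times bigger" figure holds.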