Microsoft Office document corruption: Testing the OOXML claims

Summary

In this post I take a look at Microsoft’s claims for robust data recovery with their Office Open XML (OOXML) file format. I show the results of an experiment in which I introduce random errors into documents and observe whether word processors can recover from these errors. Based on these results, I estimate data recovery rates for Word 2003 binary, OOXML and ODF documents, as loaded in Word 2007, Word 2003 and OpenOffice.org Writer 3.2.

My tests suggest that the OOXML format is less robust than the Word binary or ODF formats, with no observed basis for the contrary Microsoft claims. I then discuss the reasons why this might be expected.

The OOXML “data recovery” claims

I’m sure you’ve heard the claim stated, in one form or another, over the past few years. The claim is that OOXML files are more robust and recoverable than Office 2003 binary files. For example, the Ecma Office Open XML File Formats overview says:

Smaller file sizes and improved recovery of corrupted documents enable Microsoft Office users to operate efficiently and confidently and reduces the risk of lost information.

Those are just four examples of a claim that has been repeated dozens of times.

There are many kinds of document errors. Some errors are introduced by logic defects in the authoring application. Some are introduced by other, non-editor applications that might modify the document after it was authored. And some are caused by failures in data transmission and storage. The Sinofsky press release gives some further detail about exactly what kinds of errors are more easily recoverable in the OOXML format:

With more and more documents traveling through e-mail attachments or removable storage, the chance of a network or storage failure increases the possibility of a document becoming corrupt. So it’s important that the new file formats also will improve data recovery–and since data is the lifeblood of most businesses, better data recovery has the potential to save companies tremendous amounts of money.

So clearly we’re talking here about network and storage failures, and not application logic errors. Good, this is a testable proposition then. We first need to model the effect of these errors on documents.

Modeling document errors

Let’s model “network and storage failures” so we can then test how OOXML files behave when subjected to these types of errors.

With modern error-checking file transfer protocols, the days of transmission data errors are a memory. Maybe 25 years ago, with XMODEM and other transfer mechanisms, you would see randomly introduced transmission errors in the body of a document. But today the more likely problem is truncation: missing the last few bytes of a file transfer. This can happen for a variety of reasons, ranging from logic errors in application-hosted file transfer support to user-induced errors, such as removing a USB memory stick with uncommitted data still in the file buffer. (I remember debugging a program once that had a bug where it would lose the last byte of a file whenever the file was an exact multiple of 1024 bytes.) These types of errors can be particularly pernicious with some file formats. For example, the old Lotus WordPro file format stored the table of contents for the document container at the end of the file. This was great for incremental updating, but particularly bad for truncation errors.

For this experiment I modeled truncation errors by generating a series of copies of a reference document, each copy truncating an additional byte from the end of the document.
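The truncation model described above can be sketched in a few lines. This is an illustrative Python version, not the Java code used for the tests; the function name, the output file naming scheme and the default of 32 copies are assumptions made for the example.

```python
from pathlib import Path


def make_truncated_copies(reference: Path, out_dir: Path, max_bytes: int = 32) -> list[Path]:
    """Write copies of `reference` with 1..max_bytes bytes removed from the end."""
    data = reference.read_bytes()
    if max_bytes >= len(data):
        raise ValueError("reference file is too small for the requested truncation")
    out_dir.mkdir(parents=True, exist_ok=True)
    copies = []
    for n in range(1, max_bytes + 1):
        # Hypothetical naming scheme: reference-trunc01.docx, reference-trunc02.docx, ...
        target = out_dir / f"{reference.stem}-trunc{n:02d}{reference.suffix}"
        target.write_bytes(data[:-n])  # drop the last n bytes
        copies.append(target)
    return copies
```

Each returned path is one test document, to be opened by hand in each word processor.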

The other class of errors (“storage errors”, as Sinofsky calls them) can come from a variety of hardware-level failures, including degeneration of the physical storage medium or mechanical errors in the storage device. The unit of physical storage, and thus of physical damage, is the sector. For most storage media the size of a sector is 512 bytes. I modeled storage errors by creating a series of copies of a reference document, and for each one selecting a random location within that document and then introducing a 512-byte run of random bytes.
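As a rough illustration of this sector-damage model, here is a minimal Python sketch (again, not the Java code used for the tests). The function name and the choice to return the offset alongside the damaged bytes are assumptions; the 512-byte size comes from the description above.

```python
import random

SECTOR_SIZE = 512  # sector size used in the post


def damage_sector(data: bytes, rng: random.Random, size: int = SECTOR_SIZE) -> tuple[bytes, int]:
    """Overwrite one randomly placed run of `size` bytes with random bytes.

    Returns the damaged copy and the offset where the damage starts.
    """
    if len(data) < size:
        raise ValueError("document is smaller than one sector")
    offset = rng.randrange(len(data) - size + 1)  # any offset, not sector-aligned
    noise = bytes(rng.getrandbits(8) for _ in range(size))
    return data[:offset] + noise + data[offset + size:], offset
```

Passing a seeded `random.Random` makes a test series reproducible, which helps when a particular damaged file needs to be regenerated later.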

The reference document I used for these tests was Microsoft’s whitepaper, The Microsoft Office Open XML Formats. This is a 16-page document with a title page with logo, a table of contents, a running text footer, and a text box.

Test Execution

I tested Microsoft Word 2003, Word 2007 and OpenOffice.org 3.2. I attempted to load each test document into each editor. Since corrupt documents have the potential to introduce application instability, I exited the editor between each test.

Each test outcome was recorded as one of:

Silent Recovery: The application gave no error or warning message. The document loaded, with partial localized corruption, but most of the data was recoverable.

Prompted Recovery: The application gave an error or warning message offering to recover the data. The document loaded, with partial localized corruption, but most of the data was recoverable.

Recovery Failed: The application gave an error or warning message offering to recover the data, but no data was able to be recovered.

Failure to load: The application gave an error message and refused to load the document, or crashed or hung attempting to load it.

The first two outcomes were scored as successes, and the last two were scored as failures.
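The scoring rule can be written down compactly. The sketch below is only an illustration of the rule as stated; the enum and function names are made up for the example, and no real test results are embedded in it.

```python
from enum import Enum


class Outcome(Enum):
    SILENT_RECOVERY = "Silent Recovery"
    PROMPTED_RECOVERY = "Prompted Recovery"
    RECOVERY_FAILED = "Recovery Failed"
    FAILURE_TO_LOAD = "Failure to load"


# The first two outcomes count as successes, the last two as failures.
SUCCESSES = {Outcome.SILENT_RECOVERY, Outcome.PROMPTED_RECOVERY}


def recovery_rate(outcomes) -> float:
    """Fraction of recorded outcomes that were scored as successes."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("no outcomes recorded")
    return sum(o in SUCCESSES for o in outcomes) / len(outcomes)
```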

Results: Simulated File Truncation

In this series of tests I took each reference document (in DOC, DOCX and ODT formats) and created 32 truncated files corresponding to 1-32 bytes truncation. The results were the same regardless of the number of bytes truncated, as in the following table:

[Table not preserved: truncation results by file format and application]

Results: Simulated Sector Damage

In these tests I created 30 copies of each reference document and introduced a random 512-byte run of random bytes, with the following summary results:

[Table not preserved: sector damage results by file format and application]

Discussion

First, what do the results say about Microsoft’s claim that the OOXML format “improves…data recovery…beyond what’s possible with Office 2003 binary files”? A look at the above two tables brings this claim into question. With truncation errors, all three word processors scored 100% recovery using the legacy binary DOC format. With OOXML the same result was achieved only with Office 2007. But both Office 2003 and OpenOffice 3.2 failed to open any of the truncated documents. With the simulated sector-level errors, all three tested applications did far better recovering data from legacy DOC binary files than from OOXML files. For example, Microsoft Word 2007 recovered 83% of the DOC files but only 47% of the OOXML files. OpenOffice 3.2 recovered 90% of the DOC files, but only 37% of the OOXML files.

In no case, of almost 200 tested documents, did we see the data recovery of OOXML files exceed that of the legacy binary formats. This makes sense if you consider it from an information-theoretic perspective. ZIP compression shrinks the document, but in doing so it makes the byte stream denser in terms of information encoding. The number of physical bits per information bit is smaller in the ZIP than in the uncompressed DOC file. (In the limit of perfect compression, this ratio would be 1-to-1.) Because of this, a physical error of 1-bit introduces more than 1-bit of error in the information content of the document. In other words, a compressed document, all else being equal, will be less robust, not more robust, to “network and storage failures”. Because of this it is extraordinary that Microsoft so frequently claims that OOXML is both smaller and more robust than the binary formats, without providing details of how they managed to optimize these two opposing qualities.
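The effect is easy to see on a toy example. The following Python sketch flips a single bit in a plain byte string and in its DEFLATE-compressed form (the compression method commonly used inside ZIP), then counts how many bytes of the original no longer come back at the same position. The sample text, the helper names and the position-by-position comparison are all assumptions for illustration; the comparison is crude, and this is not a measurement of real documents.

```python
import zlib


def deflate(data: bytes) -> bytes:
    """Raw DEFLATE stream, without zlib or gzip framing."""
    c = zlib.compressobj(9, zlib.DEFLATED, -15)
    return c.compress(data) + c.flush()


def inflate_leniently(stream: bytes, chunk: int = 64) -> bytes:
    """Decompress in small chunks and keep whatever decoded before any error."""
    d = zlib.decompressobj(-15)
    out = bytearray()
    try:
        for i in range(0, len(stream), chunk):
            out += d.decompress(stream[i:i + chunk])
        out += d.flush()
    except zlib.error:
        pass  # the stream became invalid; keep the partial output
    return bytes(out)


def flip_bit(data: bytes, offset: int, bit: int = 0) -> bytes:
    damaged = bytearray(data)
    damaged[offset] ^= 1 << bit
    return bytes(damaged)


def bytes_lost(original: bytes, recovered: bytes) -> int:
    """Bytes of `original` that are missing or different at the same position."""
    intact = sum(a == b for a, b in zip(original, recovered))
    return len(original) - intact


# Made-up sample text standing in for document content.
sample = " ".join(f"Paragraph {i} of a sample document." for i in range(500)).encode()
packed = deflate(sample)

raw_loss = bytes_lost(sample, flip_bit(sample, len(sample) // 2))
packed_loss = bytes_lost(sample, inflate_leniently(flip_bit(packed, len(packed) // 2)))
print(f"uncompressed: {raw_loss} byte(s) affected by a 1-bit error")
print(f"compressed:   {packed_loss} byte(s) affected by a 1-bit error")
```

In the uncompressed case the damage stays confined to the one byte that was hit. In the compressed case the decoder typically loses its place in the stream, so the damage spreads well beyond the flipped bit.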

Although no similar claims have been made regarding ODF documents, I tested them as well. Since ODF documents are compressed by ZIP, we would expect them to also be less robust to physical errors than DOC, for the same reasons discussed above. This was confirmed in the tests. However, ODF documents exhibited a higher recovery rate than OOXML. Both OpenOffice 3.2 (60% versus 37%) and Word 2007 (60% versus 47%) had higher recovery rates for ODF documents. If all else had been equal, we would have expected ODF documents to have lower recovery rates than OOXML. Why? Because the ODF documents were on average 18% smaller than the corresponding OOXML documents, so the fixed 512-byte sector errors had a proportionately larger impact in ODF documents.

The above is explainable if we consider the general problem of random errors in markup. There are two opposing tendencies here. On the one hand, the greater the ratio of character data to markup, the more likely it will be that any introduced error will be benign to the integrity of the document, since it will most likely occur within a block of text. At the extreme, a plain text file, with no markup whatsoever, can handle any degree of error introduction with only proportionate data corruption. However, one can also argue in the other direction, that the more encoded structure there is in the document, the easier it is to surgically remove only the damaged parts of the file. However, we must acknowledge that physical errors, the “network and storage failures” that we looked at in these tests, do not respect document structure. Certainly the results of these tests call into question the wisdom of claiming that the complexity of the document model leads it to be more robust. When things go wrong, simplicity often wins.

Finally, I should observe that application differences, as well as file format differences, play a role in determining success in recovering damaged files. With DOC files, OpenOffice.org 3.2 was able to read more files than either version of Microsoft Word. This confirms some of the anecdotes I’ve heard that OpenOffice will read files that Word will not. With OOXML files, however, Word 2007 did best, though OpenOffice fared better than Word 2003. With ODF files, both Word and OpenOffice scored the same.

Further work

Obviously the question of document file robustness is a complex one. These tests strongly suggest that there are real differences in how robust document formats are with respect to corruption, and these observed differences appear to contradict claims made in Microsoft’s OOXML promotional materials. It would require more tests to demonstrate the significance and magnitude of those differences.

With more test cases, one could also determine exactly which portions of a file are the most vulnerable. For example, one could make a heat map visualization to illustrate this. Are there any particular areas of a document where even a 1-byte error can cause total failures? It appears that a single-byte truncation error on OOXML documents will cause a total failure in Office 2003, but not in Office 2007. Are there any 1-byte errors that cause failure in both editors?
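A methodical sweep of this kind could look roughly like the sketch below. The `loads_ok` callback is hypothetical: it stands in for whatever check decides that a damaged copy still loads, which in these tests was a person opening the file in a word processor. The resulting list of (offset, loaded) pairs is the raw data a heat map would be drawn from.

```python
from typing import Callable


def vulnerability_map(
    data: bytes,
    loads_ok: Callable[[bytes], bool],  # hypothetical stand-in for "the application loaded it"
    step: int = 1,
) -> list[tuple[int, bool]]:
    """Damage one byte at a time and record whether the document still loads."""
    results = []
    for offset in range(0, len(data), step):
        damaged = bytearray(data)
        damaged[offset] ^= 0xFF  # invert the byte so it is guaranteed to change
        results.append((offset, loads_ok(bytes(damaged))))
    return results
```

For a real word processor each call to `loads_ok` is slow, so a larger `step` or a sample of offsets would probably be needed.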

We also need to remember that neither OOXML nor ODF is a pure XML format. Both formats involve a ZIP container file with multiple XML files and associated resources inside. So document corruption may consist of damage to the directory or compression structures of the ZIP container as well as errors introduced into the contained XML and other resources. The directory of the ZIP’s contents is stored at the end of the file. So the truncation errors are damaging the directory. However, this information is redundant, since each undamaged ZIP entry can be recovered in a sequential processing of the archive. So I would expect a near perfect recovery rate for the modest truncations exercised in these tests. But with OOXML files in Office 2003 and OpenOffice 3.2, even a truncation of a single byte prevented the document from loading. This should be relatively easy to fix.
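To make the idea of sequential recovery concrete, here is a minimal Python sketch that ignores the directory at the end of the archive and walks the local file headers instead. It is a simplification under stated assumptions: it handles only stored and DEFLATE entries, does not read ZIP64 fields or data descriptors, and relies on the CRC check to reject false signature matches. The function name is made up for the example, and it does not describe how any of the tested applications work.

```python
import struct
import zlib

# Local file header: signature, version, flags, method, time, date,
# CRC-32, compressed size, uncompressed size, name length, extra length.
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")
LOCAL_SIG = b"PK\x03\x04"


def recover_entries(archive: bytes) -> dict[str, bytes]:
    """Recover entries by scanning local headers, without the central directory."""
    recovered = {}
    pos = archive.find(LOCAL_SIG)
    while pos != -1 and pos + LOCAL_HEADER.size <= len(archive):
        (_, _, flags, method, _, _, crc, csize, _, nlen, xlen) = LOCAL_HEADER.unpack_from(archive, pos)
        name_start = pos + LOCAL_HEADER.size
        data_start = name_start + nlen + xlen
        name = archive[name_start:name_start + nlen].decode("utf-8", "replace")
        sizes_deferred = bool(flags & 0x08)  # CRC and sizes stored after the data instead
        payload = None
        if method == 8:  # DEFLATE: the stream marks its own end
            d = zlib.decompressobj(-15)
            try:
                payload = d.decompress(archive[data_start:])
                if not d.eof:
                    payload = None  # stream was cut short
            except zlib.error:
                payload = None
        elif method == 0 and not sizes_deferred:  # stored: size must come from the header
            payload = archive[data_start:data_start + csize]
            if len(payload) != csize:
                payload = None
        if payload is not None and (sizes_deferred or zlib.crc32(payload) == crc):
            recovered[name] = payload
        pos = archive.find(LOCAL_SIG, pos + 4)
    return recovered
```

Because a small truncation only removes bytes from the directory at the end, every entry's local header and data are still present and this scan returns them all.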

Also, the large number of tests with the “Silent Recovery” outcome is a concern. Although the problem in general is solved with digital signatures, there should be some lightweight way, perhaps checking CRCs at the ZIP entry level, to detect and warn users when a file has been damaged. If this is not done, the user could inadvertently work on and resave the damaged document or otherwise propagate the errors, when an early warning would give the user the opportunity, for example, to download the file again or seek another, hopefully undamaged, copy of the document. But when the file is silently recovered and loaded, the user is not made aware of their risky situation.
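A lightweight check along these lines can be sketched with Python's standard zipfile module, which verifies the stored CRC-32 of an entry once that entry has been read to the end. The function name is an assumption, and the sketch presumes the ZIP directory itself is still readable; it shows the kind of warning an application could raise, not what any tested application does.

```python
import zipfile
import zlib


def damaged_entries(path_or_file) -> list[str]:
    """Names of ZIP entries whose data fails to decompress or fails its CRC-32."""
    bad = []
    with zipfile.ZipFile(path_or_file) as zf:
        for info in zf.infolist():
            try:
                with zf.open(info) as entry:
                    # Reading to the end triggers zipfile's CRC-32 comparison.
                    while entry.read(1 << 16):
                        pass
            except (zipfile.BadZipFile, zlib.error, EOFError):
                bad.append(info.filename)
    return bad
```

An empty result means every entry matched its recorded checksum. A non-empty one is the cue to warn the user before the document is edited and resaved.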

Files and detailed results

If you are interested in repeating or extending these tests, here are the test files (including reference files) in DOC, DOCX and ODT formats. You can also download a ZIP of the Java source code I used to introduce the document errors. And you can also download the ODF spreadsheet containing the detailed results.

WARNING: The above ZIP files contain corrupted documents. Loading them could potentially cause system instability and crash your word processor or operating system (if you are running Windows). You probably don’t want to be playing with them at the same time you are editing other critical documents.

Updates

2010-02-15: I did an additional 100 tests of DOC and DOCX in Office 2007. Combined with the previous 30, this gives the DOC files a recovery rate of 92% compared to only 45% for DOCX. With that we have significant results at 99% confidence level.
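The update does not say which statistical test was used. One plausible check, sketched here purely as an assumption, is a two-proportion z-test on the reported rates (92% and 45%), taking 130 documents per format from the 100 additional tests plus the previous 30.

```python
import math


def two_proportion_z(p1: float, n1: int, p2: float, n2: int) -> float:
    """z statistic for the difference between two independent proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / std_err


# Rates as reported in the update; 130 tests per format is inferred from "100 + 30".
z = two_proportion_z(0.92, 130, 0.45, 130)
CRITICAL_99 = 2.576  # two-sided critical value at the 99% confidence level
print(f"z = {z:.2f}, significant at 99%: {abs(z) > CRITICAL_99}")
```

Under these assumptions the statistic lands far above the critical value, which is consistent with the significance claimed in the update.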

Given that, can anyone see a basis for Microsoft’s claims? Or is this more subtle? Maybe they really meant to say that it is easier to recover from errors in an OOXML file, while ignoring the more significant fact that it is also far easier to corrupt an OOXML file. If so, the greater susceptibility to corruption seems to have outpaced any purported enhanced ability of Office 2007 to recover from these errors.

It is like a car with bad brakes claiming that it has better airbags. No thanks. I’ll pass.

There is a flaw in your analysis with regards to random sector corruption. If you assume that a specific number of random sectors on a hard drive will become corrupted, then a smaller file has less of a chance of having one of its sectors corrupted. Thus, even though sector corruption may be more devastating to a particular format, there is less of a chance of corruption in the first place if the file in question is smaller.

Similarly, if the risk of truncation grows as the file size grows, smaller file sizes mean a greater chance that the file will not be truncated.

However, considering the fact that the smaller ODF files had higher recovery rates for the same number of corrupted sectors than OOXML, one can only conclude that factoring in error probability would help ODF more than it would OOXML.

Hi Matthew, I did consider that, and it would certainly be true for a single isolated file on a drive. The chance of any given file being harmed by sector damage will be proportionate to the size of the file. However, documents rarely exist in isolation. There is the economic incentive to run disks near capacity. Otherwise you are overpaying for storage you don’t need. So consider a drive nearly full of documents. If a sector fails, it will damage some document.

That said, what you say is true and valid. But we’re talking about two different things. One is from the file’s perspective, asking “What is the probability that a bad sector on the drive will result in me being unable to load?” And the other is from the user’s perspective, asking “What is the probability that a bad sector on the drive will result in me being unable to load one of my documents?”

As for truncation errors, I’m not sure the rate of errors is proportionate to file size, at least not at this scale. You might argue that a 500MB file is more likely to suffer a timeout error or disconnect compared to a 50KB document. But is there a significant difference between a 100KB file and a 50KB file? In any case I think the more common failure mode involves truncation caused by write-behind buffering, which would be independent of file size, for any file larger than the buffer chunk size.

Another possible criticism is that the above tests do not reflect any interesting interactions between disk sector size and file system cluster size. For example, if cluster size matches sector size, and the file system aligns documents on cluster boundaries, then it will always be the case that sector damage to byte 100 will be correlated to byte 101. Either both will be damaged, or neither will be damaged. But (absent a multi-sector failure) there is no correlation between damage to byte 512 and byte 513, since they will be on different sectors. So a more accurate error injection model would be to align the errors at simulated cluster boundaries. I think that would make more of a difference with a fixed-length record type file format.
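A cluster-aligned variant of the error injection could be sketched as follows. It assumes, as in the example above, that cluster size equals the 512-byte sector size and that the file starts on a cluster boundary; the function name is made up, and this was not part of the tests reported in the post.

```python
import random


def damage_aligned_sector(data: bytes, rng: random.Random, sector: int = 512) -> tuple[bytes, int]:
    """Overwrite one whole sector, chosen at a multiple of `sector`, with random bytes."""
    if not data:
        raise ValueError("empty document")
    sector_count = -(-len(data) // sector)  # ceiling division; last sector may be partial
    start = rng.randrange(sector_count) * sector
    end = min(start + sector, len(data))
    noise = bytes(rng.getrandbits(8) for _ in range(end - start))
    return data[:start] + noise + data[end:], start
```

With this model bytes 100 and 101 are always damaged together or not at all, while bytes 511 and 512 never share a damaged sector.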

This suggests to me that there should be an option in the file format standard to store the file twice, back-to-back (ABCDEABCDE), to eliminate both of these potential sources of error.
The size reduction gained in moving to compressed file formats could be spent on recoverability.

Can you give an approximate size ratio for files in ODF, OOXML and DOC?

Immediately after posting my previous comment, it occurred to me that there will of course(!) be more efficient methods of obtaining reliability than storing 2 identical copies of the file.

Nonetheless, the file format should provide these as an option to users.

Rob, can you put something in ODF 1.2 which says “this file is X Bytes, with an additional Y Bytes of data recovery information” and specify that for ODF 1.2 Y will be zero but in subsequent versions it may be greater than 0?

I think you’re looking at the wrong problem. I would guess the vastly dominant form of document corruption is that caused by implementation errors – bugs in the editor. I find it easy to believe that it’s easier to recover from a bug in an XML format than in a dump-from-memory. In Real Life, people that use XML formats probably actually will encounter far fewer incidents of totally corrupt files, just because bugs in editors will be easier to ignore.

True, this has nothing to do with what Sinofsky said, but he’s a Senior Vice President so you can’t expect him to be in touch with reality. Some manager 4 levels below him was asked to prepare a Powerpoint presentation with talking points about the new format, and the engineers explained the advantage of easier recovery. By the time it got to Sinofsky it was a single context free bullet point that said “improved recovery”, which he used as a theme to wax rhapsodic about without having a clue what that means.

Marketing is saying things that sound good, not correct things. Proving that some marketing claim is incorrect is too easy. It’s much more interesting to check the actual results – what part of actual documents becomes corrupt, and to what degree? I’m guessing that your experiment tells us very little about that, simply because it models a very small part of the prevalent corruption mechanisms.

@Felix, that starts getting into the realm of what we call “Error Correcting Codes” or ECC’s. These provide optimal ways of adding redundant data to a data stream in order to be robust to a given level of noise. But I think the “correct” way of handling this in today’s desktop environment is via digital signatures (or just the hash part) to verify document integrity and then using storage level redundancy (RAID). I don’t mean to suggest we solve this problem in the file format itself. I’m just pointing out the idiocy of Microsoft claiming that they have done so.
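The "just the hash part" of that suggestion is simple to sketch: record a digest when the document is known to be good, then compare before trusting a copy. The function names are made up for the example, and where the expected digest is stored and how it is delivered is left open. Note that this only detects damage; it does not repair it, which is where storage-level redundancy comes in.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Hex digest recorded when the document is known to be good."""
    return hashlib.sha256(data).hexdigest()


def is_intact(data: bytes, expected_digest: str) -> bool:
    """True if the bytes still match the digest recorded earlier."""
    return sha256_of(data) == expected_digest
```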

@Uri, this is far more than the Sinofsky quote in a press release. This claim has been repeated over and over again, including the claim that OOXML has magical recovery properties in the face of “storage failures”.

For example this case study claims: “XML-based documents are less likely to lose valuable information if the file suffers a partial corruption through a transfer or storage failure, for example. This makes the new Open XML format more resilient and reliable than the binary format used by previous Microsoft Office releases. Most files can still be opened if a component within the file is damaged.”

This is not an off-the-cuff improvisation by a single VP. You will find this mantra repeated dozens of times, in press releases, in white papers, in training materials, in notes to ISO NBs, in Microsoft blogs, etc.

“But I think the “correct” way of handling this in today’s desktop environment is via digital signatures (or just the hash part) to verify document integrity and then using storage level redundancy (RAID). I don’t mean to suggest we solve this problem in the file format itself.”

That’s great for business users with RAID servers, but for everybody else an option which said:

Use Safe File Saving (this will increase the size of your documents)? On/Off

You’re right, Microsoft’s story regarding resistance to corruption is a complete fabrication. And it’s not a one-off mistake as I thought, but deliberate misinformation.

I still believe that users of XML formats (ODF, OOXML, etc) encounter much less severe document corruption, though I don’t have data to back that up. Microsoft could have said: “Remember all those documents our buggy software corrupted? Now we have the same amount of bugs but they are less likely to corrupt your files.” But that’s not good marketing, so they aligned on an alternative story about network and storage failures, which is unfortunately a lie. It’s hardly commendable, but the overall claim of less corrupted documents probably stands.

@Uri, Here is something I would believe. By taking their old cruddy DOC reading code, that was an accumulation of a decade of hacking, and rewriting new import/export support for OOXML, and subjecting it to extensive testing, they have an input module that is higher quality and easier to maintain than what they had before. This could lead to less corruption. But this “fresh start” approach has nothing to do with OOXML. I think they would have had the same outcome if they wrote a fresh ODF filter, or even if they rewrote their DOC code.

As for XML formats in general, their modularity is in their abstract model. It is not necessarily in the character stream representation. The characters that follow your element X could be content of X, a sibling of X, a child of X or even stuff further up the document tree than X. So any block-level damage could take a whack out of your document tree that is very inelegant and certainly non-localized. Of course, you could, as OOXML does, move different pieces into different XML files. But that won’t have a real impact if, like OOXML, 80% of the stuff still ends up in one place, e.g., in document.xml, as it does.

But it is an interesting question: if you wanted to design an XML format to be resistant to certain kinds of damage, what would be your design points?

@Felix, it is an interesting idea, but I’d still solve it at a level higher than the XML. Why? Because an ODF document can contain other resources, like image files, that are not in XML. But maybe using Error Correcting Codes at the ZIP level would work?

Although it was not the point of the experiment, I think your tests show that there is room for improvement in OpenOffice’s recovery code. Microsoft Office was able to recover the OOXML documents in 100% of the Simulated File Truncation cases, while OpenOffice 3.2 failed in 100% of the cases. Similarly, Microsoft Office was able to recover 47% of the Simulated Sector Damage cases, but OpenOffice 3.2 was only 37% successful. Did you submit a report (or two) to the OpenOffice.org Issue Tracker?

OpenOffice gets the same result as Office 2003 in truncation tests, a big 0%. Presumably they just need to use a more robust ZIP routine that scans the entries serially if the directory at the end is corrupted. On the sector damage, OpenOffice gets better results than Microsoft Office 2003.

In any case, to your question, no I have not entered a defect report on these issues. Since I do not use OOXML myself, the fact that OOXML documents are so easily damaged in irrecoverable ways is not my problem. However, I have posted the test cases and code for reproducing these results. The vendors are welcome to deal with this bug as they wish. I’d mainly hope that Microsoft would stop making baseless and false robustness claims that are easily refuted. One can hope, at least.

>> As for XML formats in general, their modularity is in their abstract model. It is not necessarily in the character stream representation. The characters that follow your element X could be content of X, a sibling of X, a child of X or even stuff further up the document tree than X. So any block-level damage could take a whack out of your document tree that is very inelegant and certainly non-localized.

To be sure I read correctly, this discussion is in reply to Uri but does not correlate to the tests performed, right? In other words, there was no block level testing of XML, right? .. meaning the block damage was done against zipped files only, right? [Testing unzipped ODF/OOXML would depend on how the file system would store related but distinct XML files, and this could vary quite a bit (over time and over different file systems).]

While on this subject…

I would like to see the effects of damages to individual XML files for ODF and for OOXML. Though I am much more familiar with ODF than OOXML, there is much I don’t know about either of these formats. What I am after is to see what percentage of localized errors (like the random block data test) have devastating effects on these two formats.

I wonder because I think I had heard that a greater percentage of OOXML is “delta-based”, meaning that the effect of errors “early” in the stream would compound afterward at a faster rate than “nondelta-based” (but still nested) XML.

Besides the statement that such a result would make about a level of robustness of each format, I’m also thinking that it might be more difficult for third parties to keep up with a proprietary OOXML implementation (eg, a dominating implementation like MSOffice20xx) than it would be a proprietary ODF implementation based on “bugs” found respectively within such closed source implementations. Looked at from a different angle, if the results are as I imagine is the case, then it would be easier to keep a closed-source OOXML implementation from being matched sufficiently well by competing third parties (than would be the case for ODF) through the careful insertion of “bugs” that disagree with the spec (or that leverage ambiguous or missing components of the spec).

Regardless of which is easier to “manipulate”, OOXML or ODF, knowing that answer may help third parties better decide how to address a closed source market leader (in OOXML or in ODF).

I am not interested in random disk errors to files. I am thinking that modeling in some way random errors at the XML level (perhaps crudely simulated with random localized errors to XML files) models deviations from “spec” based on bugs. Syntax deviations are like random errors on the file stream. Semantic deviation might function similarly or not depending on the details.

[I used quotes around “bugs” to suggest that certain bugs might not be accidental or, otherwise, might be allowed to exist by a market dominant implementer since these bugs could promote profit margins through their effect of making it more difficult for third parties to reproduce these buggy effects in their competing products.]

It could be useful to identify which format is more amenable to having X changes in semantics/syntax have a more compounding effect more likely to lead to gibberish by third party products not able to reproduce these changes correctly. Intuition led (leads?) me to think that a “delta” representation of data is less robust to changes (“errors”) in the stream, all else being equal. Maybe I need to think about this more, but the idea is not too unlike what you mentioned (or implied) about the problem with “silent” errors going into a document and how these might surface as time goes on, perhaps having a compounding effect as well (through repeated document manipulations, savings, openings, etc).

@Jose_X, correct, my tests were block errors at the ZIP package level. However, these usually translate into block-level errors in the XML. You can see that in the test files I posted. I think as documents get larger this becomes even more true. This is because, except for directory and compression dictionary areas (which are particularly vulnerable), a contiguous portion of the ZIP will usually decompress into a contiguous portion of content, especially if the block size is much less than the size of the main XML file.

To your other point, I’ve done tests like that before for other formats, but more from the perspective of testing the app. For example, when I worked on the initial port of Xalan XSLT engine to the C++ version I used similar techniques to introduce random perturbations of XSLT scripts to see whether these errors would be handled gracefully by the code. That was easy enough, since the engine itself was command-line driven and you could wrap it all in a WIN32 debugger session to catch and recover from all memory faults, etc. Doing the same for a GUI app like MS Office or OpenOffice in theory is possible. You could even be methodical and introduce a block error starting at every successive offset in the file and in this way map out exactly which parts of the document are vulnerable.

In terms of errors introduced by application bugs, you need to set the criteria carefully. It is a good thing, IMHO, if a random error makes a valid document invalid since that allows errors to be identified by common tools, e.g., schema validators. But we also want applications to be robust to errors, so users don’t lose their work. Most of the errors I detected were things that should be recoverable by using a robust ZIP library and a robust XML parser. Until the apps handle those more surface issues better, I don’t think testing is going to tell us much about the robustness of the underlying document models. Or, maybe one could eliminate the apps from the test altogether and simply test the models in isolation, with simpler command-line apps that merely try to interpret the document, resolve all references to content and styles, etc., but don’t attempt to render anything?

First, that doesn’t apply in the truncation case because that happens at the zip level and does not affect the XML inside for small truncations (Rob stated this was because the redundant zip toc? is what lies at the end of the file).

Second, for cases where the XML is likely affected, it’s unclear from the data and discussion above (iirc) whether an error for ODF or for OOXML was a fatal error and, if so, whether that was the exact (or part of the) reason the application failed.

If it turns out that the failures were solely because of the XML standard but were otherwise recoverable, then this is something to be considered carefully. However: I’m guessing that if what was lost/corrupted was solely text or various other types of data, then the error might not be defined by the standard as a fatal error. I’m guessing a fatal error would include damage to the information defining the XML structure, and in most of these cases, recovery might not be possible, regardless of what the standard says.

To criticize this aspect of the standard, we should be specific (point to specific mentions of “fatal error”), and then come up with examples that show the fatal error requirement is unwise in such a case (eg, because faithful recovery would have been possible and hence desirable). Not taking these steps leaves us hand waving and possibly dwelling on a non-issue.

You said: “Because of this, a physical error of 1-bit introduces more than 1-bit of error in the information content of the document.” I think that is not true, as you perfectly say in the previous sentence. I think you wanted to say:
“Because of this, a physical error of 1-bit introduces a bigger error in the information content of the document in the ZIP than in the uncompressed DOC file.”