So where are all the OOXML documents?

Google has a nice feature that allows you to search for documents that match a given file type. This is done by adding “filetype:NNN” to your query, where NNN corresponds to the file type. This feature has supported the ODF and OOXML document formats for at least two months, when I first noticed it. I’ve been tracking some numbers since then and now have enough data to make some observations.
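As a small illustration of the query syntax described above, here is a hedged Python sketch that builds the search strings for each format. The helper names are hypothetical, and the sketch only constructs queries and URLs; the hit counts themselves were read off the results page by hand.

```python
from urllib.parse import urlencode

# File extensions tracked in this post, grouped by standard.
FORMATS = {
    "ODF": ["odt", "ods", "odp"],
    "OOXML": ["docx", "xlsx", "pptx"],
}


def build_query(extension, site=None):
    """Return a query using the filetype: operator, optionally limited with site:."""
    terms = [f"filetype:{extension}"]
    if site:
        terms.append(f"site:{site}")
    return " ".join(terms)


def build_search_url(extension, site=None):
    """Return a Google search URL for the query (hypothetical helper)."""
    return "https://www.google.com/search?" + urlencode({"q": build_query(extension, site)})


for standard, extensions in FORMATS.items():
    for extension in extensions:
        print(standard, build_query(extension))
```

Passing a `site` value produces a domain-restricted query such as `filetype:docx site:microsoft.com`.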

At last count the totals were:

Format         Count
ODT           85,200
ODS           20,700
ODP           43,400
Total ODF    149,300
DOCX             471
XLSX              63
PPTX              69
Total OOXML      603

As you can see, the larger counts appear to be rounded. Perhaps at the high end the counts are estimates based on sampling?

In any case, I am rather surprised by the low counts given for OOXML documents, especially considering that this format has been supported since the Office 2007 beta last summer. According to Brian Jones, there have been over 4 million downloads of the OOXML Compatibility Pack for older versions of Office, and there is a new community of “over 300 other companies and partners who care deeply about OpenXML”. We’re also told that Office 2007 sales are above expectations, “two times greater than the purchases of Office 2003” according to one research firm. Recently announced third-quarter results for Microsoft showed “better than expected” results for Office 2007 sales, $200 million better, according to Microsoft CFO Chris Liddell.

So with all this evident love for Microsoft Office 2007, why is it that six months later there are only 63 OOXML spreadsheet documents on the web, something like 0.3% of the number of ODF spreadsheet documents? How can there be 300 companies supporting OOXML and only 69 OOXML presentations on the web? (This is starting to sound like when I say I support 30 minutes of aerobic exercise a day. I don’t do it, but I sure support it!)
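For readers who want to check the ratios, here is a minimal Python sketch that recomputes them from the totals in the table above. The variable names are my own; the counts are the ones reported in this post.

```python
# Hit counts from the table above (Google filetype: searches).
counts = {
    "ODT": 85_200, "ODS": 20_700, "ODP": 43_400,
    "DOCX": 471, "XLSX": 63, "PPTX": 69,
}

# Each OOXML format paired with its ODF counterpart.
pairs = [("DOCX", "ODT"), ("XLSX", "ODS"), ("PPTX", "ODP")]

odf_total = sum(counts[name] for name in ("ODT", "ODS", "ODP"))
ooxml_total = sum(counts[name] for name in ("DOCX", "XLSX", "PPTX"))

# OOXML count as a fraction of the matching ODF count.
ratios = {ooxml: counts[ooxml] / counts[odf] for ooxml, odf in pairs}

for ooxml, odf in pairs:
    print(f"{ooxml} vs {odf}: {ratios[ooxml]:.2%}")
print(f"All OOXML vs all ODF: {ooxml_total / odf_total:.2%}")
```

The spreadsheet ratio comes out at about 0.3%, matching the figure quoted above.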

OK, I know the argument about “dark matter”, that Google indexes only the tip of the iceberg, that there is a lot of data squirreled away on PC hard drives, behind corporate firewalls, etc., stuff that Google will never see. But the same is equally true for ODF documents, right? I have tons of ODF documents on my laptop, but none of them are indexed by Google.

Of course ODF has been around for a year longer than OOXML. That’s an important fact to acknowledge. We can put that in perspective by plotting the ODF and OOXML document counts against the number of days since adoption of the two standards. ODF counts are measured from 1 May 2005 and OOXML counts from 7 December 2006, when OASIS and Ecma respectively approved them. You get this:

As you can see, ODF has a nice upward trend. OOXML is also trending upwards, though it is somewhat lost at this scale. If you do the analysis it comes out to around 300 new ODF documents per day versus 6 for OOXML. So, two years later, ODF adoption, in terms of documents per day, is 50 times greater than OOXML’s, at a time that should be OOXML’s high-growth period, considering all the great news that is coming out of Redmond.

So I’m somewhat at a loss to appreciate the significance of Novell or Corel adding OOXML support to their editors. With only 63 OOXML spreadsheets out there, wouldn’t it be cheaper just to hire someone to retype the documents in the destination application? The average user is more likely to find a Buffalo Nickel in their lunch change than to find an OOXML document outside of captivity.

As you (among others) have pointed out, adoption of document formats does follow a network effect: the more of them are out there, the more the format will be supported by apps, and the more of them will be created.

Given that, it’s not a huge leap to assume that adoption might follow an exponential curve to begin with (possibly turning into a horizontally-stretched S eventually) which would completely account for a low rate of increase early on, and a higher rate of increase later, when there has been more time for adoption and more documents already exist.

I’m very much in favour of proving ODF’s superiority, but I don’t think this proves anything … yet.

(One interesting extrapolated stat to consider from this though – MS claims that there are “billions” of .doc files out there. Google counts 34,900,000. So, if Google is counting 1/30 of existing documents, based on a conservative estimate of “billions” meaning “1.something billions” in MS market-speak, then there are at least 4,500,000 ODF documents out there. Not too bad! :)
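The back-of-the-envelope extrapolation above can be written out as a short Python sketch. The factor of 30 is the comment’s own rough assumption about how much of the document universe Google indexes, not a measured value, and the variable names are mine.

```python
# Figures quoted in the comment above.
google_doc_count = 34_900_000   # .doc files reported by Google
odf_indexed = 149_300           # total ODF documents reported by Google

# Assumption from the comment: Google sees roughly 1 in 30 existing documents.
scale_factor = 30

# Total .doc population this assumption implies ("1.something billion").
implied_doc_total = google_doc_count * scale_factor

# Applying the same factor to ODF gives the comment's estimate.
estimated_odf_total = odf_indexed * scale_factor

print(f"Implied .doc total:  {implied_doc_total:,}")
print(f"Estimated ODF total: {estimated_odf_total:,}")
```

The estimate is only as good as the assumption that ODF and .doc files are indexed in the same proportion.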

We also need to remember that MS Office has (by some counts) 97% market share, and Office 2007, which is beating all sales expectations, has OOXML as the out-of-the-box default format. So regardless of how you project out the curve, the present numbers for OOXML are pretty dismal.

I’ll continue to track the numbers and report every few months. Since we don’t have historic numbers for ODF to look back at, how we got to 150,000 ODF documents is uncertain. It would be interesting to see the curve for download counts for OpenOffice & KOffice. It probably tracks that curve. I’d be very surprised if OpenOffice & KOffice downloads are on the flat upper side of the S-curve. I think ODF and alternate office suites are still in the early period of their adoption and greatest growth is yet to come.

“I’d be very surprised if OpenOffice & KOffice downloads are on the flat upper side of the S-curve.”

Oh crumbs no! I didn’t mean to give that impression at all. I was just pointing out that adoption of a document format cannot increase exponentially *forever*, and meant that *when a format gets a significant (> 2/3?) market share* the rate of adoption will tail off.

However, while document format adoption can’t follow an exponential curve forever, it probably can while adoption is low, as it is for ODF and OOXML right now. Given that, you could probably plot a single exponential curve through both sets of data on that graph, showing that OOXML takeup is no slower than ODF was at the same time.

Surely one could plot many curves through this data. We’ll need to track this data for a few more months to get a better sense of it.

Even taking your suggested number of 5% Office 2007 penetration, that is 5% of 400 million estimated Office users, or 20 million users. Are we saying that 20 million users of Office 2007 have managed to put only 600 or so OOXML documents on the entire web in the past six months?

Is that credible?

Or are only a small fraction of Office 2007 users actually saving and distributing documents in OOXML format? Or have a lot more licenses been sold that aren’t resulting in users that are actually provisioned and using Office 2007? Or is Google just wrong in the results they are reporting?

This just doesn’t add up. You don’t say that your new product is beating all sales expectations and then have only 63 spreadsheet documents on the web half a year later.
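To make the mismatch concrete, here is a rough Python sketch of the arithmetic. The 400 million user estimate and the 5% penetration figure are the ones quoted in this thread, not independently verified numbers.

```python
# Figures quoted in the discussion above (estimates, not measurements).
office_users = 400_000_000
penetration_percent = 5
ooxml_documents_indexed = 603

# Integer arithmetic avoids floating-point noise in the user count.
office_2007_users = office_users * penetration_percent // 100

# How many Office 2007 users there would be for each indexed OOXML document.
users_per_document = office_2007_users / ooxml_documents_indexed

print(f"Estimated Office 2007 users: {office_2007_users:,}")
print(f"Users per indexed OOXML document: {users_per_document:,.0f}")
```

On these assumptions there would be tens of thousands of Office 2007 users for every OOXML document Google reports.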

The biggest one is that it’s “trust us, we know the numbers”. I clearly have a problem with that, since Microsoft is one of the less trusted companies in our industry.

What they can’t refute are the numbers that independent vendors have. I sell two products, one 100% targeted to those new file formats, and the other which targets both old and new file formats. I won’t give the numbers, but suffice to say they are not exactly in line with Microsoft’s enthusiasm, mind-blowing growing community, or any other gratuitous crap they are putting in their marketing blogs. Long story short, pure lies.

Let me further comment on their claims. They often refer to the massive 300,000 Office developers. Well, if you take a look at Office-related newsgroups, it will be obvious after a while that most of these guys are individuals, not groups sharing a common passion.

I guess their own marketing website, openxmldeveloper.org had so much traffic that they had to split in two… /sarcasm

On good days, actually, openxmldeveloper.org gets 2 new posts. And those answering there are making a living off of it (as I have explained already in the past). So much for the “open community”.

If you head over to their new marketing website, openxmlcommunity.org, you’ll realize a couple of things.

One, that pretty much the only thing you can do is click on “Join the community”. And when you do that, you are asked to sign the pro-OOXML petition. I guess that pretty much explains why they are doing this website in the first place: it’s just a URL to provide resilience for their shitty partners-in-crime petition.

Two, if you take a look at the list of companies that have signed the petition, and particularly French companies (I talk about what I know best), you will see that one of them is ILOG. ILOG is not exactly what you would call a company driven by a passion to share thoughts/ideas/knowledge/code in a community. ILOG is run by suits in and out, and their business is to sign consulting deals with large customers. You get the idea. What ILOG is doing here is backing Microsoft so that the next time Microsoft gets in touch with some government org or large company related to them, Microsoft can say “hey, see, ILOG is backing OOXML so it can’t be that bad!”, all while not telling the truth: that ILOG joined Microsoft to make those deals and create a false sense of backers.

I understand the real world is full of non-genuine backers, partners in crime that are ready to do anything. But really, it’s stupid how far it goes sometimes, and it’s insulting to everyone’s competency to recognize what’s genuine and what’s not. I guess there lies the true meaning of Microsoft “open” initiative, or lack thereof…

This analysis looks at days since standardization. What would the analysis look like if we counted days since the initial release of a supporting application? Even a beta or pre-standard implementation should count because if users have the capability to produce a document, they have the capability to publish. I remember using ODF well before ISO approved it.

I wonder how good a metric Google is. You don’t publish in a format if you don’t believe the readers are out there, unless you want to make a point of supporting the format. This may just mean there are militant ODF adopters who care about their chosen format while OOXML users give it a big yawn.

This line of thought just underlines the ties between OOXML and Office 2007. If people don’t believe Office 2007 is ubiquitous enough, they won’t care about publishing in this format, alleged third-party application support notwithstanding. ODF users, on the other hand, really care about vendor independence and interoperability and will publish in their format to promote this goal.

Monitoring these numbers in the coming month might not do as much good as you may want. It will be interesting to see how many OOXML documents are produced by Microsoft just to raise the counts.

Currently (May 11, 2:23 am) if we restrict the search to the microsoft.com domain, we find 28,800 doc, 981 xls and 14,000 ppt. In contrast we find 67 docx, 2 xlsx and 27 pptx. I would say adoption has been bleak on Microsoft’s own site. So much for the recommendation to convert existing documents to OOXML.

For the record I also find on microsoft.com 3 xlsm and no xlsb. This means there are more “macro enhanced” OOXML on Microsoft’s own site than genuine OOXML but they don’t use their new binary format yet.

It could be that early adopters of OOXML are savvy, and are in fact saving to older .doc, .xls and .ppt formats when publishing documents on the Internet, as they know the older formats are readable to more people. At some point, such publishers will think that enough people can read native OOXML and the number will explode.

One interesting additional fact is that 11% of the OOXML documents found come directly from Microsoft’s website.

In the long run, the reasons why OOXML documents are not appearing may or may not be important. A great deal of the “network effect” has to do with perception. If people see and use ODF documents on the web, especially with plugins from Firefox that let them use those without any other software, the perception could arise that ODF is “important”, and OOXML is not. If I were Microsoft, I’d be working like crazy to get allies to post any documents in both binary and OOXML formats, just to fuel that perception, but they have not been very successful at that as of yet.

Stats are nice. Today (only a day after you published your analysis) the numbers I find in Google are:

docx 525
xlsx 69
pptx 100

That is a 15% increase in a day. Or, let’s be fair, probably two days. And although it seems Ecma approved OOXML in December 2006 rather than July, I will just take your assumption that the difference is about a year. With an increase of 15% in OOXML documents per two days, my prediction is for 50 million OOXML documents in Google search to be reached next year… Clearly these stats support that!!
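The tongue-in-cheek extrapolation above is easy to reproduce. This is a hedged Python sketch that compounds the two observed totals, assuming (as the comment does) that the growth factor over two days holds indefinitely, which is precisely the assumption being mocked. The function names are hypothetical.

```python
import math

# Totals from the post and from the comment above.
before = 471 + 63 + 69    # 603 OOXML documents in the original post
after = 525 + 69 + 100    # 694 OOXML documents one or two days later

growth_per_period = after / before   # roughly a 15% increase
period_days = 2                      # the comment's "let's be fair" interval


def projected_count(days):
    """Projected count after `days`, if the same compound growth continued."""
    return after * growth_per_period ** (days / period_days)


def days_to_reach(target):
    """Days needed to reach `target` documents under the same assumption."""
    periods = math.log(target / after) / math.log(growth_per_period)
    return periods * period_days


print(f"Growth per {period_days} days: {growth_per_period - 1:.1%}")
print(f"Days to reach 50 million: {days_to_reach(50_000_000):.0f}")
print(f"Projected count after one year: {projected_count(365):.3g}")
```

Under this assumption 50 million would arrive in well under a year, and the one-year projection is absurdly large, which shows how little two data points can support.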

Wraith, I agree with you that OOXML was approved in December, not July. I was using American date conventions (12/7/2006), not European (7/12/2006), thus the confusion. Having two standards for the same thing only leads to confusion, as you’ve demonstrated.

Ben, thanks for the details on where the docs are coming from. I didn’t think of that.

I’ll take a look at a log-log chart as well, although I admit not being a fan of them. In a log-log chart, almost any data can be coaxed into a straight line, thus its power and thus the opportunity for misuse. Better in my mind to fit the data to models and check for goodness-of-fit.

So what growth curve would you expect? With a constant number of users producing documents at a steady rate you would expect linear growth in document count over time. Suppose the user base is also growing with time. Then the document count would be a time integral of the user growth function, right? If user growth is linear, then document growth is quadratic, etc.
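The relationship between user growth and document growth can be checked numerically. The following Python sketch uses made-up illustrative user counts (100 users, growing by 10 per day) and a hypothetical rate of one document per user per day; none of these are measured values.

```python
def cumulative_documents(user_counts, docs_per_user_per_day=1.0):
    """Running total of documents when each day's users each add documents."""
    total = 0.0
    totals = []
    for users in user_counts:
        total += users * docs_per_user_per_day
        totals.append(total)
    return totals


def differences(values):
    """Day-over-day change of a series (a discrete derivative)."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


days = range(10)
constant_users = [100 for _ in days]          # illustrative: fixed user base
linear_users = [100 + 10 * day for day in days]  # illustrative: linear growth

docs_constant = cumulative_documents(constant_users)
docs_linear = cumulative_documents(linear_users)

# Constant users: first differences are constant, so growth is linear.
print(differences(docs_constant))
# Linearly growing users: second differences are constant, so growth is quadratic.
print(differences(differences(docs_linear)))
```

A constant first difference indicates a straight line and a constant second difference indicates a parabola, which is the discrete analogue of the integral argument above.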

But we don’t have enough data points to really distinguish growth models at this point, so I’ll continue to track the data and post on this again when I can make more conclusions.

I find Google Trends interesting, but also wonder if it means anything significant. A graph of docx vs. odf searches shows a lot of interest in a few areas (cities and regions tabs) and almost none elsewhere.

I wonder if you have any thoughts about whether any deeper meaning can be read into measuring search volume.

Further analysis reveals that Google filetype:odt does recognize the ODF types, but filetype:docx is reported as “Format unrecognized”. The preview text displays the names of parts inside the ZIP file and not the body of the document.
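Both ODF and OOXML files are ZIP packages, so a tool that does not understand the format will see only the part names. As a rough sketch of how a program could tell the two apart, the Python below checks for the `mimetype` part that ODF packages carry and the `[Content_Types].xml` part that OOXML packages carry. This illustrates the package layouts only; it is not a description of how Google actually classifies files, and the helper names are my own.

```python
import io
import zipfile


def sniff_package(data):
    """Classify ZIP bytes as 'ODF', 'OOXML', or 'unknown' from their parts."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as package:
            names = package.namelist()
            # ODF packages store their media type in a part named "mimetype".
            if "mimetype" in names:
                media_type = package.read("mimetype").decode("ascii", "replace").strip()
                if media_type.startswith("application/vnd.oasis.opendocument."):
                    return "ODF"
            # OOXML packages declare their parts in "[Content_Types].xml".
            if "[Content_Types].xml" in names:
                return "OOXML"
    except zipfile.BadZipFile:
        pass
    return "unknown"


def make_zip(parts):
    """Build an in-memory ZIP from a {name: content} mapping (demo helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        for name, content in parts.items():
            package.writestr(name, content)
    return buffer.getvalue()


# Minimal stand-ins for real documents, just enough to exercise the check.
fake_odt = make_zip({"mimetype": "application/vnd.oasis.opendocument.text",
                     "content.xml": "<x/>"})
fake_docx = make_zip({"[Content_Types].xml": "<Types/>",
                      "word/document.xml": "<x/>"})

print(sniff_package(fake_odt), sniff_package(fake_docx))
```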

Angus, this appears to be a general problem with MSDN, nothing specific to OOXML. Try doing a search of site:msdn.com for any filetype, DOC, PDF, etc. Google doesn’t seem to see it.

It would be interesting to hear how Google determines document types. By content type headers in the HTTP response? Or inferring from file extensions? Certainly, depending on how this is done, a misconfigured web server could cause this to break. There may be other web sites hosting ODF documents that are similarly misconfigured.
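To show why the detection method matters, here is a hypothetical Python sketch of the two strategies mentioned above: trusting the HTTP Content-Type header, or falling back to the file extension. I do not know which approach Google uses; the function names and the `trust_header_only` switch are assumptions for illustration. The media types listed are the registered ones for these formats.

```python
import posixpath
from urllib.parse import urlparse

# Registered media types for the six formats discussed in this post.
MIME_TO_FORMAT = {
    "application/vnd.oasis.opendocument.text": "odt",
    "application/vnd.oasis.opendocument.spreadsheet": "ods",
    "application/vnd.oasis.opendocument.presentation": "odp",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "docx",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": "xlsx",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation": "pptx",
}
KNOWN_EXTENSIONS = set(MIME_TO_FORMAT.values())


def detect_by_header(content_type):
    """Map a Content-Type header value to a format, ignoring parameters."""
    media_type = content_type.split(";")[0].strip().lower()
    return MIME_TO_FORMAT.get(media_type)


def detect_by_extension(url):
    """Map the URL path's file extension to a format, if it is a known one."""
    path = urlparse(url).path
    extension = posixpath.splitext(path)[1].lstrip(".").lower()
    return extension if extension in KNOWN_EXTENSIONS else None


def detect(url, content_type, trust_header_only=False):
    """Prefer the header; optionally fall back to the extension."""
    by_header = detect_by_header(content_type)
    if by_header or trust_header_only:
        return by_header
    return detect_by_extension(url)


# A misconfigured server that serves an ODF file as a generic ZIP.
url = "http://example.org/report.odt"
print(detect(url, "application/zip", trust_header_only=True))
print(detect(url, "application/zip"))
```

With header-only detection the misconfigured server hides the document, while the extension fallback still finds it.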

Google Trends is tricky because there are so many alternate ways of describing the terms. For example, do you search for ODF, OpenDocument or Open Document? Similarly, is it OOXML, OpenXML, Office Open XML, or Open Office XML (incorrect, but often stated that way)? You would need to add up at least these variations to get a good sense of the trends.

In any case, that shows you the interest level in a format, not necessarily the deployment or use of a format.

It seems to me that, since every copy of Office 2007 is probably pinging Redmond every day, with all this Genuine Disadvantage, registration and auto update stuff, Microsoft could, if they wanted, tell us an exact number of running Vista or Office 2007 machines. Surely they know the exact number of activated Vista and Office 2007 users. And surely if it was good news they would have told us by now.

FWIW, I’ve just popped over to Brian Jones’ site and followed my nose to http://www.openxml.biz/ and downloaded the OpenXML Writer, both binary and source tree, and … they don’t mention any license terms. Not even to disclaim any license terms whatever and put it into the Public Domain.

If I download any such source tree from IBM or Sun, for example, the one thing I am not going to escape, is the license terms, or the repudiation thereof.

I get the feeling that this is more of a puppet show than a real attempt to engage anybody – still, I’ve offered them the opportunity to show their commitment to interoperability by being ready to accept bug reports from me when I try compiling it with Mono. I’m not sanguine about it.

“Wraith, I agree with you that OOXML was approved in December, not July. I was using American date conventions (12/7/2006), not European (7/12/2006), thus the confusion. Having two standards for the same thing only leads to confusion, as you’ve demonstrated.”

Wouldn’t the use of the sole ISO 8601 international date standard (2006-12-07) fix the problem once and for all?

In markup or for other computer-readable purposes, certainly. But for English prose the standard date format, at least in the US, is MM/DD/YY. But I do need to remember that I have an international readership and that I can’t allow ambiguities like that. So I’m giving some preference to “7 December 2006”, which is unambiguous and scans well. “2006-12-07” is very rarely used in English prose and, when I see it at least, requires more thought to process.

When, as an American, I even think of a date, I mentally do it as April 1st, 2000. Month, day, year. Any date, historical, personal, whatever, is mentally stored and retrieved in that order. Same with speech. I wonder if it is different in Europe? Do people process dates mentally in a different order? Or is it just a presentation difference when writing?

Just a tiny data point about dates and languages. I’m a native English speaker in the UK. For me, “April 10th, 2007” sounds less natural than “the 10th of April, 2007”. And it would generally be written 10/4/2007 here.

FWIW, I’ve suggested on various web sites that Google make ODF one of the standard documentation types, a la the option to convert PDFs into HTML, considering that the usual option, leaving the file formats as-is-where-is leaves one at the mercy of the persistent incompatibilities of the *.DOC file formats, which change with each new MS Office version.

So far I have no idea if anyone’s thought it worth thinking about; but if it was put into place, the de jure file format standard would become the de facto standard overnight.

Interesting article – I’ve experienced the flipside. I’ve already had 2 people send me documents in one of the newer Office formats that I can’t/don’t read. I’ve hardly ever had anyone send me an ODF file. I would have to assume that right now people are saving their new Office doc in older formats to put them on the web.