The Formats of Excel 2007

I’ve installed the new Office 2007. This isn’t my preferred platform. In fact I find I’m not using heavy-weight editors of any variety much. For every page I compose in a dedicated word processor I author perhaps 50 pages in emails, blogs or wiki’s. However, since I do have a license for Office 2007, and I am curious, I decided to take it for a spin. If you want to be a film critic, you’ve got to see the movies…

Here is a quick survey of what I saw in Excel 2007, concentrating on the file format support, my particular area of interest.

First, let’s look at the “Save As” dialog. As you can see from this screen capture, we have some new options:

The Default

The first choice saves in the default format. This is configurable under “Excel Options”, but by default this saves in the new Office Open XML (OOXML) format, with an “xlsx” file extension.

With Macros

The “Excel Macro-Enabled Workbook” option saves as an “xlsxm” extension. It is OOXML plus proprietary Microsoft extensions. These extensions, in the form of binary blob called vbaProject.bin, represent the source code of the macros. This part of the format is not described in the OOXML specification. It does not appear to be a compiled version of the macro. I could reload the document in Excel and restore the original text of my macro, including whitespace and comments. So source code appears to be stored, but in an opaque format that defied my attempts at deciphering it.

(What’s so hard about storing a macro, guys? It’s frickin’ text. How could you you screw it up? )

This has some interesting consequences. It is effectively a container for source code that not only requires Office to run it, but requires Office to even read it. So you could have your intellectual property in the form of extensive macros that you have written, and if Microsoft one day decides that your copy of Office is not “genuine” you could effectively be locked out of your own source code.

New Style Binary

The “Excel Binary Workbook” option caught me by surprise. This is not the legacy binary formats. This is not the new OOXML. This is a new binary format, with an “xlsb” extension. Similar to OOXML it has a Zip container file (the so-called Open Packaging Conventions container file format), but the payload consists (aside from a manifest) entirely binary files.

I can’t tell if they are some proprietary binary mapping of the OOXML XML, or whether this is an entirely new binary format unrelated to the XML format. In any case this format is entirely undocumented and is unreadable to anyone by Microsoft.

It is also interesting that Microsoft is positioning this format as the preferred one for performance and interoperability. The online help for Excel 2007 says:

In addition to the new XML-based file formats, Office Excel 2007 also introduces a binary version of the segmented compressed file format for large or complex workbooks. This file format, the Office Excel 2007 Binary (or BIFF12) file format (.xls), can be used for optimal performance and backward compatibility.

Old Style Binary

Find add-ins

This takes you to a page where you can download the “Microsoft Save as PDF or XPS” Add-in. Note that you are prompted to download an Add-in that provides support for both PDF and XPS. But if you hunt around a bit you can find another page where you can download just one format or the other, which is what I did, installing just the PDF support. This added a new option, “PDF” to the Save As dialog.

Other Formats

This brings up a dialog where you can choose from the previously mentioned formats as well as the several legacy export formats, including:

XML Data

Web Page

Text

Unicode Text

XML Spreadsheet 2003

Excel 5.0/95 Workbook

CSV

Formatted Text

DIF

SYLK

Summary

My overall impression was soured a bit by the large number of crashes I experienced. Indeed Excel crashed on exit on almost every session. This was dozens of crashes over the course of an afternoon. This will need to be fixed before I would trust it with my data.

Another curiosity was a legacy binary document that gave the following error message whenever I tried to save it to the new OOXML format:

It did not get this message when I saved it back to the binary format. So evidently I’m losing something when moving to OOXML, whatever “Line Print settings” are. So much for the claims of 100% backwards compatibility…

My examination also put to rest any lingering hope I had that Microsoft had fundamentally changed their position on proprietary file formats and has decided to follow in the paths of openness. The new proprietary binary format and the undocumented ways that macros are encoded put any hope of that to rest.

1/22/07, A quick update: Microsoft’s Doug Mahugh helped track down and fix the crash problem I had earlier reported when exiting Excel. This is a bug in the”Send to Bluetooth” COM Add-in that Excel was loading at startup. After disabling that Add-in, I’m no longer crashing.

Needless to say, this article has not received much praise from MS. Interesting, since they are usually quick to point out any effort in their direction.

“I can’t tell if they are some proprietary binary mapping of the OOXML XML”

It’s not one-to-one mapping. The binary format has more information in some places. Conversedly, the XML format has more information such as namespaces in some places too.

“In any case this format is entirely undocumented and is unreadable to anyone by Microsoft.”

An official paper was promised for Q1 2007. But its length would vary a lot depending on whether it will simply lists the BIFF12 records with little details of how they relate to others. Or if this will be comprehensive, and will have extra details that are missing in the ECMA specs. And there are a lot : for instance, there is a new mechanism to attach VBA macros (those bin parts) to worksheets or workbooks, …

Stephane Rodriguez post is interesting. XML, broadly speaking, is a way to encapsulate instructions. The purpose of XML is to make these instructions “human readable”. So why do I have to write a program to decipher the VB Macro or other data?

If the VBA macro and other binary parts are not disclosed, how is another vendor going to reproduce the document? We are not talking here about application-specific stuff that is naturally not expected to be carried from one application to another, such as the location of menu bars etc. We are talking about functionalities that users expects to be reproduced by another application.

Please note that I am not against embedding binary data in the zip pocket. It will be horrendously ugly if jpeg file is not binary data. This is embedding binary “instructions” which users expects to be portable across different applications inside the zip file.

This example, if anything, serves to show why proprietary extensions should not be allowed in file formats.

Did you test MS office 2007 on Vista or on XP. I found the the XP version to crash on some installations regularly as well but on others installtions to function fine. On Vista I have not encoutered a crash (yet).

I would certainly stay away from using MS office 2007 in a proffesional environment for at least 6 months but at home or in a test environment it seems very useable.

A bit anoying that the PDF add-in is not added by default. Adobe seems to be very weird about their ‘Open’ format.

I definitly would prefer that Microsoft add unprotected macro’s as XML parts and only add protected/runtime macro’s as encrypted XML or as binary.

A couple of points:1. Excel VBA macros have no other use except in Excel. They are not interoperable with other software, and thus they do not need an open standard. Toyota engine modifications do not work on Fords, and Apple executables do not run on Windows. Making Excel VBA work on other spreadsheets would mean either mean changing how the latter work internally (functions, hooks, etc.) or an creating an interpreter from scratch (which is such a huge task that MS is not doing it for Mac Office.) Given that, why do we need an open VBA format?

2. The binary format is not Excel’s new default format. Other programs do not need to support it: it is not a document exchange format. It is an internal format for the few cases where XML does not suffice for performance reasons. Most databases, including FOSS ones, use proprietary internal formats for exactly the same reason. They also provide document interchange (export) formats. Excel is doing nothing differently here.

First, Excel 97 is BIFF8 ; Excel 2000 is BIFF8 + a number of records ; Excel XP is BIFF8 + another number of records ; Excel 2003 is BIFF8 + yet another number of records + new OLE streams (XML data maps).

Well, Excel 97-2003 is actually BIFF8 + yet (again) another number of records + new OLE streams (XML data maps). The difference with Excel 2003 is that it stores BIFF12 records as new BIFF8 records, in strategic places. Unfortunately, it’s not a one-to-one map. For instance, it does not store content for rows and columns beyond the regular 64k grid, meaning that there is a loss of content. Plus the degradation due to the limited 56 colors, the lack of support for the new data bars or smart tables. And so on…

Yoon Kit, who at Microsoft has said that ODF would “confuse the market” because it is another file format to deal with? That doesn’t make sense, since ODF was around before Open XML, and everyone at Microsoft has been consistent in saying that choice is a good thing.

As for the XLSB format, it’s simply an option for users who have huge complex spreadsheets and are willing to forego standards in the name of performance. The kinds of spreadsheets that XLSB is used for are not viable in either Open XML or ODF, and those users have decided that they’d rather optimize performance than use an open standard format. Again, the user gets to choose, and the default format of Open XML is always there for Office users.

Rob, not sure what to say about Excel crashing dozens of times for you. I’m running it on Vista with no problems, and I’ve not heard that feedback before. What’s something specific that causes a crash? Do you get an error message?

Hal, This was Office 2007 running on Windows XP. My laptop is a almost a year old now, so it is not powerful enough to run Vista. Previously I had Office 2003 and did not get this crash.

Anonymous, Your logic is circular and can be summarized as, “No one else can read Excel macros, thus there is no reason to document how to do so”. But I can imagine a good number of reasons why someone would want this capability. Considering the virus vector provided by scripts in Office documents, I could see a security company might want to sell a product that scanned incoming Office documents at the gateway and identified potentially malicious macros based on what API’s they call. You could define sandboxes and perhaps only let in documents with macros that limited themselves to a specific subset of API calls. One could also imagine a company offering a service where documents could be uploaded and scripts inserted or updated to accomplish a particular task.

There are plenty of innovative things that could be done in this area, but by having a proprietary binary format for the scripts, Microsoft locks everyone else out. This is unfortunate considering the code was entered by the user as plain text to start with.

Damsel in distress – did someone dick with the damsel? Okay geniuses (and I mean that with respect, as I am not) My “bible” for work – a 400+ page Excel Workbook (not on a MAC), suddenly has reverted every date to whatever Mr. Rob’s very interesting but lost on me article re Gregorian and The Church calendar etc. to the 1904 option. However, when I look up the options, it has not been changed. So if someone, wanting to sabotage me, changed it to 1904, saved, reopened then changed it back to 1900, would the dates -4 years still appear in my workbook? Now, if changed it 1904, (which is incorrect for my platform) the dates return their correct year. Or, is this some virus or bug within Excel that suddenly reared its ugly head? Your input and help just might save me from tears (at work, which is so unprofessional!) Gracias

This save as menu is interesting for what it means to governements and other organisations that want to implement a policy of always using a ISO standard document format for corporate records. The user interface makes it both easy and attractive to choose among two or three formats (including the old binary format) that are not standard at all.

This user interface is close to be a worst case scenario for implementing such standard format policies because its makes impossible to obtain even a reasonable degree of user compliance.

The fastest way to load a large data file depends on the access patterns of the application, reads versus writes, sequential versus random access, etc. In some cases the fastest way is to use memory mapped files and just use the Windows memory manager to page in what you need when you need it. I’ve done that before with some success. Of course, interoperability, both with other applications as well as future and past versions of your own application, suffers when you map the format too closely to the application’s internals.

The bigger question is this: What has the industry failed to do, that end-users think 200MB spreadsheets are preferable to databases? Have we failed to make databases easy enough to use? Is this a matter of users not wanting to explicitly model data up front?

I get exactly the same crash-on-exit problem with Excel 2007. I’m running WindowsXP Pro SP-2 on a ThinkPad x-31. My Excel install is upgraded from Office XP. I have Excel 2007 on two other systems, also upgraded from OfficeXP and also on WidowsXP SP2 and Excel works flawlessly. So who knows… But you’re not the only one getting this annoying crash.

I am becoming very nervous about Excel 2007’s own formats and although I work in Excel 2007 I work in compatibility mode, saving as an Excel 93-2007 workbook. I believe I found a bug where if the compatibility checker finds a 2007 specific feature, then thereafter some analysis toolpak functions like EOMONTH get trashed as !#VALUE errors if saved as a 97-2003 workbook. I can’t be sure of the exact cause, but it means that operating interchangeably in both file modes feels precarious.

I have the same crash problem with excel 2007 running on XP, the problem may not be linked to Btsendto_office.dll as this dll is not present on my system. If someone Knows how to fix the problem XP environment, I would be very pleased to know about it. Kingson

This the file created under XL 2003329,,,,,,,,,,,,,,A956,,,,,,,,,,,,,,A994,,,,,,,,,,,,,,A1061,,,,,,,,,,,,,,E1697,,,,,,,,,,,,,,A2205,,,,,,,,,,,,,,A2688,,,,,,,,,,,,,,A3144,,,,,,,,,,,,,,A3291,,,,,,,,,,,,,,A4658,,,,,,,,,,,,,,A4984,,,,,,,,,,,,,,A

>> A couple of points:1. Excel VBA macros have no other use except in Excel. They are not interoperable with other software, and thus they do not need an open standard. Toyota engine modifications do not work on Fords, and Apple executables do not run on Windows. Making Excel VBA work on other spreadsheets would mean either mean changing how the latter work internally (functions, hooks, etc.) or an creating an interpreter from scratch (which is such a huge task that MS is not doing it for Mac Office.) Given that, why do we need an open VBA format?

Not sure what the car example has to do with open source or with monopolies. Microsoft is a closed source based monopolist. I also consider open source to be a part of the context for discussion.

Simply, the car companies are not monopolies. If they were though (forgetting for a minute that they would be regulated much more), how would you describe open source for cars unless there was a manufacturer that opened up all the designs and chips used in the car (boiling them down to cots components). So the car industry neither has open source nor monopolies. Your example was horrible.

Anyway, the “solution” is simple: provide the source for macros (and compile that after they are loaded) or at least some sort of byte codes. OR, define the binary specs and then other products can build a compatibility layer of some sort.

Of course, this would STILL BE A GAME. Microsoft has every business incentive to extend the standard in a closed fashion. Practicality dictates they would do that anyway as long as they control the apps and the OS and insist on integration as a way to define the value of their products. This would happen no matter what protocol/spec is defined.

The way out is not to rely on a closed source monopolist as your vendor.

>> Most databases, including FOSS ones, use proprietary internal formats for exactly the same reason. They also provide document interchange (export) formats. Excel is doing nothing differently here.

FOSS databases are open source. This means the “internal formats” are defined openly (if a bit implicitly). Also, since when are macros something that should not be interchanged across products? To say they are not means that part of the document format is closed. No part of a format is closed when we use FOSS. No part need be closed when we use closed apps but inevitably this will happen, all the more so when the vendor is a monopolist since their position is strengthened when they actually violate the standard.

In short, FOSS is the way to make sure that everything is open and can be reproduced in other apps. Also, in theory, internal formats can be specified for closed source apps, too. That this would interfere with the obscurity cloud behind which closed source vendors want to hide in the first place is another story.

I don’t know if I wrote this here already, but here goes my experience with excel 2007…

I just installed excel 2007 and when I open an old file (created on excel 2000 I guess), Excel show the message “Line Print Settings” as an incompatible resource if I try to save in the new 2007 format (.XLSX). But it saves the file, ok. The true problem happens when I try to save this file back as a excel97-2003 file (.XLS), Excel will simply close and refuse to save the file. If I try to run “prepare – check for compatibility problems” excel will close suddenly too.

I am simply unable to recover a file converted to the new format. I guess this could be a potential problem to much people.

I had a similar problem with an old worksheet. I was crashing a lot when editing it. I did a save as, and chose the new Office 07 format. It would throw an error trying to save that, stating there were problems with the Line print settings.

My final fix was to open the document in Office 2003. thejn I saved it as an Excel 4.0 WORKBOOK. Workbook being bold because the first option is worksheet, which didn't work.

Once it was in Excel 4 format I was able to open that with Office 2007, which promptly told me that the document was unreadable, but that it would try to recover the document, at which point it would ask me where to save the new one.

That resulted in a readable document that could be edited and saved without error. The only real problem with the resulting document was external links were severed, and charts were no longer correctly sized.

Beats entering them all again.

My $0.02 on 07 is that is is a nice new interface that takes a lot of getting used to after using the old interface for roughly 15 years. Considering that there's practically no new functionality for the average user, it is pretty unacceptable that this wasn't implemented as a skin that could be turned on or off at will.

Just ran into this “line print settings” problem today. Saving an old (probably Excel 2000) worksheet to Excel 2007 format. In my case I don’t have any macros or VB code at all, so I have no clue where Excel thinks these “line print settings” are stored. In any event, I saved the worksheet first as Excel 2003 XML format, which “leaves out all incompatible settings”. Then, opened the XML worksheet and saved as Excel 2007 workbook format. The sheet works fine and gives no error. Hope this alternate approach might be of help to someone.