Interpreting the Imperial Network

Windows Installer: As anyone who has done .msi development knows, you will never find a more wretched hive of scum and villainy.

Visual Studio 2005 valiantly tries to make things easier by offering “Setup and Deployment” projects. This thing magically binds together the build outputs of other projects and burps out a plausible .msi file. Hoora……waitaminute, something’s not quite right here. Yeah… it turns out that if you want anything but the barest minimum of shoving files and registry keys onto target machines, you’re going to have to do some post-processing, son. Fortunately, Microsoft provides a handy COM API for torturing the .msi SQL database until it agrees to do your bidding.

What? Oh, sorry…you didn’t know that an .msi was basically a demented relational database crammed into a file? Congratulations, now you can share my nightmares.
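If that sounds like hyperbole, here’s a minimal sketch of treating an .msi as exactly that: it asks the `_Tables` system table for the name of every table in the file. It assumes Windows and a Python version that still ships the standard-library `msilib` module (deprecated in 3.11, removed in 3.13), which wraps the Windows Installer API directly rather than the COM automation interface mentioned above. `quote_ident` and `list_msi_tables` are hypothetical helper names, not part of any library.

```python
def quote_ident(name: str) -> str:
    """Wrap a table or column name in grave accents, as Windows Installer SQL allows."""
    if "`" in name:
        raise ValueError(f"identifier may not contain a grave accent: {name!r}")
    return f"`{name}`"


def list_msi_tables(msi_path: str) -> list[str]:
    """Return the names of all tables stored in an .msi file (Windows only)."""
    import msilib  # Windows-only; removed from the standard library in Python 3.13

    db = msilib.OpenDatabase(msi_path, msilib.MSIDBOPEN_READONLY)
    # _Tables is a system table with a single column, Name.
    view = db.OpenView(f"SELECT {quote_ident('Name')} FROM {quote_ident('_Tables')}")
    view.Execute(None)
    names = []
    while True:
        try:
            record = view.Fetch()
        except msilib.MSIError:
            break  # some Python versions raise here instead of returning None at the end
        if record is None:
            break  # end of the result set
        names.append(record.GetString(1))
    view.Close()
    return names
```

Run against the output of a Setup project, the list should include tables such as `File`, `Dialog`, and `Error`.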

But I come here not to complain about the .msi file format, nor Visual Studio. The main course of today’s rant will be the installer engine itself, msiexec. Specifically, Windows Installer 4, which led me on a merry chase today. I accidentally missed a dependency for one of my custom actions, and got the following lovely error:

The installer has encountered an unexpected error installing this package. This may indicate a problem with this package. The error code is 2869

I’ve been hacking on these beasties for a couple of years; this was not my first dance with 2869. In fact, the internet is filled with stories about it. This isn’t actually an error about what went wrong during the install; it’s an error about [what went wrong when the installer was trying to tell you [what went wrong during the install]]. This is what we call a masking error, meaning “Your installer is so broken, I can’t even tell you about it properly.” A detailed install log offers up:

DEBUG: Error 2869: The dialog ErrorDialog has the error style bit set, but is not an error dialog

Most forum threads about this error come from hapless end users trying to get a downloaded program to install, and from vendors supplying fixed versions. Everyone addresses the root cause, i.e. the error that actually gets thrown first and eventually leads to the 2869. Often it has to do with impersonation problems. Well and good, but I already knew what was wrong with my custom action. What I wanted, and couldn’t find anywhere, was someone who understood why the error reporting mechanism itself was failing. (Spoilers: I eventually found several right answers. They’re easier to find in retrospect, once you know what’s wrong.)

What could be wrong with the ErrorDialog? This guy came right off the truck from Visual Studio, and my tweaks never touched it. Nevertheless, I spent about an hour poring over the documentation, trying to find any possible detail that was different between my .msi and the spec. It all checked out totally fine. But no matter what I tried, not an error dialog. Not an error dialog. My kingdom for an error dialog!

It’s such a pointed error, you see, with so many subtle requirements, that I thought I must be missing something. Was it the phase of the moon, or something about the feng shui orientation of my laptop? In casting about for a solution I happened upon this note. It talked about adding an entry to the Error table, which was advice I hadn’t seen before:

In order to see the actual error, open the MSI with ORCA and add the following entry to the “Error” table.

1001 | “Error[1]: [2]”
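ORCA is fine for a one-off, but a Setup project regenerates the .msi on every build, so a hand edit is lost the next time you compile. Here is a minimal sketch of doing the same thing as a post-build step, assuming Windows and a Python version that still ships the standard-library `msilib` module (removed in 3.13). It inserts one row into the `Error` table, whose two columns are `Error` (the message number) and `Message` (the template), and it uses `?` parameter markers rather than splicing the template into the SQL text, so the template’s contents need no quoting. `add_error_template` and `ERROR_ROW_INSERT` are hypothetical names; the sketch also assumes the `Error` table already exists and has no row for that number, since otherwise the INSERT fails.

```python
ERROR_ROW_INSERT = "INSERT INTO `Error` (`Error`, `Message`) VALUES (?, ?)"


def add_error_template(msi_path: str, code: int, template: str) -> None:
    """Insert one row into the Error table of an existing .msi (Windows only).

    Example (hypothetical file name):
        add_error_template("MySetup.msi", 1001, "Error[1]: [2]")
    """
    import msilib  # Windows-only; removed from the standard library in Python 3.13

    db = msilib.OpenDatabase(msi_path, msilib.MSIDBOPEN_TRANSACT)
    view = db.OpenView(ERROR_ROW_INSERT)
    # Each ? marker is filled from the record's fields, in order.
    params = msilib.CreateRecord(2)
    params.SetInteger(1, code)
    params.SetString(2, template)
    view.Execute(params)
    view.Close()
    db.Commit()  # in transacted mode, nothing reaches the file without this
```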

My logs never showed a 1001 error code, and a missing entry in the Error table has nothing to do with whether the error dialog’s properties are correct. And yet, and yet… the page referred to 2869. With nothing to lose, I tried adding the entry. As if by magic, the error reporting immediately began working just fine. Total changes needed to the error dialog: zero. Total time wasted on this: one afternoon.
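Why would one template row matter so much? Windows Installer builds an error message by substituting a record’s fields into a template, so presumably, with no template for 1001, there was nothing to substitute into. The sketch below is a deliberately simplified model of that substitution, not the engine’s actual code: it replaces each `[n]` with field `n` and, as assumed here, leaves an empty string for a missing field, while ignoring the property references, escape sequences, and nesting that real formatting also handles. It further assumes, as the template suggests, that field 1 holds the error number and field 2 the message text. `format_error` is a hypothetical name.

```python
import re

# Matches a numbered field reference such as [1] or [2].
_FIELD_REF = re.compile(r"\[(\d+)\]")


def format_error(template: str, fields: dict[int, str]) -> str:
    """Replace each [n] in the template with field n; missing fields become ""."""
    return _FIELD_REF.sub(lambda m: fields.get(int(m.group(1)), ""), template)
```

For instance, `format_error("Error[1]: [2]", {1: "1001", 2: "<custom action message>"})` splices the error number and the custom action’s message into the quoted template.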

What happened? In this case, not only was 2869 masking the underlying error, but the Windows Installer engine itself was lying about the nature of the masking error and, as a side effect, hiding the real error code (1001) to boot! Why 2869, and not something like, “So listen, I see there’s no format string in the Error table for #1001… so regrettably I must now poop myself.”

I can totally imagine how it went down. The developer needs to implement a new error formatting behavior in version 4, but in the event that the .msi has broken error handling, he has to tell the user about it somehow. Adding a brand new error code would require changes up and down the source tree. It’s almost deadline already, and hey look: this error 2869 is pretty close, it’s about error dialogs not working. Surely anyone who gets that error will quickly understand what was meant. One line of code, and home free under the wire.