The ways people mess up IUnknown::QueryInterface, episode 4

One of the rules for IUnknown::Query­Interface is so
obvious that nobody even bothers to state it explicitly as a rule:
"If somebody asks you for an interface,
and you return S_OK,
then the pointer you return must point to the interface the caller
requested."
(This feels like the software version of
dumb warning labels.)

During compatibility testing for Windows Vista,
we found a shell extension that behaved rather strangely.
Eventually, the problem was traced to
a broken IUnknown::Query­Interface
implementation which depended subtly on the order in which
interfaces were queried.

The shell asked for the IExtract­IconA and
IExtract­IconW interfaces in the following
order:

// not the actual code but it gets the point across
IExtractIconA *pxia;
IExtractIconW *pxiw;
punk->QueryInterface(IID_IExtractIconA, &pxia);
punk->QueryInterface(IID_IExtractIconW, &pxiw);

One particular shell extension would return the same pointer to
both queries;
i.e.,
after the above code executed,
pxia == pxiw even though neither interface
derived from the other.
The two interfaces are not binary-compatible,
because IExtract­IconA::Get­Icon­Location
operates on ANSI strings, whereas
IExtract­IconW::Get­Icon­Location operates
on Unicode strings.

The shell called pxiw->Get­Icon­Location,
but the object interpreted the szIcon­File as
an ANSI string buffer; as a result, when the shell went to look at it,
it saw gibberish.

Further experimentation revealed that if the order of the two
Query­Interface calls were reversed,
then pxiw->Get­Icon­Location worked as expected.
In other words,
the first interface you requested "locked" the object into that
interface, and asking for any other interface just returned
a pointer to the locked interface.
This struck me as very odd; coding up the object this way
seems to be harder than doing it the right way!

Instead of implementing both IExtract­IconA and
IExtract­IconW, my guess is that they implemented
just one of the interfaces and made it alter its behavior based
on which interface it thinks it needs to pretend to be.
It never occurred to them that the single interface might need
to pretend to be two different things at the same time.

The right way of supporting two interfaces is to actually implement
two interfaces and not write a single morphing interface.

A different shell extension had a compatibility problem that also was
traced back to a dependency on the order in which the shell
asked for interfaces.
The shell extension registered as a context menu extension,
but when the shell tried to create it, it got E_NO­INTERFACE
back:

This was kind of bizarre.
I mean, the shell extension went to the effort of registering itself
as a context menu extension,
but when the shell said,
"Okay, it's show time, let's do the context menu dance!"
it replied,
"Sorry, I don't do that."

The vendor explained that the shell extension relies
on the order in which the shell asked for interfaces.
The shell used to create and initialize the extension like this:

(Of course, it's not written in so many words in the code;
the various parts are spread out over different components and
helper functions,
but this is the sequence of calls the shell extension sees.)

The vendor explained that their shell extension will not respond
to any shell extension interfaces (aside from
IShell­Ext­Init)
until it has been initialized,
because it is at that point that they decide which extensions
they want to support.
Unfortunately, this violates the first of the
four explicit rules for IUnknown::Query­Interface,
namely that the set of interfaces must be static.
(It's okay to have an object expose different interfaces
conditionally, as long as it understands that once it says yes or no
to a particular interface,
it is committed to answering the same way for the lifetime of the object.)

What would have been the correct solution to the last vendor's problem? Should they have made the decision about supporting each interface during the first call QueryInterface() rather than in Initialize()?

[What's the point of registering an object as a context menu extension if you're going to say, "Nah, I don't support IContextMenu"? The decision to support IContextMenu should have been made when you registered the object! -Raymond]

Ah, that's the complaint. Not that they couldn't dynamically decide later what the REST of their interface set looked like, but that they should always support for the one they registered as in the first place.

You are violating your own contract, "This is the first method that the Shell calls after it creates an instance of a property sheet extension, shortcut menu extension, or drag-and-drop handler." QueryInterface must be the second.

[The spec was not written by language lawyers but by people who assume you can read sentences in context. The context here is among the various shell-specific interfaces and methods. After all, you can also get an IMarshal request as part of CoCreateInstance. -Raymond]

This stuff must be frustrating for development. The more I read about the weird games people play in shell extensions the more I want to re-implement the open & save dialogs just so that I don't load any weird DLLs into my memory space.

@Peter: Before you do that, try Delphi. It makes COM much, much simpler. Microsoft may have invented COM, but it was the Delphi team that truly understood it. A lot of the "COM horror stories" I see on here and elsewhere would either not happen at all or simply be harder to get wrong than to get right in Delphi.

@Mason: I think you are rather missing the point that Raymond made in the article: even in raw COM, these mistakes required more effort on the part of the programmer than simply getting it right. These shell extensions would have been broken in Delphi too, but it would have taken the idiots longer to figure out how to work around Delphi's support, so they used raw COM instead. :(

[Yeah, it's like these people went out of their way to do something weird. I mean, really, adding weird logic to your QI? -Raymond]

Again, I call for the list of shame. Windows should keep (and dynamically update) a list of components with problems, and if an user installs one, there should be a message with the description of the vendor's sin. This would put a heat on the ISVs.

I've been here long enough to answer that one, Alex. What if the vendor doesn't exist anymore? What if the problem is in version 2000 of their product and they've sold 3 new versions since then? And so on…

If Microsoft does the "right thing" and breaks code that is broken, people blame Windows when things stop working. If Microsoft does the "right thing" and fixes other people's broken code, people won't stop complaining about "bloat" and people won't fix their code. And even if they do everything perfectly, people expect them to use a time machine to port new code back to when the 286 was a shiny new and powerful processor.

There are times I'm glad I don't work for Microsoft.

However, there is one obvious interim workaround. The language lawyers need to stand in front of the regular lawyers, when the revolution comes.

I agree that it is good to re-order COM calls to avoid breaking stuff that user depend on (even if they are poorly written).

However, since this isn't documented anywhere it makes it a nightmare for anyone else trying to host the same objects.

ActiveX control hosting is a great example. Unless you painstakingly work out which order IE queries & initialises interfaces, through trial & error and writing debug/test objects that you feed to IE to work out what it does, you'll run into all sorts of problems like the one described.

Sometimes the order of calls required to make one component work breaks another, too. There are some bits in the registry which certain objects are flagged with to deal with that, but I suspect some of it is also hardcoded within the hosting components.

Since COM objects only tend to be tested against one host (IE for ActiveX, Explorer for shell extensions, etc.), it's easy for them to accidentally rely on non-contractual behaviour of those hosts. And since new versions of those hosts are then tested against those buggy COM objects to avoid breaking them, those behaviour quirks will never change. But nobody else knows what those behaviours are. They're effectively an undocumented part of the API.

Of course, very few people write their own hosts and it absolutely makes sense to push the complexity towards the host rather than the hosted objects. I'm not arguing against that. I just wish all this voodoo was documented somewhere.

It doesn't have to be formal documentation or part of an API contract. Unofficial guidance would be enough, although if it is something that is never going to change because it would break things then it might as well be part of the API contract. We might not want it to be, but it is.

I'm going to assume that these gotchas eventually make it into documentation updates, but I was wondering if there are programs that can test your components to see if they depend on these sorts of undocumented behaviours.

[The rules for IUnknown are already spelled out in the existing documentation. It doesn't seem necessary to list every possible way of violating them. -Raymond]

It's okay for people to not update to the latest and greatest version of Windows. If a company looks at the next version of Windows and finds out that their favorite shell extension doesn't work with it they can A) ask the vendor for an upgrade, B) not upgrade the OS. At this point there's a cost/benefit decision. Do the new features of the OS justify an upgrade? Are they worth more than the shell extension? Does something else provide the functionality that we need in the shell extension, even if it's not as nice?

While I love the work that Windows does to be backwards compatible, this is one of those situations where I feel it would be a benefit to Windows, in the long run, if they didn't do this. Because then, people will start to expect greater things from new versions of Windows, instead of thinking a new version of Windows is lipstick put on another layer of old bloat.

I love how a well written program written decades ago can run on modern Windows. I don't like how a poorly written program written decades ago can run on modern Windows.

[And what if the vendor is out of business, or the vendor wants to charge money, or (as is common with LOB applications) you are the vendor and there's nobody left at your company who understands the code (or you lost the source code). It seems that you have forgotten that Windows exists in order to make customers happy and productive, not to piss off its customers in pursuit of the world's most architecturally pure and beautiful operating system. -Raymond]

Jader3rd: Are you aware that, even nearly 10 years after its release and 5 years after being superceded, Windows XP is still the most popular PC OS on the planet? And that's with MS doing everything humanly possible to make it compatible with old apps!

It turns out that the answer to "Do the new features of the OS justify an upgrade?" is "No" at least 99% of the time. If MS took the "It's okay for people to not update to the latest and greatest version of Windows" attitude, Windows 7 might be 10% awesomer, but that wouldn't help you as a developer because 90% of your customers wouldn't be using it.

For some industries (such as Banking/Stock trading related ones) there is security requirement that the company should not use any of the OSs that no longer has security updates. When WinXP goes out the extended support period in 2014, while some of the larger companies can still pay for extension beyond the extended support period, most other companies that have software that can't run on Vista or above will face difficult decision: whether to risk system rewrite of old components or allow the client OS continue to run without further security fixes, hence risk having a virus friendly network environment inside the company. (Note that if people write malicious software that take advantage of OS vulnerabilities to spread, antivirus softwares may not be able to stop them effectively.

So for the benefit of most OS users, people will want Microsoft to continue make those software "depending on undocumented implementation details" to run on newer versions.

@Cesar: Exactly what I was thinking. This thing was probably designed back in the 9x/NT split and 9x would only use the A version. I'm sure NT4&5 would ask for the W version first and if it failed, it would fall back to A. Why this was changed in Vista makes little sense to me since you want to use the W version if the object implements it. (Even if it was rewritten from scratch it would still make sense to ask for W first) Why Vista wants both versions is a even bigger question…

When receiving a QueryInterface for IID_IExtractIconA, if that pointer is NULL, it creates an IExtractIconA and stores it there. The same for IExtractIconW. If the pointer is already set, it is returned directly. Basically, simple lazy initialization of the part of the object which deals with whatever IExtractIcon* does. A perfectly legitimate thing to do, and not that complex.

The only mistake in this code is to use a single pointer for both interfaces, perhaps due to a mistaken assumption that Unicode hosts will only want to create the IExtractIconW, and that ANSI hosts will only want to create the IExtractIconA. The fix is simply to use two instance variables instead of one.

@WndSks: My guess is that the code in Explorer was changed to gets all the interfaces it might want to use at once, so all that code is in one place and separated from the logic of actually using the object.

That's a perfectly valid thing to do. QueryInterface on an in-process object is virtually free, so if you can simplify the code by doing that then it makes sense to do so.

This obviously depends how the object is being used, but there's nothing that weird about getting an interface on an object and then deciding not to use it in the end.