The compatibility constraints of error codes, episode 2

A customer reported an incompatibility in Windows 7: If A: is a floppy drive and they call Load­Library("A:\\foo.dll") and there is no disk in the drive, the Load­Library call fails with the error ERROR_NOT_READY. Previous versions of Windows failed with the error ERROR_MOD_NOT_FOUND.

Both error codes are reasonable responses to the situation. "The module couldn't be found because the drive is not ready." Programs should treat a failed Load­Library as a failed library load and shouldn't be sensitive to the precise reason for the error. (They can display a more specific error to the user based on the error code, but overall program logic shouldn't depend on the error code.)
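A minimal sketch of that advice in portable C, so it can run without Windows. The loader is passed in as a function pointer; `load_fn`, `try_load_plugin`, `describe_load_failure`, and the fake loaders are hypothetical names, and the `DEMO_` constants only mirror the numeric values of the corresponding Win32 codes. In a real program the loader would presumably wrap LoadLibrary and GetLastError.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins mirroring the Win32 values (assumption: a real program
   would take these from <windows.h> instead). */
enum {
    DEMO_ERROR_SUCCESS       = 0,
    DEMO_ERROR_NOT_READY     = 21,
    DEMO_ERROR_MOD_NOT_FOUND = 126
};

/* Hypothetical loader: returns 0 on success, otherwise an error code. */
typedef unsigned long (*load_fn)(const char *path);

/* The error code selects the wording shown to the user, nothing else. */
static const char *describe_load_failure(unsigned long code)
{
    switch (code) {
    case DEMO_ERROR_NOT_READY:     return "The drive is not ready.";
    case DEMO_ERROR_MOD_NOT_FOUND: return "The module could not be found.";
    default:                       return "The module could not be loaded.";
    }
}

/* Program logic: any nonzero code means "the plugin is unavailable". */
static bool try_load_plugin(load_fn loader, const char *path,
                            const char **message)
{
    assert(loader != NULL && path != NULL && message != NULL);
    unsigned long code = loader(path);
    if (code == DEMO_ERROR_SUCCESS) {
        *message = NULL;
        return true;
    }
    *message = describe_load_failure(code);
    return false;
}

/* Fake loaders simulating the old and new behavior for an empty drive. */
static unsigned long fake_loader_old(const char *path)
{
    (void)path;
    return DEMO_ERROR_MOD_NOT_FOUND;
}

static unsigned long fake_loader_win7(const char *path)
{
    (void)path;
    return DEMO_ERROR_NOT_READY;
}
```

With either fake loader, `try_load_plugin` returns false and the caller takes the same path; only the message differs.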

Fortunately, the customer discovered this discrepancy during their pre-release testing and were able to accommodate this change in their program before ever releasing it. A sigh of relief from the application compatibility team.

I'm not going to complain about the lack of documented GLE values per function, since I know that it is a hard problem and you should generally just bail on !success (except for LogonUser()). But it seems unlikely to me that some floppy (removable media?) code was added to the loader. I assume "not ready" is never going to happen for a fixed disk, so was the loader changed, or some code deeper down in the kernel?

I think the reason why they don't document the error codes really comes down to preventing compatibility constraints rather than it being a hard problem. If the list of errors is fully documented, then people will depend on that list. If the list then changes, all of a sudden an error may no longer be caught, or an error may slightly change meaning. The documentation usually points out the important errors, but that is about it. To be honest, I prefer it this way.

[You're right. I didn't phrase that right. Better would be to say that program logic should be robust to error codes. -Raymond]

It turns out that create file failing because the file already exists is significantly different from other cases. Otherwise, it's very hard to make distributed transactional systems on top of filesystems.
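Exclusive creation is one place where logic does branch on one specific failure. A minimal sketch using POSIX `open` with `O_CREAT | O_EXCL` rather than the Win32 API (the analogous Win32 call would be CreateFile with CREATE_NEW, which reports ERROR_FILE_EXISTS); `claim_file` and `claim_result` are hypothetical names.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

typedef enum { CLAIM_ACQUIRED, CLAIM_ALREADY_HELD, CLAIM_FAILED } claim_result;

/* Atomically create `path`; succeeds only if no file by that name existed. */
static claim_result claim_file(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd >= 0) {
        close(fd);
        return CLAIM_ACQUIRED;
    }
    /* "Already exists" means someone else owns the name: an expected outcome. */
    if (errno == EEXIST)
        return CLAIM_ALREADY_HELD;
    /* Every other cause is lumped together as a generic failure. */
    return CLAIM_FAILED;
}
```

Only the "already exists" case gets its own branch here; the remaining error codes are deliberately not distinguished.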

The Windows 7 people got it right, but as usual, compatibility trumps almost everything.

Perhaps the original designers justified "not found" vs "not ready" by assuming the OS would give the user a chance to recover from a "door open" error before the program ever got the return code. Then "not found" would truly be "not found".

If so, there developed a right-hand left-hand problem and something fell through the cracks. When the program started getting the one code to cover both conditions, it had to add the "is the door still open" logic to address a "not found" error.

(This problem may have been present from the beginning of time when two design groups failed to synchronize, or it may have appeared later when the OS decided to degrade the importance of floppy drives because hard drives had become ubiquitous.)

Much better to divide the situation into 2 parts so they can be handled more easily. Trying to discover if the call failed because the door is open or if the wrong diskette was inserted adds significant effort to the program. Kudos for the Win7 people trying to make life easier for applications!

(As one who wrote operating system software for mainframes, I can say we had the axioms "there can never be too much information in an error code" and "group codes for related reasons and sources together to allow quick filtering". So much of what we learned the hard way never made it into the heads of PC developers.)

This highlights (again) that the failure modes of a function are just as much part of the interface as its parameters and return value. The Java designers got it exactly right.

[If a function can fail in more than one way, and both failures apply, which one do you report? Does Java specify that if, say, Access Denied and Invalid Parameter both apply (say because you don't have access to the first parameter, and the second parameter is invalid), then one or the other must be raised in preference to the other? (Honest question.) -Raymond]
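To make the question concrete, here is a small sketch in C. Both variants are hypothetical (`copy_item_v1`, `copy_item_v2`, and the `ORDER_` constants, which only mirror Win32 numeric values, are made up for illustration). They differ solely in the order of their checks, yet they report different codes when both failures apply.

```c
#include <stdbool.h>

/* Stand-ins mirroring Win32 values (assumption, not taken from <windows.h>). */
enum {
    ORDER_OK                = 0,
    ORDER_ACCESS_DENIED     = 5,
    ORDER_INVALID_PARAMETER = 87
};

/* Variant 1 checks access to the first argument before validating the second. */
static unsigned long copy_item_v1(bool can_access_source, int count)
{
    if (!can_access_source) return ORDER_ACCESS_DENIED;
    if (count < 0)          return ORDER_INVALID_PARAMETER;
    return ORDER_OK;
}

/* Variant 2 validates the second argument first. */
static unsigned long copy_item_v2(bool can_access_source, int count)
{
    if (count < 0)          return ORDER_INVALID_PARAMETER;
    if (!can_access_source) return ORDER_ACCESS_DENIED;
    return ORDER_OK;
}
```

A caller that only tests for nonzero behaves identically with both variants; a caller that matches on a specific code does not.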

If I write a function that sits on top of any Windows API, and which passes the same error code back to my caller, all I can tell you about what it returns is "errors I explicitly coded, plus anything the underlying OS returns, including any lower layer components that may be invented tomorrow". And so recursively down the stack.

Ultimately, if you plug in a new device, that has the potential to change the error returns from my code.

The alternative, of course, is that I *don't* return the underlying error code directly. If I just return my own error code and discard what I got from the lower layers, that's hiding potentially useful information. If I return my own error code and also what I got from the lower layers, that gets unmanageable after two or three layers (been there, done that: VAX/VMS).
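A rough sketch of the "own code plus underlying code" option in C. Everything here is hypothetical: `cfg_status`, `cfg_error`, `config_open`, and the `lower_*` stand-ins for whatever the layer below returns.

```c
#include <assert.h>
#include <stddef.h>

/* Codes owned by this layer: a small, closed set that can be documented. */
typedef enum { CFG_OK = 0, CFG_UNAVAILABLE = 1 } cfg_status;

/* This layer's code plus whatever the layer below reported. */
typedef struct {
    cfg_status    status;     /* what callers should branch on */
    unsigned long underlying; /* diagnostic only; open-ended set of values */
} cfg_error;

/* Hypothetical lower layer: returns 0 on success, else an error code. */
typedef unsigned long (*lower_open_fn)(const char *path);

static cfg_error config_open(lower_open_fn lower_open, const char *path)
{
    assert(lower_open != NULL && path != NULL);
    unsigned long code = lower_open(path);
    cfg_error result;
    result.status = (code == 0) ? CFG_OK : CFG_UNAVAILABLE;
    result.underlying = code;   /* preserved for logs, not for logic */
    return result;
}

/* Fake lower layers returning different codes. */
static unsigned long lower_ok(const char *path)        { (void)path; return 0; }
static unsigned long lower_not_found(const char *path) { (void)path; return 2; }
static unsigned long lower_not_ready(const char *path) { (void)path; return 21; }
```

Callers see the same `status` whichever lower-layer code arrives. The catch described above still applies: every further layer would need to wrap this struct again.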

A better question is why "File Save" icons still use the image of a floppy. Microsoft, being famous for running usability tests, should have noticed that young users would have no idea what that means.

This is yet another smallish difficult problem to solve. My personal take is the following:

Design the API so that non-success results are not errors/exceptional. Then leave errors as clearly being undocumented and not having backwards compatibility constraints.

This is difficult. I did it for a family of internal APIs and while it was very successful, it still raises eyebrows. People don't get why you don't just check for ERROR_FILE_NOT_FOUND or catch the FileNotFound exception.

Part of the difficulty is that if there are multiple such non-success results, you need to differentiate them using some kind of codes/flags which ends up looking a lot like checking for specific error codes.

Searching an in-memory collection is a great case where returning NULL is a better result than returning ERROR_FILE_NOT_FOUND or throwing an exception.
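For instance, a lookup along these lines treats "not there" as an ordinary result. This is only an illustrative sketch; `entry` and `find_entry` are made-up names.

```c
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *name;
    int         value;
} entry;

/* Returns a pointer to the matching entry, or NULL when there is none.
   Absence is a normal outcome here, not an error to be decoded. */
static const entry *find_entry(const entry *items, size_t count,
                               const char *name)
{
    for (size_t i = 0; i < count; i++) {
        if (strcmp(items[i].name, name) == 0)
            return &items[i];
    }
    return NULL;
}
```

The caller writes `if (found != NULL)` and never consults an error code, so there is no error list to keep compatible.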

One unmentioned difficulty here is that while CreateFileW() may return ERROR_FILE_NOT_FOUND in the case that the named directory does exist but the instance file does not, it's actually not trivially guaranteed that that is the only case where ERROR_FILE_NOT_FOUND is returned. From my knowledge of the source code I'm not aware of any other cases, but you could imagine a filter driver, or someone using some clever technique to hijack the API (for good cause, mind you! These kinds of situations often start with good intentions…), that perhaps calls LoadLibrary(), and next thing you know you're getting ERROR_FILE_NOT_FOUND for some reason other than that the file isn't present in the directory.

Given the movement towards "developer productivity", I am curious whether such design issues will ever be addressed in any future computing platform, and, if they are not, whether this puts some kind of quantum limit on the correctness we can achieve. It's hard to imagine, say, 50 years from now deciding to "fix" this issue throughout the gobs and gobs of legacy code we'll have.

[If a function can fail in more than one way, and both failures apply, which one do you report? Does Java specify that if, say, Access Denied and Invalid Parameter both apply (say because you don't have access to the first parameter, and the second parameter is invalid), then one or the other must be raised in preference to the other? (Honest question.) -Raymond]

The documentation isn't always great, but they have something called the Technology Compatibility Kit (TCK) that you have to pass to be able to call your implementation Java™, and it does enforce these error ordering issues.