A process shutdown puzzle, Episode 2

A customer reported that their program would very sporadically
crash in the function
Close­Thread­pool­Cleanup­Group­Members.
The customer was kind enough to provide a stack trace at the point
of the crash:

The customer wondered,
"Could the problem be that my cleanup group does not have
a callback?
MSDN seems to suggest that this is okay."

The exception being thrown is
STATUS_INVALID_PARAMETER,
but that doesn't really say much.

But that's okay, because the smoking gun isn't the exception
being raised.
It's in the stack.

Do you see it?

The code is calling
Close­Thread­pool­Cleanup­Group­Members
from inside DllMain
while handling the
DLL_PROCESS_DETACH notification.
Looking further up the stack, you can see this was triggered by
a call to ExitProcess,
and now all the stuff you know about
how processes exit
kicks in.

For example, that the first thing that happens is that all threads
are forcibly terminated.

That's your next clue.

Observe that the customer's DLL is trying to communicate with the
thread pool during process termination.
But wait, all the threads have already been terminated.
It's trying to communicate with a nonexistent thread pool.

The thread pool realizes,
"Hey, like I've already been destroyed.
I can't do what you ask because there is no thread pool any more.
You want me to block until all currently executing callback
functions finish,
but those callback functions will never finish (if they even exist at all)
because the threads hosting their thread pool got destroyed.
Not that I can tell whether they are executing or not,
because I am already destroyed.
The only options are to hang or crash.
I think I'll crash."

The customer needs to restructure the program so that it either
cleans up its thread pool work before the
ExitProcess,
or it can simply skip all thread pool operations when the reason for
the DLL_PROCESS_DETACH is process termination.

@Cesar: The code tearing down the thread pool is in a DLL, as indicated by __DllMainCRTStartup present on the call stack.

@Joshua: TerminateProcess is not an acceptable API for normal process shutdown. And it's certainly not an acceptable API for a DLL to use to terminate its host process under normal circumstances. If you believe either of those statements are false then I feel for your users.

One of the comments over in the "Clean-up functions can't fail because…" post says to tell the user when something bad happens, since maybe he can do something. Such as, "Out of memory? Close some other application."

I suppose I've opened a bit of a can of worms, but I'm interested in whether there's more to the story. For example, how did litware.dll got loaded? If contoso.exe called CoCreateInstance(ILitWareInterface) and then leaked a reference (perhaps not even calling CoUninitialize()) then the appropriate thing to do is fix the bug in contoso.exe.

Contrariwise, if litware.dll is being loaded by another process calling CreateRemoteThreadEx I am not sure how it's supposed to work, and I am willing to believe that your recommendation of "skip all cleanup if you're in process termination" is all that needs to be fixed.

[Contoso called LitWare_DoSomethingAwesome, and as part of its work, LitWare_DoSomethingAwesome queued up some background tasks. Then Contoso decided that it was done and exited. It had no idea that LitWare_DoSomethingAwesome was going to schedule additional work onto the thread pool. (For example, maybe the LitWare folks decided that to improve performance, the DoSomethingAwesome function would return immediately and finish doing the awesome stuff in the background.) -Raymond]

[I guess I don't understand your point, then. If you roll back at startup, then your service is unavailable for the entire lifetime of the process! -Raymond]

Easy. One creates checkpoints (committed transaction points) periodically, and if the proces crashes, it's restored to the saved checkpoint by rolling back all uncommitted transactions.

[I think we're talking about different things. You're talking about recovering from an app that crashes because it can't clean up properly. I'm talking about why the app is crashing at cleanup in the first place. -Raymond]

I see. So Litware assumed that their work items would complete before the process exited, and did not provide any way for Contoso to wait on the awesome thing being completed, or to abort doing the awesome thing because it's time to shut down now.

[Sure, that's one scenario. I'm sure you can be creative and come up with others. Just look at all the people who want to create a worker thread in their DLL_PROCESS_ATTACH. -Raymond]

Moral of the story: it would really really help if the error messages explained what went wrong. Just allocate that error condition one unique code, that you never, not *ever* reuse anywhere else. Then document what causes it. Much better than STATUS_INVALID_PARAMETER, isn't it?

You'll probably counter this by saying that back then the value was only 16 bits and you wouldn't have enough values for all the distinct error conditions. Let alone consistently generate non-colliding values. That's fair enough, I suppose.

The idea still holds though. If this were an exception-based API, I'd love to get an exception saying "Cannot wait for callbacks to finish because thread pool is already destroyed, and the callbacks will never return" instead of "An argument is invalid". The customer (who seems to be one of the smarter ones) would not have to ask for support then.

I'm ordinarily a .NET programmer who doesn't normally have to deal with DllMain or any of that sort of thing, so please correct me if I'm wrong, but it seems that the best way of doing this sort of thing is as follows:

The external library should have some sort of "Init" function (i.e. "InitLitware") that the host process should call from its own "main"/"WinMain" function (or anywhere really), rather than the library doing any sort of heavy initialisation in its "DllMain" function.

The host process should then be able to call some of the library's functions (i.e. "DoSomethingAwesome").

Lastly, the external library should also have some sort of "Cleanup" function (i.e. "CleanupLitware") that the host process should call when it is ready (i.e. before exiting from "main"/"WinMain", or calling "ExitProcess" or "TerminateProcess"), rather than the library doing any sort of cleanup in its "DllMain" function. This function should block the host process until it is complete (which includes waiting for any pending operations), so that the problem above can't occur.

Unfortunately "best" is open to many different interpretations, and some programmers think that it is best for the people who use their library not to worry about initialisation and cleanup, and this is done in DllMain. In some ways you can understand this because if a library has some form of persistant state and the developer forgets to signal to that library that it is shutting down, then that persistant state may become corrupt.

Well anyway, I agree with you, the executable knows better about the state of the program. It is also documented in the DllMain documentation that when the process is terminating, all threads besides the one you are currently in has been terminated. It is also documented that you shouldn't do anything too complex in DllMain, like communicate with other threads. But people don't read, people don't listen, they just do what they want and then get confused when the program crashes.

@Neil:

What motivated you to ask that question? Since what you get back from CreateThreadpool is a pointer to a user mode structure, can you give any guarantee that, after you close the thread pool, it doesn't clean up the memory right away? Accessing anything after you said you don't want to use it any more is a bug.

Say, I have such a litware, of which the vendor is long gone, that dare to do something with inappropiate timing on DllMain, but I still have to use the library. What advise can be provided to minimize to damage?

For the kind of thing that Raymond described, it might be good enough to load the DLL with LoadLibrary and unload it with FreeLibrary before you exit the process. That way the threads are still alive when the library tries to destroy the thread pool.