Eugene
added a comment - 11/May/11 14:02 Steve, I have reviewed src\qpid\client\ConnectionImpl.cpp for last version and don't find any changes(may be I'm wrong).
I think the main problem is in using static object:
IOThread& theIO()
{
static IOThread io(SystemInfo::concurrency());
return io;
}
which is destroyed on shutdown and tries wait for thread completing in destructor.

Thread 2 has nothing left to do except the Windows DLL_THREAD_DETACH
handshake in all associated DLLs. It has clearly finished from the
point of view of the Windows implementation of the Qpid Thread class.

One way to fix this would be to use a private synchronization
mechanism that completes before the call to endthreadex(). I will put
a patch together for review along these lines unless someone objects.

Cliff Jansen
added a comment - 16/May/11 08:41 I can reproduce the problem with a program that is not linked against
qpidclientd but loads and unloads it directly. A deadlock occurs
processing the global static destructors.
Thread 1 blocks waiting for thread 2 to exit/die. It also owns the
infamous loader lock (thanks to the use of FreeLibrary())
ntdll!NtWaitForSingleObject+0xa
KERNELBASE!WaitForSingleObjectEx+0x79
qpidcommond!qpid::sys::Thread::join+0x66 [thread.cpp @ 82]
qpidclientd!qpid::client::`anonymous namespace'::IOThread::~IOThread+0x141
qpidclientd!`qpid::client::`anonymous namespace'::theIO'::`2'::`dynamic atexit destructor for 'io''+0x26
qpidclientd!_CRT_INIT+0x2dc
qpidclientd!__DllMainCRTStartup+0x11f
qpidclientd!_DllMainCRTStartup+0x31
ntdll!LdrpUnloadDll+0x27d
ntdll!LdrUnloadDll+0x4a
KERNELBASE!FreeLibrary+0x1d
foo!main+0x112
Thread 2 is a zombie trying to die, if only it could get the loader lock
ntdll!NtWaitForSingleObject+0xa
ntdll!RtlpWaitOnCriticalSection+0xe8
ntdll!RtlEnterCriticalSection+0xd1
ntdll!LdrShutdownThread+0x72
ntdll!RtlExitUserThread+0x38
MSVCR90D!_endthreadex+0x33
qpidcommond!`anonymous namespace'::runRunnable+0x3b [thread.cpp @ 34]
MSVCR90D!_callthreadstartex+0x25
MSVCR90D!_threadstartex+0xbd
kernel32!BaseThreadInitThunk+0xd
ntdll!RtlUserThreadStart+0x1d
Deadlock: Thread 1 does not wakeup until Thread 2 really dies, hence
neither thread can make progress.
Thread 2 has nothing left to do except the Windows DLL_THREAD_DETACH
handshake in all associated DLLs. It has clearly finished from the
point of view of the Windows implementation of the Qpid Thread class.
One way to fix this would be to use a private synchronization
mechanism that completes before the call to endthreadex(). I will put
a patch together for review along these lines unless someone objects.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:https://reviews.apache.org/r/904/
-----------------------------------------------------------

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:https://reviews.apache.org/r/904/#review849
-----------------------------------------------------------

I'd also add that we don't use the hungarian variable notation either so the 'p' should be stripped (possibly from other places too)

Andrew

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:https://reviews.apache.org/r/904/#review849
-----------------------------------------------------------

The patch was published in the review board (https://reviews.apache.org/r/904/) where there was a comment regarding dynamic creation and deletion of the map of Qpid Thread objects:

Alan Conway:

Why the pointer? Just declare

std::map<unsigned, ThreadPrivate::shared_ptr> pQpidThreads;

and let std::map take care of the memory management.

Andrew Stitcher:

I'd also add that we don't use the hungarian variable notation either so the 'p' should be stripped (possibly from other places too)

Cliff Jansen:

The intended logic was that Thread::current() should have defined behavior, even if called from a static destructor. Otherwise, you need a way to guarantee the std:map destructor is never called too soon.

The original JIRA's deadlock was in the context of a static destructor calling Thread::join().

On re-examiniation, the use of a non-POD lock mechanism is inconsistent with the stated goal. Thank-you for the heads up. I will work with Steve to address this and the naming problem.

On yet further examination, even the PODMutex is not guaranteed to have defined behaviour during the static destructor resolution at process exit or library unload. The "static deinitialization order fiasco" has no current solution in the Boost mutex implementation (see http://www.justsoftwaresolutions.co.uk/articles/implementing_mutexes.html) despite attempts to address.

I see three possible ways forward:

1. Ship with the proposed fix (updated to address the review comments). It provides improved functionality, but a known potential random outage.

2. Use a Windows CRITICAL_SECTION as the mutex, and initialize and release it outside the C++ runtime in DllMain. This is a common pattern in Windows and is in fact used in the DTC plugin module in the wcf subtree.

3. Use a spin lock (via interlocked operations) to manage the exit/unload condition. Such a spin lock holds no system resources to leak and requires no explicit shutdown. This is the mechanism adopted by the Boost shared_ptr implementation of atomic operations.

If anyone has a further suggestion, or a preference for option 1 please let me know. Otherwise I will put up examples of options 2 and 3 to the review board.

Cliff Jansen
added a comment - 21/Jun/11 08:56 The patch was published in the review board ( https://reviews.apache.org/r/904/ ) where there was a comment regarding dynamic creation and deletion of the map of Qpid Thread objects:
Alan Conway:
Why the pointer? Just declare
std::map<unsigned, ThreadPrivate::shared_ptr> pQpidThreads;
and let std::map take care of the memory management.
Andrew Stitcher:
I'd also add that we don't use the hungarian variable notation either so the 'p' should be stripped (possibly from other places too)
Cliff Jansen:
The intended logic was that Thread::current() should have defined behavior, even if called from a static destructor. Otherwise, you need a way to guarantee the std:map destructor is never called too soon.
The original JIRA's deadlock was in the context of a static destructor calling Thread::join().
On re-examiniation, the use of a non-POD lock mechanism is inconsistent with the stated goal. Thank-you for the heads up. I will work with Steve to address this and the naming problem.
On yet further examination, even the PODMutex is not guaranteed to have defined behaviour during the static destructor resolution at process exit or library unload. The "static deinitialization order fiasco" has no current solution in the Boost mutex implementation (see http://www.justsoftwaresolutions.co.uk/articles/implementing_mutexes.html ) despite attempts to address.
I see three possible ways forward:
1. Ship with the proposed fix (updated to address the review comments). It provides improved functionality, but a known potential random outage.
2. Use a Windows CRITICAL_SECTION as the mutex, and initialize and release it outside the C++ runtime in DllMain. This is a common pattern in Windows and is in fact used in the DTC plugin module in the wcf subtree.
3. Use a spin lock (via interlocked operations) to manage the exit/unload condition. Such a spin lock holds no system resources to leak and requires no explicit shutdown. This is the mechanism adopted by the Boost shared_ptr implementation of atomic operations.
If anyone has a further suggestion, or a preference for option 1 please let me know. Otherwise I will put up examples of options 2 and 3 to the review board.

Steve Huston
added a comment - 23/Jun/11 21:28 Reassigning this to Cliff at least until the review and new patch are resolved.
Cliff, I still can't see your comments on the review board... can anyone else (Alan, Andrew)?

I'd also add that we don't use the hungarian variable notation either so the 'p' should be stripped (possibly from other places too)

OK... this was previously invisible because I missed the <Publish> step

The intended logic was that Thread::current() should have defined behavior, even if called from a static destructor. Otherwise, you need a way to guarantee the std:map destructor is never called too soon.

The original JIRA's deadlock was in the context of a static destructor calling Thread::join().

On re-examiniation, the use of a non-POD lock mechanism is inconsistent with the stated goal. Thank-you for the heads up. I will work with Steve to address this and the naming problem.

I have a fixed version I will upload to a separate review... coming soon.

Cliff

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:https://reviews.apache.org/r/904/#review849
-----------------------------------------------------------

jiraposter@reviews.apache.org
added a comment - 30/Jun/11 20:39
On 2011-06-16 12:56:46, Alan Conway wrote:
> /trunk/qpid/cpp/src/qpid/sys/windows/Thread.cpp, line 85
> < https://reviews.apache.org/r/904/diff/1/?file=21139#file21139line85 >
>
> Why the pointer? Just declare
>
> std::map<unsigned, ThreadPrivate::shared_ptr> pQpidThreads;
>
> and let std::map take care of the memory management.
Andrew Stitcher wrote:
I'd also add that we don't use the hungarian variable notation either so the 'p' should be stripped (possibly from other places too)
OK... this was previously invisible because I missed the <Publish> step
The intended logic was that Thread::current() should have defined behavior, even if called from a static destructor. Otherwise, you need a way to guarantee the std:map destructor is never called too soon.
The original JIRA's deadlock was in the context of a static destructor calling Thread::join().
On re-examiniation, the use of a non-POD lock mechanism is inconsistent with the stated goal. Thank-you for the heads up. I will work with Steve to address this and the naming problem.
I have a fixed version I will upload to a separate review... coming soon.
Cliff
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/904/#review849
-----------------------------------------------------------
On 2011-06-15 01:08:18, Steve Huston wrote:
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/904/
-----------------------------------------------------------
(Updated 2011-06-15 01:08:18)
Review request for qpid.
Summary
-------
Keeps track of Qpid runnable threads and other threads, ensuring that rundown doesn't deadlock.
This addresses bug QPID-3256 .
https://issues.apache.org/jira/browse/QPID-3256
Diffs
-----
/trunk/qpid/cpp/src/qpid/sys/windows/Thread.cpp 1132733
Diff: https://reviews.apache.org/r/904/diff
Testing
-------
Qpid regression test suite.
Thanks,
Steve

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:https://reviews.apache.org/r/987/
-----------------------------------------------------------

Review request for qpid.

Summary
-------

This is the same logic as the preceding version with naming fixes and refinements to DLL cleanup.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:https://reviews.apache.org/r/987/#review947
-----------------------------------------------------------

My concern with DllMain is that it precludes building qpid as static libraries (and having this logic still work).

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:https://reviews.apache.org/r/987/#review949
-----------------------------------------------------------

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:https://reviews.apache.org/r/904/#review948
-----------------------------------------------------------

> My concern with DllMain is that it precludes building qpid as static libraries (and having this logic still work).

Good point. I was so focused on the bug as presented I lost complete sight of this valid use case.

I suppose I could do away with locks and maps altogether with use of TLS storage. In this case the only leaked resource would be the TLS slot on multiple uses of LoadLibrary/FreeLibrary, which is obviously not relevant in the static build case. So I should be able to use logic to free the slot in DllMain and comment out that code in a static build. Does this seem reasonable?

Cliff

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:https://reviews.apache.org/r/987/#review947
-----------------------------------------------------------

jiraposter@reviews.apache.org
added a comment - 01/Jul/11 20:08
On 2011-07-01 11:44:51, Steve Huston wrote:
> My concern with DllMain is that it precludes building qpid as static libraries (and having this logic still work).
Good point. I was so focused on the bug as presented I lost complete sight of this valid use case.
I suppose I could do away with locks and maps altogether with use of TLS storage. In this case the only leaked resource would be the TLS slot on multiple uses of LoadLibrary/FreeLibrary, which is obviously not relevant in the static build case. So I should be able to use logic to free the slot in DllMain and comment out that code in a static build. Does this seem reasonable?
Cliff
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/987/#review947
-----------------------------------------------------------
On 2011-07-01 03:08:08, Cliff Jansen wrote:
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/987/
-----------------------------------------------------------
(Updated 2011-07-01 03:08:08)
Review request for qpid.
Summary
-------
This is the same logic as the preceding version with naming fixes and refinements to DLL cleanup.
Cleanup now uses Windows DllMain function to allows cleanup after C++ runtime static destructors.
This addresses bug qpid-3256.
https://issues.apache.org/jira/browse/qpid-3256
Diffs
-----
/trunk/qpid/cpp/src/qpid/sys/windows/Thread.cpp 1141687
Diff: https://reviews.apache.org/r/987/diff
Testing
-------
Qpid cmake run_tests
Thanks,
Cliff

> My concern with DllMain is that it precludes building qpid as static libraries (and having this logic still work).

Cliff Jansen wrote:

Good point. I was so focused on the bug as presented I lost complete sight of this valid use case.

I suppose I could do away with locks and maps altogether with use of TLS storage. In this case the only leaked resource would be the TLS slot on multiple uses of LoadLibrary/FreeLibrary, which is obviously not relevant in the static build case. So I should be able to use logic to free the slot in DllMain and comment out that code in a static build. Does this seem reasonable?

Yes it does - thanks!

Steve

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:https://reviews.apache.org/r/987/#review947
-----------------------------------------------------------

jiraposter@reviews.apache.org
added a comment - 01/Jul/11 20:41
On 2011-07-01 11:44:51, Steve Huston wrote:
> My concern with DllMain is that it precludes building qpid as static libraries (and having this logic still work).
Cliff Jansen wrote:
Good point. I was so focused on the bug as presented I lost complete sight of this valid use case.
I suppose I could do away with locks and maps altogether with use of TLS storage. In this case the only leaked resource would be the TLS slot on multiple uses of LoadLibrary/FreeLibrary, which is obviously not relevant in the static build case. So I should be able to use logic to free the slot in DllMain and comment out that code in a static build. Does this seem reasonable?
Yes it does - thanks!
Steve
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/987/#review947
-----------------------------------------------------------
On 2011-07-01 03:08:08, Cliff Jansen wrote:
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/987/
-----------------------------------------------------------
(Updated 2011-07-01 03:08:08)
Review request for qpid.
Summary
-------
This is the same logic as the preceding version with naming fixes and refinements to DLL cleanup.
Cleanup now uses Windows DllMain function to allows cleanup after C++ runtime static destructors.
This addresses bug qpid-3256.
https://issues.apache.org/jira/browse/qpid-3256
Diffs
-----
/trunk/qpid/cpp/src/qpid/sys/windows/Thread.cpp 1141687
Diff: https://reviews.apache.org/r/987/diff
Testing
-------
Qpid cmake run_tests
Thanks,
Cliff

Cliff Jansen
added a comment - 05/Jul/11 18:52 New patch. Now uses Thread local storage (TLS) instead of a std::map to locate the ThreadPrivate structure. DllMain baggage still there to manage the FreeLibrary case.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:https://reviews.apache.org/r/987/
-----------------------------------------------------------

(Updated 2011-07-05 17:54:17.154114)

Review request for qpid.

Changes
-------

Now uses Thread local storage (TLS) instead of a std::map to locate the ThreadPrivate structure. DllMain baggage still there to manage the FreeLibrary case.

Summary
-------

This is the same logic as the preceding version with naming fixes and refinements to DLL cleanup.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:https://reviews.apache.org/r/987/#review965
-----------------------------------------------------------

Ship it!

Cliff,

Thanks for your effort on this.

Without this patch test cases would deadlock under NUnit probably the same as under Excel.
With the patch the test cases exit and NUnit is happy.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:https://reviews.apache.org/r/987/#review981
-----------------------------------------------------------

Ted Ross
added a comment - 12/Jul/11 19:11 This commit appears to break the mingw32 build. There are two problems that I'm aware of:
1) In the class ThreadPrivate, the initializers in the constructors are not in the same order as the declarations.
2) error: 'OpenThread' was not declared in this scope from line 71.

Justin Ross
added a comment - 19/Jul/11 19:40 This landed after we branched for release, so I've adjusted the fix version to reflect that.
I'd like to also consider introducing this for 0.12. Any thoughts or objections?

I wanted to see the patch run through Steve's nightly review board automated build and test.

It started failing precisely at the same time I did the checkin to trunk, not a good sign. That plus the recent mingw build problem makes me wonder if it is a good idea to add it to the release at this time. The person who reported the JIRA has an earlier version of the patch that works for him.

But I can't say for sure if there is a problem. It may be differences between my setup and Steve's, i.e. Xp versus WS2008, or different versions of boost or Visual Studio, or 32 vs 64 bit. Or the Riverace build may be failing for some unrelated reason.

Cliff Jansen
added a comment - 19/Jul/11 20:17 I wanted to see the patch run through Steve's nightly review board automated build and test.
It started failing precisely at the same time I did the checkin to trunk, not a good sign. That plus the recent mingw build problem makes me wonder if it is a good idea to add it to the release at this time. The person who reported the JIRA has an earlier version of the patch that works for him.
But I can't say for sure if there is a problem. It may be differences between my setup and Steve's, i.e. Xp versus WS2008, or different versions of boost or Visual Studio, or 32 vs 64 bit. Or the Riverace build may be failing for some unrelated reason.

Justin Ross
added a comment - 21/Jul/11 15:49 Cliff, I'd like to produce RC2 today, and if we want to go ahead with this patch, I'd like to get it in RC2 (RC3 would be too late). How does that strike you?