When two threads deadlock, gdb appears unable to display the stack of a
deadlocked thread.
A backtrace of the thread contains the following message:
Previous frame identical to this frame (corrupt stack?)
The example source code (see attached) isn't a perfect reproducer: gdb
displays most of the stack of the blocked thread before giving the error.
I've seen many cases where none of the stack is displayed.
To see the problem:
1) Compile the example: "g++ dl.C -o dl -lpthread"
2) "gdb dl"
3) "run"
4) wait a couple of seconds then press Ctrl-C
5) "thread 2"
6) "where"
----------
Action by: jrfuller
I followed the above instructions and get an error, but I am not sure
if the methodology is valid.
Here is the gdb output (note the SIGINT is my ctrl-c):
(gdb) run
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /home/juan/dl
[Thread debugging using libthread_db enabled]
[New Thread -1218549504 (LWP 25330)]
[New Thread 26778544 (LWP 25337)]
Program received signal SIGINT, Interrupt.
[Switching to Thread -1218549504 (LWP 25330)]
0x00ae56e1 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
(gdb) thread 2
[Switching to thread 2 (Thread 26778544 (LWP 25337))]#0 0x00ae56e1 in
__lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
(gdb) where
#0 0x00ae56e1 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
#1 0x00ae2797 in _L_mutex_lock_28 () from /lib/tls/libpthread.so.0
#2 0x00ae9a3c in __JCR_LIST__ () from /lib/tls/libpthread.so.0
#3 0x01989bb0 in ?? ()
#4 0x01989a98 in ?? ()
#5 0x080485c4 in thread_fn1 () at dl.c:13
Previous frame identical to this frame (corrupt stack?)
(gdb)
Escalating for assessment.
J
Issue escalated to Support Engineering Group by: jrfuller.
----------
Action by: gavin
I have reproduced exactly the behavior JohnRay has on an up2date RHEL3
box.
As for valid methodology, it is true that the two threads are
deadlocked, and the apparent stack corruption may be a necessary and
valid side effect of that deadlock with NPTL, and we may have to just
explain this to the customer. On the other hand, deadlocks are one of
the times when you most need an uncorrupted stack so you can debug
them, and so we should do what we can to protect the stack in these cases.
Issue escalated to Sustaining Engineering by: gavin.
Status set to: Accepted
----------
Action by: jrfuller
We have been able to reproduce this issue with the above test case.
Our initial assessment is that the stack corruption may be a necessary
and valid side effect of this specific deadlock within NPTL. However,
deadlocks are one of those times when you most need an uncorrupted
stack so you can properly debug the cause, so we will do what we can
to protect the stack in these cases.
We are still investigating the cause of this issue and will report
when we know more.
J
----------
Action by: adam.eastwick
Hi,
I did a test on my own and did not get this error when I ran the
above test case using LinuxThreads. Could this be a problem with the
NPTL threading library or interaction between NPTL and other tools?
I'll insert the text of my test below.
Thread 32769 (LWP 2013)]
[New Thread 16386 (LWP 2014)]
Program received signal SIGINT, Interrupt.
[Switching to Thread 16386 (LWP 2014)]
0xb759b074 in __pthread_sigsuspend () from /lib/i686/libpthread.so.0
(gdb) thread 2
[Switching to thread 2 (Thread 32769 (LWP 2013))]#0 0xb744f38a in poll ()
from /lib/i686/libc.so.6
(gdb) where
#0 0xb744f38a in poll () from /lib/i686/libc.so.6
#1 0xb7597d5e in __pthread_manager () from /lib/i686/libpthread.so.0
#2 0xb759802a in __pthread_manager_event () from
/lib/i686/libpthread.so.0
#3 0xb745808a in clone () from /lib/i686/libc.so.6
(gdb)
-A
Status set to: Waiting on Tech
----------
Action by: ezannoni
I suspect it's a problem with the debug information from glibc. I have
seen a few bug reports like this. The gdb team is investigating,
but I'll escalate this to bugzilla as well.