[Sbcl-devel] Some thread & signal safety issues

Some issues I've discovered, that I don't remember seeing on the list
before:
1. GC vs GC deadlock
We can get deadlocks if T1 is already in gc_stop_the_world when T2
tries to initiate another GC via interrupt_handle_pending: T2 has
SIG_STOP_FOR_GC blocked and will wait forever on *ALREADY-IN-GC* while
T1 waits for it to stop.
2. GC running user code inside signal handler.
*AFTER-GC-HOOKS* can run inside the SIGSEGV signal handler.
3. GC vs timer deadlock
We grab *scheduler-lock* inside the SIGALRM handler. This may or may
not happen during a WITHOUT-GCING section, leading to GC deadlocks.
4. Timers running user code inside signal handler.
5. Pthread functions not safe to use in signal handlers.
I'm looking at sorting these out, but if anyone has already burned a few
neurons on this, I'd be delighted to hear about their plans.
Cheers,
-- Nikodemus Siivola

On Monday 19 March 2007 16:08, Nikodemus Siivola wrote:
> Some issues I've discovered, that I don't remember seeing on the list
> before:
>
> 1. GC vs GC deadlock
>
> We can get deadlocks if T1 is already in gc_stop_the_world when T2
> tries to initiate another GC via interrupt_handle_pending: T2 has
> SIG_STOP_FOR_GC blocked and will wait forever on *ALREADY-IN-GC*
> while T1 waits for it to stop.
interrupt_maybe_gc_int unmasks SIG_STOP_FOR_GC before calling into Lisp,
so the scenario above is not sufficient to prove it broken.
> 2. GC running user code inside signal handler.
>
> *AFTER-GC-HOOKS* can run inside the SIGSEGV signal handler.
Yes, this has been there for ages. In general - even without considering
user code - we run way too many things from signal handlers. To solve
this async signal safety problem, arrange_return_to_lisp_function was
suggested as a solution, but I fail to see how it would change
anything.
> 3. GC vs timer deadlock
>
> We grab *scheduler-lock* inside the SIGALRM handler. This may or
> may not happen during a WITHOUT-GCING section, leading to GC
> deadlocks.
That's a good catch. There must be a whole class of similar, thorny
locks waiting to catch the unwary. Is the situation bleak enough as to
make us stick a without-interrupts into the definition of
without-gcing?
> 4. Timers running user code inside signal handler.
We don't have much choice but to execute those in a separate thread.
Maybe someone has an insanely clever stack frobbing idea that trumps
a_r_t_l_f.
> 5. Pthread functions not safe to use in signal handlers.
Well, most of them are not safe. The exception is semaphores:
http://www.gnu.org/software/libc/manual/html_node/POSIX-Semaphores.html
On the other hand, I seem to recall that futexes are async signal safe,
but cannot find a reference now.
> I'm looking at sorting these out, but if anyone has already burned a
> few neurons on this, I'd be delighted to hear about their plans.
>
> Cheers,
Gábor
> -- Nikodemus Siivola

Gábor Melis wrote:
>> 1. GC vs GC deadlock
>>
>> We can get deadlocks if T1 is already in gc_stop_the_world when T2
>> tries to initiate another GC via interrupt_handle_pending: T2 has
>> SIG_STOP_FOR_GC blocked and will wait forever on *ALREADY-IN-GC*
>> while T1 waits for it to stop.
>
> interrupt_maybe_gc_int unmasks SIG_STOP_FOR_GC before calling into Lisp,
> so the scenario above is not sufficient to prove it broken.
Ooops, missed that! *blush*
This was my interpretation for one of the hangs I have been investigating:
T1 is spinning around sched_yield in gc_stop_the_world.
T2 is hanging in a call to GET-MUTEX in SUB-GC, refusing to stop.
Perhaps my diagnosis was too hasty and the mutex in question is not
*ALREADY-IN-GC*, but something else -- but looking at this (don't have
a live hang to poke at right now) I cannot for the life of me think of
another mutex that T2 could be waiting on.
Need to think more.
>> 2. GC running user code inside signal handler.
>>
>> *AFTER-GC-HOOKS* can run inside the SIGSEGV signal handler.
>
> Yes, this has been there for ages. In general - even without considering
> user code - we run way too many things from signal handlers. To solve
> this async signal safety problem, arrange_return_to_lisp_function was
> suggested as a solution, but I fail to see how it would change
> anything.
Yes. I think using another thread on multithreaded builds, and keeping
the current state of affairs on unithreaded ones, seems reasonable.
>> 3. GC vs timer deadlock
>>
>> We grab *scheduler-lock* inside the SIGALRM handler. This may or
>> may not happen during a WITHOUT-GCING section, leading to GC
>> deadlocks.
>
> That's a good catch. There must be a whole class of similar, thorny
> locks waiting to catch the unwary. Is the situation bleak enough as to
> make us stick a without-interrupts into the definition of
> without-gcing?
Maybe. It certainly is looking pretty bleak. Bleak enough that there are
moments I think we should move WITHOUT-GCING out of SB-SYS and tack a %
in front of it...
>> 4. Timers running user code inside signal handler.
>
> We don't have much choice but to execute those in a separate thread.
> Maybe someone has an insanely clever stack frobbing idea that trumps
> a_r_t_l_f.
Agreed.
>> 5. Pthread functions not safe to use in signal handlers.
>
> Well, most of them are not safe. The exception is semaphores:
> http://www.gnu.org/software/libc/manual/html_node/POSIX-Semaphores.html
>
> On the other hand, I seem to recall that futexes are async signal safe,
> but cannot find a reference now.
Right, but on some platforms we use pthread_cond_wait and stuff to
implement futex_wait, so I think it is best to act as if they aren't
signal-safe.
(Apropos, there is some fishy-looking stuff in pthread_futex code, but I
haven't thought it through properly. I'll try and get back on that with
a few questions.)
Cheers,
-- Nikodemus