[Sbcl-devel] threads vs gc fixes

Threaded SBCL 0.8.21 locks up on:
(progn
  (defun waste (&optional (n 100000))
    (loop repeat n do (make-string 16384)))
  (loop for i below 10 do
    (format t "LOOP:~A~%" i)
    (force-output)
    (sb-thread:make-thread
     #'(lambda ()
         (waste)))
    (waste)
    (gc)))
This is mostly because the delivery of SIG_STOP_FOR_GC is unreliable. There
are a few problems:
1) A thread struct with pid 0 may be on all_threads, and a signal sent to
pid 0 goes to every process in the process group of the current process
(I am not sure this was actually triggered).
2) gc_start_the_world does not wait for the signaled thread to process
SIG_STOP_FOR_GC, so the signal may queue up and thread->state gets out of
sync with reality, which confuses the gc (this is not likely to happen
either).
3) Threads start up in STATE_STOPPED. If gc_stop_the_world sees this state it
does not send a signal to stop the thread, which is bad enough, but
gc_start_the_world later does send one, which is even worse and leads to
total confusion (perhaps this is what the above test triggers).
4) If a gc hits after a thread is cloned but before it is
arch_os_thread_init'ed, we get a nice memory fault or two (why?). If all else
is fixed, this can be triggered by:
(progn
  (defun waste (&optional (n 100000))
    (loop repeat n do (make-string 16384)))
  (defparameter *aaa* nil)
  (loop for i below 10 do
    (format t "LOOP:~A~%" i)
    (force-output)
    (sb-thread:make-thread
     #'(lambda ()
         (let ((*aaa* (waste)))
           (waste))))
    (let ((*aaa* (waste)))
      (waste))
    (gc)))
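Problem 1 follows from the semantics of kill(2): a pid of 0 addresses the
whole process group of the caller. A minimal sketch of a guard, assuming
threads are signaled by pid as described above, could look like this. The
helper name signal_thread_safely is hypothetical and is not taken from the
SBCL sources; rejecting negative pids as well is an extra precaution of the
sketch, since those also address process groups.

```c
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not SBCL's actual code.
 * kill(0, sig) signals every process in the caller's process group,
 * and negative pids address process groups too, so refuse anything
 * that is not a plain positive pid. */
int signal_thread_safely(pid_t pid, int sig)
{
    if (pid <= 0)
        return -1;          /* thread struct not fully set up yet */
    return kill(pid, sig);  /* 0 on success, -1 on error */
}
```

Calling it with signal number 0 only checks that the target exists and sends
nothing, which makes the guard easy to try out without stopping anything.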
Although my understanding of SBCL, signals, gc and threading is patchy at
best, the attached patch attempts to fix these problems as follows:
- threads start in STATE_STARTING
- create_thread holds a thread_start_lock until the started thread enters
STATE_RUNNING
- gc_stop_the_world acquires and gc_start_the_world releases thread_start_lock
- gc_start_the_world waits until all threads leave STATE_STOPPED
- test for pid 0 before sending a signal
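The locking scheme in the list above can be sketched roughly as follows. This
is an illustration only, not the attached patch: it uses pthreads and C11
atomics instead of clone() and signals, and every name ending in _sketch, as
well as thread_entry, is made up for the example. The signaling and resuming
of threads is left out and only marked by comments.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

enum { STATE_STARTING, STATE_RUNNING, STATE_STOPPED };

struct thread {
    atomic_int state;      /* one of the STATE_* values */
    pthread_t os_thread;
};

/* Held while a thread is being started and for the whole time the
 * world is stopped, so a gc never sees a half-initialized thread. */
static pthread_mutex_t thread_start_lock = PTHREAD_MUTEX_INITIALIZER;

static void *thread_entry(void *arg)
{
    struct thread *th = arg;
    /* per-thread initialization would happen here */
    atomic_store(&th->state, STATE_RUNNING);
    return NULL;
}

int create_thread_sketch(struct thread *th)
{
    int rc;
    pthread_mutex_lock(&thread_start_lock);
    atomic_store(&th->state, STATE_STARTING);
    rc = pthread_create(&th->os_thread, NULL, thread_entry, th);
    if (rc == 0) {
        /* keep the lock until the new thread has left STATE_STARTING */
        while (atomic_load(&th->state) == STATE_STARTING)
            sched_yield();
    }
    pthread_mutex_unlock(&thread_start_lock);
    return rc;
}

void gc_stop_the_world_sketch(void)
{
    pthread_mutex_lock(&thread_start_lock);
    /* here: signal every thread that is in STATE_RUNNING */
}

void gc_start_the_world_sketch(void)
{
    /* here: resume threads and wait until none is in STATE_STOPPED */
    pthread_mutex_unlock(&thread_start_lock);
}
```

Because the same lock covers thread startup and the stopped world, a thread
is either not yet on the list or already in STATE_RUNNING whenever
gc_stop_the_world_sketch gets past its first line.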
The test forms do not fail anymore, my paserve app runs better, and the
threaded tests of cl-ppcre still fail, only later :-(.
Cheers, Gabor

On Thursday 07 April 2005 18:26, Gábor Melis wrote:
> The test forms do not fail anymore, my paserve app runs better and the
> threaded tests of cl-ppcre fail later :-(.
I have cleaned up the patch a little: if no threads can be started, then
gc_{stop,start}_the_world might as well forget about defending against new
threads being linked onto all_threads. It was tested against 0.8.21.23 (with
Nikodemus's finalization fixes). It now finishes the threaded cl-ppcre tests
and runs my paserve app with 60 simultaneous clients without lockups and at a
reasonably stable pace (6-7s/1000 requests), which is a great improvement over
the previous value of 6-150s plus lockups.