Re: [Sbcl-devel] MacOSX Thread Failures

Hi Juho,
I wonder if this would help to turn up issues? Either way, I'm running into another problem. The change I made allows the tests to run further, but threads.impure.lisp now hangs. I've verified that the following chunk of the test hangs whether my change is in or backed out. (For now I've put it in a separate file to isolate it for debugging; the extra format statements are my own debugging additions.)
(in-package "SB-THREAD")
(use-package :test-util)
(use-package "ASSERTOID")

(defun wait-for-threads (threads)
  (mapc (lambda (thread) (sb-thread:join-thread thread :default nil)) threads)
  (assert (not (some #'sb-thread:thread-alive-p threads))))

(defun alloc-stuff () (copy-list '(1 2 3 4 5)))

(progn
  (let ((thread (sb-thread:make-thread (lambda () (loop (alloc-stuff))))))
    (let ((killers
           (loop repeat 4 collect
                 (sb-thread:make-thread
                  (lambda ()
                    (loop repeat 25 do
                          (sleep (random 0.1d0))
                          (princ ".")
                          (force-output)
                          (sb-thread:interrupt-thread thread (lambda ()))))))))
      (wait-for-threads killers)
      (format t "past wait-for-threads~%")
      (sb-thread:terminate-thread thread)
      (format t "past sb-thread terminate-thread~%")
      (wait-for-threads (list thread))
      (format t "past wait-for-threads (list thread)")))
  (sb-ext:gc :full t))
The problem is that it gets stuck in (wait-for-threads killers), which, as far as I can tell at the moment (I could be way wrong), is stuck in with-system-mutex. It looks like the expected test failures on a :SB-THREAD enabled MacOS build might have been hiding this issue: I'm only seeing it now because threads.impure.lisp used to error out way before it got to this point.
I wanted to see whether 1.0.15 runs into the same issue, so I tried building that older version of SBCL on my Snow Leopard system, but I'm running into all kinds of errors.
I'm not quite stuck yet, but if anyone has any suggestions or pointers I'm all ears :-)
Thanks,
Glenn
V. Glenn Tarcea
gtarcea@...
On Feb 27, 2010, at 9:36 PM, Juho Snellman wrote:
> Maybe a better solution would be to switch OS X (or all platforms?) to run the impure tests with run-program rather than fork off the current sbcl. There should already be code in the test framework for this, currently only enabled for win32.
>
> --
> Juho Snellman

On 27 February 2010 06:11, Glenn Tarcea <gtarcea@...> wrote:
> I've been trying to track down some of the failing tests on MacOSX with :sb-thread enabled.
>
> Right now clos-add-remove-method.impure.lisp always fails for me. I have it narrowed down to always failing if I run the following tests:
> run-tests.sh threads.pure.lisp clos-add-remove-method.impure.lisp
>
> I've modified threads.pure.lisp to consist only of the following code:
> With just the above being run in threads.pure.lisp and clos-add-remove-method.impure.lisp untouched I am always able to reproduce the failure. The error is:
>
> ========================================================
> fatal error encountered in SBCL pid 3362(tid 41989120):
> GC invariant lost, file "thread.c", line 205
This is a known issue. The problem is that on OS X we cannot safely
fork() after having spawned threads.
The minimal case to reproduce is to put just
(sb-thread:make-thread (lambda ()))
in threads.pure.lisp. The way tests are run is that "pure" files are
all run in the same image, after which each "impure" file is run by
forking beforehand.
If you try e.g. renaming threads.pure.lisp -> threads-1.impure.lisp,
you should be able to run most of the tests on OS X as well.
Cheers,
-- Nikodemus

On 27 February 2010 15:27, Glenn Tarcea <gtarcea@...> wrote:
> Thanks for the response. Why does a fork after spawning threads cause an issue on MacOSX?
> That is, not the internals to macos, but what issues do we see?
I added some mailing list archive links to the relevant bug on Launchpad:
https://bugs.launchpad.net/sbcl/+bug/310208
Reading them should clarify the issue a bit -- but independent
investigation is always good too, maybe my analysis is bogus?
Cheers,
-- Nikodemus

Thanks for the pointers. I'm going to poke around more to see what I can discover as well. I work mostly on MacOSX and would like to understand the issues around using threads in SBCL on it.
Thanks,
Glenn
V. Glenn Tarcea
gtarcea@...

I already submitted bug 528796 to Launchpad
https://bugs.launchpad.net/bugs/528796
but it was suggested I mention it here to help people having problems
trying to run MCLIDE 1.0b2 (a Lisp IDE on Mac OS X) with sbcl.
The symptom is that you can't run MCLIDE if you try to specify the sbcl
executable. The unobvious workaround is that in the startup dialog you
need to either
1) specify run-sbcl.sh instead of the binary executable, e.g.,
/usr/local/bin/sbcl
or
2) edit the file
~/Library/Preferences/com.in-progress.mclide/lisp-implementations-4
and in the line for sbcl remove the 'sh' in the command line that begins
with sh \"$exec\"
I believe that #2 has already been committed to the MCLIDE sources,
which then brings up sbcl bug 528796: For MCLIDE to work with
run-sbcl.sh after that change, run-sbcl.sh has to be marked executable
(e.g., chmod +x run-sbcl.sh) so it can be run from the command line
without preceding it with an 'sh'.
Bug 528796 requests that the shell script files which are not at the
moment marked executable in the CVS repository be made so, or if
Sourceforge makes that difficult to change, at least the scripts that
produce the distribution tarballs be modified to make the files
executable there.
Sidney Markowitz
http://sidney.com

I think I have an idea of what is going on, or at least I can get a little further. In the failure I've been seeing, perform_thread_post_mortem() is sometimes called in the child after a fork. At that point the thread no longer exists (a fork drops every thread except the one doing the fork). perform_thread_post_mortem() then calls:
gc_assert(!pthread_join(post_mortem->os_thread, NULL));
Since the thread that post_mortem->os_thread refers to doesn't exist in the child, pthread_join returns a non-zero value (probably ESRCH; pthread_join reports errors through its return value rather than errno, but I haven't verified the exact code yet).
That makes gc_assert() abort, which results in the test failure.
To work around this I made two changes:
1. When the thread_post_mortem structure is first created I added a new field called origpid and set it to getpid().
2. When perform_thread_post_mortem() is called it calls getpid() and compares that to origpid. If they are different then I don't bother calling any of the pthread_* routines in perform_thread_post_mortem(). I do however free up the post_mortem, and call os_invalidate(). I'm not sure if this is entirely the correct thing to do.
The upshot is that I get further in the tests, but now I'm running into a consistent hang in threads.impure.lisp. So... At this point I'm not sure if my "fix" has introduced other issues or if things are getting farther along and new issues are coming up.
So, my plan right now is: I'm backing my change out, rebuilding and I'm going to run threads.impure.lisp by itself and see what I can discover. I'll probably put my change back in a slightly different way when I've had a chance to go through the code in this area a little more (I need to verify that calling things like os_invalidate() when I have a non-existent thread is the right thing to do).
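For anyone following along, here is a rough, standalone sketch of the pid check described above. It is not the actual change to thread.c: only os_thread and origpid come from the description, while post_mortem_sketch, make_post_mortem, created_in_this_process, perform_post_mortem_sketch and fork_and_check are hypothetical names made up for illustration. os_invalidate() is not modelled, and a plain assert() stands in for gc_assert().

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the runtime's post-mortem record. */
struct post_mortem_sketch {
    pthread_t os_thread;
    pid_t origpid;              /* pid of the process that created the thread */
};

static void *worker(void *arg) { (void)arg; return NULL; }

/* Create a thread and remember which process created it. */
static struct post_mortem_sketch *make_post_mortem(void) {
    struct post_mortem_sketch *pm = malloc(sizeof *pm);
    if (pm == NULL)
        return NULL;
    pm->origpid = getpid();
    if (pthread_create(&pm->os_thread, NULL, worker, NULL) != 0) {
        free(pm);
        return NULL;
    }
    return pm;
}

/* True only in the process that created the thread, false in a forked child. */
static int created_in_this_process(const struct post_mortem_sketch *pm) {
    return getpid() == pm->origpid;
}

/* Join the thread only if it can exist in this process, then free the record. */
static void perform_post_mortem_sketch(struct post_mortem_sketch *pm) {
    if (created_in_this_process(pm)) {
        int rc = pthread_join(pm->os_thread, NULL);
        assert(rc == 0);        /* plays the role of gc_assert() */
        (void)rc;
    }
    free(pm);
}

/* Fork after creating a thread. Returns 0 if the child saw the pid mismatch
 * and the parent joined normally, -1 otherwise. */
static int fork_and_check(void) {
    struct post_mortem_sketch *pm = make_post_mortem();
    if (pm == NULL)
        return -1;

    pid_t child = fork();
    if (child < 0) {
        perform_post_mortem_sketch(pm);
        return -1;
    }
    if (child == 0) {
        /* Child: only inspect the record and exit, to stay on the safe
         * side of what is allowed after fork in a threaded process. */
        _exit(created_in_this_process(pm) ? 1 : 0);
    }

    int status = 0;
    int child_skips = waitpid(child, &status, 0) == child
                      && WIFEXITED(status) && WEXITSTATUS(status) == 0;
    perform_post_mortem_sketch(pm);   /* parent: the join is valid here */
    return child_skips ? 0 : -1;
}
```

The check only avoids joining a thread that cannot exist in the child; whether the rest of the cleanup is safe there is a separate question.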
Thanks,
Glenn
V. Glenn Tarcea
gtarcea@...

Maybe a better solution would be to switch OS X (or all platforms?) to run
the impure tests with run-program rather than fork off the current sbcl.
There should already be code in the test framework for this, currently only
enabled for win32.
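
At the C level the difference is roughly the one sketched below: instead of fork()ing an image that may already have threads, a brand-new process is started and waited for. This is only an illustration under assumptions, not the test framework's code; run_fresh_process is a hypothetical name, and the command line to run is left entirely to the caller.

```c
#include <assert.h>
#include <spawn.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/* Start argv[0] (looked up on PATH) as a new process and return its exit
 * status, or -1 if it could not be started or did not exit normally. */
static int run_fresh_process(char *const argv[]) {
    pid_t pid;
    int status = 0;

    assert(argv != NULL && argv[0] != NULL);

    /* posix_spawnp returns an error number directly; 0 means success. */
    if (posix_spawnp(&pid, argv[0], NULL, NULL, argv, environ) != 0)
        return -1;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The new process starts with a single thread and none of the parent's thread state, which is what forking from the image that ran the pure tests cannot guarantee.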
--
Juho Snellman
