I've got a similar problem on x86-64 Linux, but only with GHC-compiled code, not
with ghci.

I tried and I can't reproduce the error here on x86-64/Linux (Ubuntu 10.04). I also ran the program with Valgrind and found no errors (reading uninitialized data is a common cause of things that fail on one platform but not another).

So I'm stumped, but we should treat this as a high priority. It could be related to #4303. Ian, could you try to repro on your Mac?

Just to be sure, I've pulled a fresh copy of GHC and re-run validate; as
before, test 4038 failed. FYI: I'm running Gentoo with GCC 4.5.2,
bootstrapping with GHC 7.0.2.

This is odd. On OS X 10.6.6 (64 bit), ghc-HEAD from today still only has 4038 failing under ghci.
Perhaps I'll try valgrind as well, to see if I can get a clue why we're running off the end of the stack.

Has anyone observed this error on a 32-bit platform? If not, the underlying bug is probably similar to #4970, in which a 32-bit value was copied to the lower half of a 64-bit location without the upper half being zeroed or sign-extended. Unfortunately, the original error may happen some time before the ultimate crash.

I guess the good news is that this doesn't seem to be an OS X linker bug.

I've now seen the bug in the compiled 4038.hs. It's much easier to see what's going on in this case.

What's happening is that we're simply blowing past the end of our process's C stack. I can see this in gdb. I've set a breakpoint at stg_makeStablePtrzh, since this is the function in which the memory error occurs in the compiled code. When the breakpoint is first hit, $rsp (the C stack pointer) is at 0x7fff5fbfb530. When the crash occurs, $rsp is pointing to 0x7fff5f3ffa70, just under 8 MB lower.
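
A quick sanity check of those two addresses (a ghci session; my gloss, not part of the original report) confirms that the stack pointer moved by just under the usual 8 MB stack limit:

    Prelude> 0x7fff5fbfb530 - 0x7fff5f3ffa70
    8370880
    Prelude> 8 * 1024 * 1024
    8388608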

Oh, that makes sense. On x86-64, RESERVED_C_STACK_BYTES = 2048 * SIZEOF_LONG = 2048 * 8 = 16384 = 16 kB. Dividing the 8 MB stack by that, 8192/16 = 512, which is roughly the number of iterations the program runs before crashing. Except they're not loop iterations: they are nested C calls, each of which gets a whopping 16 kB of stack. So wouldn't this be expected behavior?
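
For concreteness, here is that arithmetic spelled out in Haskell (the names are mine, not from the RTS):

    -- Back-of-the-envelope: how many nested C re-entries fit in the stack?
    stackLimit, perCall, maxDepth :: Int
    stackLimit = 8 * 1024 * 1024          -- default 8 MB process stack
    perCall    = 2048 * 8                 -- RESERVED_C_STACK_BYTES on x86-64
    maxDepth   = stackLimit `div` perCall -- = 512, close to the observed depth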

I'll hypothesize that the reason the test doesn't fail on x86 is that there SIZEOF_LONG = 4, which means each call eats half as much stack. So the 4038 test should also fail on a 32-bit machine when modified with main = f 2000 >>= print.
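
For readers without the testsuite at hand, here is a minimal sketch of the shape of such a test (nested Haskell -> C -> Haskell calls via a "wrapper" import); it is emphatically not the real 4038.hs:

    {-# LANGUAGE ForeignFunctionInterface #-}
    -- A sketch only: each re-entry into Haskell goes through StgRun and
    -- reserves a fresh RESERVED_C_STACK_BYTES chunk of the C stack.
    import Foreign.Ptr (FunPtr, freeHaskellFunPtr)

    type Fun = Int -> IO Int

    foreign import ccall "wrapper" mkFun   :: Fun -> IO (FunPtr Fun)
    foreign import ccall "dynamic" callFun :: FunPtr Fun -> Fun

    f :: Int -> IO Int
    f 0 = return 0
    f n = do
      fp <- mkFun f            -- wrap f as a C function pointer
      r  <- callFun fp (n - 1) -- leave Haskell, then re-enter it through C
      freeHaskellFunPtr fp
      return (r + 1)

    main :: IO ()
    main = f 2000 >>= print    -- deep enough to overflow on 32-bit as well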

More information. Great chunks of stack are being gobbled by StgRunIsImplementedInAssembler (in rts/StgCRun.c). I set a watchpoint on $rsp and ran between two hits of the stg_makeStablePtrzh breakpoint. Here's the interesting piece:

It doesn't seem like there is a bug here at all: if we make recursive calls to C functions through the FFI, we will eventually run out of stack space. I suppose I knew, but didn't really believe, that 16 kB of stack is reserved for each C call.
I'll ask whether there is some subtlety here. I don't understand why the test passed earlier on 64-bit OS X, failing only under ghci. 1000 iterations should have pushed almost 16 MB onto the stack, twice the allocated space. Why didn't that fail?

If the fix is to reduce the number of times f is called in 4038.hs,
I will submit a patch to do that.

The 0x4038 is right (although I'm not sure why it needs to be as large as rts/StgCRun.c describes). It comes from RESERVED_C_STACK_BYTES + 48 + 8 = 16384 + 48 + 8 = 16440 = 0x4038.
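
The coincidence that the constant matches the test number is easy to verify in ghci:

    Prelude> 16384 + 48 + 8
    16440
    Prelude> Numeric.showHex 16440 ""
    "4038"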

Yes, we seem to be working on this at the same time.

The mystery is why this ever worked. Test 4038 only failed for ghci when I was doing tests to verify my linker patches. Was there simply more than 8 MB of valid memory in the stack area, or is something else going on?

I'm certain I also saw 4038 pass the non-ghci tests, but now it fails when I run the test manually. I'm thinking that there's some kind of optimization that somehow gets rid of the nested calls, and we're missing it now because we're using debug builds. Of course, that doesn't explain why the tests pass on x86-64 Linux.

I may have unraveled part of that. Unless I'm missing something, it seems the testsuite is reporting invalid passes. Running make TEST=4038, 6 tests pass and 1 fails. When I run the passing tests manually (just as they appear in the output of make TEST=4038) from the command line, every one of them segfaults.

I tried it again on OS X 10.6.6 this morning. Now every 4038 test fails:

Does a stack overflow have to lead to a bus error? Couldn't it lead to "C stack overflow" or something?

Simon

Plain old C just segfaults when you run off the end of the stack. An 8 MB block is allocated at the top of the virtual address space, and accesses below (top - 8 MB) usually land in unallocated memory, generating a segfault. (The .NET framework has a fancier runtime that can check the stack pointer and throw a stack-overflow exception.)
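
For anyone who wants to check that limit on their own machine, something like the following sketch (querying getrlimit through the unix package; not part of the ticket) will print it:

    -- Query the process stack limit via getrlimit (unix package).
    import System.Posix.Resource

    describe :: ResourceLimit -> String
    describe ResourceLimitInfinity = "unlimited"
    describe ResourceLimitUnknown  = "unknown"
    describe (ResourceLimit n)     = show n ++ " bytes"

    main :: IO ()
    main = do
      ResourceLimits soft hard <- getResourceLimit ResourceStackSize
      putStrLn ("soft stack limit: " ++ describe soft) -- typically 8 MB
      putStrLn ("hard stack limit: " ++ describe hard)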

It's still strange that the test has sometimes worked, particularly when compiled rather than interpreted. I'm wondering whether, in addition to the stack itself, there's more memory allocated at high virtual addresses. Then when we run 4038 we access memory we shouldn't, but since we're not writing much to the stack, the test (and the process) ends before the resulting memory corruption causes a crash.

I'm scratching my head over this one, but I'm happy that this isn't
another OS X linker bug.

16 kB is the area we allocate for spilling temporary values during execution of Haskell code. In the future we might be able to use the ordinary Haskell stack for this, but that's a long way off. It is 16 kB because GHC sometimes encounters a particularly large basic block that needs a lot of temporary storage.

So it's still a mystery why sometimes the test doesn't fail. Perhaps the memory below the stack is allocated to another thread, and as Greg suggests we might get away with it because we're not writing to a part of the stack that the other thread is using.

The attached patch reduces the number of nested calls in test 4038 to 300 and explains why. Interestingly, with 400 calls I still got a segfault when running the ghci test; I guess ghci grabs more of the C stack for other purposes.
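
The arithmetic behind the choice (my gloss, not from the patch itself) is easy to check:

    Prelude> [ (n, n * 16384) | n <- [300, 400, 512] ]
    [(300,4915200),(400,6553600),(512,8388608)]

300 levels use about 4.7 MB, leaving real headroom under the 8 MB limit; 400 levels use 6.25 MB, apparently already too close once ghci's own C-stack consumption is added.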

The other-thread hypothesis would also explain why the tests pass consistently in my testsuite runs but fail when I run them from the shell. Should we open a new ticket for the testsuite scripts, with the goal of detecting bad memory accesses more reliably?