Attached the syslog of the most recent spontaneous reboot during "jam -q haiku-image".

Btw, after the crash the filesystem seems to be in a weird state, even after running checkfs: I can't zip the syslog after first cp'ing it into my home directory. The system panics, complaining that the vnode already exists.

I observe the same symptoms regularly, and not only while building Haiku. Sometimes it occurs while building other programs like vim or mc, sometimes while running the configure scripts of those programs. At least one in three builds ends with this "unrequested" reboot.

Note that in about a week of working under Haiku GCC4 I have not observed this behavior at all. It looks like this is an issue with the GCC2 version only. At least on my hardware. ;-)

Please give hrev32073 a try. It doesn't fix the underlying issue, but it might prevent the triple fault (which such a reboot is) by better handling the double fault. Ideally you'll now be thrown into a functional KDL.

I have yet to try with your recent changes, but I have narrowed it down on my side to a workaround that fully removes the double/triple faults for me. If I change "vm_translation_map_arch_info::Delete()" in arch_vm_translation_map.cpp to always use the deferred_delete instead of the direct delete, the faults do not occur anymore. Is it possible that the translation map that is being freed there can still be in use? In that case overwriting it with deadbeef would explain everything going toast.

> I have yet to try with your recent changes, but I have narrowed it down on my side to a workaround that fully removes the double/triple faults for me. If I change "vm_translation_map_arch_info::Delete()" in arch_vm_translation_map.cpp to always use the deferred_delete instead of the direct delete, the faults do not occur anymore. Is it possible that the translation map that is being freed there can still be in use?

I don't see how that could happen. The vm_translation_map_arch_info objects are ref-counted. And the ref-counting scheme is extremely simple (feel encouraged to review):

The translation map creating the arch info objects owns the initial reference and frees it in destroy_tmap().

A CPU using an arch info has a reference to it. The CPUs' initial references to the kernel translation map arch info are acquired in arch_cpu_init_post_vm(). When the arch info changes (in arch_thread_context_switch()) the reference of the old arch info is released, and one acquired for the new arch info.
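The scheme described above can be sketched in user-space C++ as follows. This is only an illustration under stated assumptions: `ArchInfo`, `Cpu`, `cpu_init()`, `context_switch()` and `destroy_map()` are hypothetical stand-ins for the kernel structures and functions named in the text, the counter is a plain integer rather than an atomic, and a `deleted` flag replaces the actual freeing so the outcome can be inspected. The order "acquire the new reference, then release the old one" is a choice made for this sketch.

```cpp
#include <cassert>

// Hypothetical, simplified stand-in for vm_translation_map_arch_info.
struct ArchInfo {
    int ref_count = 1;     // initial reference, owned by the translation map
    bool deleted = false;  // set instead of freeing, so the sketch can be inspected

    void Get() { ref_count++; }

    void Put()
    {
        if (--ref_count == 0)
            deleted = true;  // the real code would free the page directory here
    }
};

// Hypothetical per-CPU state: a CPU holds one reference to the arch info it uses.
struct Cpu {
    ArchInfo* active = nullptr;
};

// Models the initial reference a CPU takes on the kernel arch info
// (described in the text as happening in arch_cpu_init_post_vm()).
void cpu_init(Cpu& cpu, ArchInfo& kernelInfo)
{
    kernelInfo.Get();
    cpu.active = &kernelInfo;
}

// Models the hand-over described for arch_thread_context_switch():
// one reference is acquired for the new arch info, the old one is released.
void context_switch(Cpu& cpu, ArchInfo& next)
{
    if (cpu.active == &next)
        return;

    next.Get();
    ArchInfo* previous = cpu.active;
    cpu.active = &next;
    previous->Put();
}

// Models destroy_tmap() dropping the translation map's initial reference.
void destroy_map(ArchInfo& info)
{
    info.Put();
}
```

In this model an arch info can only be marked deleted once neither its translation map nor any CPU holds a reference, which is the invariant the comment above relies on.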

I'm sorry to say, but even with an updated kernel including all your changes it still triple-faults :-(. Tracing output from just before the reboot clearly shows that the structures the scheduler uses are corrupted (as was to be expected). I will now try to review the translation map issue. Of course it's possible that something leading up to there messes things up.

> I'm sorry to say, but even with an updated kernel including all your changes it still triple-faults :-(.

You could add a while (true); at the beginning of x86_double_fault_exception() (in arch_int.cpp) to verify that the double fault handler is at least reached.

Unfortunately there has to be some trade-off between safely catching the double fault and still being able to get useful info in the kernel debugger (or being able to enter the kernel debugger at all). If the basic VM, CPU, ICI, or kernel debugger structures have been corrupted, the odds are that a double fault will end in a triple fault or an infinite exception loop. With some more work we could push the limit a bit further. Given how annoying triple faults are to debug, that might even be worth it.

> Tracing output from just before the reboot clearly shows that the structures the scheduler uses are corrupted (as was to be expected). I will now try to review the translation map issue. Of course it's possible that something leading up to there messes things up.

Yeah, e.g. a corrupted/deleted thread or team structure could theoretically cause any kind of damage, though usually things just crash earlier and without double-faulting in such a case.

Fixed in hrev32118. I hope I described it well enough in the commit message. In any case, this could be fixed in different ways. For example, one could explicitly set the kernel page dir in the currently unused arch_vm_aspace_swap() function. Or one could simply read out cr3 on deletion and reset it to the kernel page dir when the about-to-be-deleted page dir is detected (that's how I debugged this issue in the end). Feel free to suggest/implement other solutions as you see fit.
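The second alternative (checking cr3 on deletion) could look roughly like the sketch below. It is a user-space simulation, not the code from hrev32118: `read_cr3()` and `write_cr3()` operate on a plain variable instead of the real register, `kKernelPageDir` is a made-up address, and `reset_cr3_if_in_use()` is a hypothetical helper name. The sketch also assumes 32-bit non-PAE paging, where the page directory address occupies the upper 20 bits of cr3.

```cpp
#include <cassert>
#include <cstdint>

// Simulated cr3 register; real kernel code would use privileged instructions.
static uint32_t sSimulatedCr3 = 0;

// Made-up physical address of the kernel page directory (assumption).
static const uint32_t kKernelPageDir = 0x00001000;

// With 32-bit non-PAE paging the low 12 bits of cr3 hold flags, not address bits.
static const uint32_t kPageDirMask = 0xfffff000;

uint32_t read_cr3()
{
    return sSimulatedCr3;
}

void write_cr3(uint32_t value)
{
    sSimulatedCr3 = value;
}

// Hypothetical helper, called right before a page directory is freed.
// If this CPU is still running on the dying page directory, switch it to the
// kernel page directory first. Returns true if such a switch was necessary.
bool reset_cr3_if_in_use(uint32_t dyingPageDir)
{
    if ((read_cr3() & kPageDirMask) != (dyingPageDir & kPageDirMask))
        return false;

    write_cr3(kKernelPageDir);
    return true;
}
```

A return value of true is what makes this useful as a debugging aid: it indicates that a page directory was about to be freed while still loaded on the current CPU.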

One could consider moving the interrupt disabling/enabling into the assembly code as well - at least other architectures don't need to call this with interrupts disabled, and it would also be slightly faster (depending on the compiler, that is).

> One could consider moving the interrupt disabling/enabling into the assembly code as well - at least other architectures don't need to call this with interrupts disabled, and it would also be slightly faster (depending on the compiler, that is).

Well, both calls take place in architecture-dependent code, so other archs aren't affected. I thought about disabling interrupts from the assembly code, but seeing that it takes more than just a line, and that disable_interrupts()/restore_interrupts() are implemented as inline functions wrapping inline assembly, I thought it wasn't really necessary to duplicate it.
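For readers unfamiliar with the pattern being discussed, the C++-side variant (disable interrupts, call the assembly routine, restore the previous state) can be modeled as below. Everything here is a simulation under assumptions: the `_sim` functions only toggle a flag and are not the kernel's disable_interrupts()/restore_interrupts(), and `switch_page_dir_sim()` is a hypothetical placeholder for the assembly routine, whose real name is not given in this thread.

```cpp
#include <cassert>

// Simulated interrupt flag; the kernel functions manipulate the CPU state instead.
static bool sInterruptsEnabled = true;

typedef int cpu_status_sim;

// Disables (simulated) interrupts and returns the previous state.
cpu_status_sim disable_interrupts_sim()
{
    cpu_status_sim previous = sInterruptsEnabled ? 1 : 0;
    sInterruptsEnabled = false;
    return previous;
}

// Restores the state returned by disable_interrupts_sim().
void restore_interrupts_sim(cpu_status_sim state)
{
    sInterruptsEnabled = state != 0;
}

// Records whether the placeholder routine ever ran with interrupts enabled.
static bool sRanWithInterruptsEnabled = false;

// Hypothetical placeholder for the assembly routine under discussion.
void switch_page_dir_sim()
{
    if (sInterruptsEnabled)
        sRanWithInterruptsEnabled = true;
}

// The caller-side pattern: the critical call is bracketed by disable/restore,
// and the previous state is restored rather than unconditionally re-enabled.
void switch_with_interrupts_disabled()
{
    cpu_status_sim state = disable_interrupts_sim();
    switch_page_dir_sim();
    restore_interrupts_sim(state);
}
```

Restoring the saved state, instead of always enabling interrupts afterwards, keeps the pattern safe when the caller already had interrupts disabled.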