Description

My system consists of an AMD FX-6300 6-core machine with 16 GB DDR3 RAM, with Haiku installed as the primary OS.
The machine randomly reboots and, more rarely, freezes after a few minutes when the system is under heavy load (compiling a large software project, using make -j 6).

I replaced the motherboard, CPU, RAM, power supply, etc., but the system remained unstable.
Finally, after playing with the safe mode settings, I found a workaround: enabling 4gb_memory_limit.
(This was a suggestion from korli on ticket #10279.)

In src/system/kernel/arch/x86/arch_vm_translation_map.cpp a decision is made based on this setting:
PAE paging is disabled and 32-bit paging is enabled, i.e. the paging method is switched from X86PagingMethodPAE to X86PagingMethod32Bit.
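The selection logic can be sketched roughly as follows. This is a hedged userspace sketch, not the actual code from arch_vm_translation_map.cpp; the function name choose_paging_method and its parameters are illustrative stand-ins.

```cpp
#include <cassert>

// Sketch of the decision described above: with the 4gb_memory_limit
// safe mode setting enabled, PAE is skipped and the 32-bit paging
// method is used even if the CPU supports PAE.
enum PagingMethod { kPagingMethod32Bit, kPagingMethodPAE };

static PagingMethod
choose_paging_method(bool cpuHasPAE, bool limit4GBMemory)
{
	if (!cpuHasPAE || limit4GBMemory)
		return kPagingMethod32Bit;	// addresses at most 4 GB of RAM
	return kPagingMethodPAE;		// 64-bit entries, > 4 GB physical
}
```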

Using the latter paging method, my system is stable: no random reboots, no random binary crashes, nothing.
I'm confident that the 4 GB memory limit itself is not what cures the problem, since I can also reproduce the system reboots when using only a single 4 GB RAM stick, so it must be the paging.

NOTE: Other people reported that 4gb_memory_limit does not help with the random binary crashes (as stated on ticket #10279 by kallisti5), but in my case it does make the difference.
Apparently the 4gb_memory_limit setting did NOT deactivate PAE paging in earlier days, but nowadays it does.

Change History (39)

One big difference between 32bit and PAE is the size of page_directory_entry and pae_page_table_entry (32 vs 64 bits). Now, these entries on x86 are not updated atomically, which might or might not be an issue (I heard that PAE usually requires cmpxchg64). It's only a hunch.
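The tearing concern can be illustrated with a small userspace sketch. This is not Haiku code: it only models the point that a plain 64-bit store on 32-bit x86 may be compiled as two 32-bit stores (so another CPU or the page walker could observe a half-updated entry), while std::atomic<uint64_t> forces a single atomic update (on i386 typically via lock cmpxchg8b).

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// A PAE page directory/table entry is 64 bits wide.
using pae_entry = uint64_t;

static void
set_entry_unsafe(pae_entry* entry, pae_entry value)
{
	*entry = value;	// may be split into two 32-bit stores on i386
}

static void
set_entry_atomic(std::atomic<pae_entry>* entry, pae_entry value)
{
	// Never tears: the compiler must emit a single atomic 64-bit store.
	entry->store(value, std::memory_order_release);
}
```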

> One big difference between 32bit and PAE is the size of page_directory_entry and pae_page_table_entry (32 vs 64 bits). Now, these entries on x86 are not updated atomically, which might or might not be an issue (I heard that PAE usually requires cmpxchg64). It's only a hunch.

/*static*/ void
X86PagingMethodPAE::PutPageTableInPageDir(pae_page_directory_entry* entry,
	phys_addr_t physicalTable, uint32 attributes)
{
	*entry = (physicalTable & X86_PAE_PDE_ADDRESS_MASK)
		| X86_PAE_PDE_PRESENT
		| X86_PAE_PDE_WRITABLE
		| X86_PAE_PDE_USER;
		// TODO: We ignore the attributes of the page table -- for compatibility
		// with BeOS we allow having user accessible areas in the kernel address
		// space. This is currently being used by some drivers, mainly for the
		// frame buffer. Our current real time data implementation makes use of
		// this fact, too.
		// We might want to get rid of this possibility one day, especially if
		// we intend to port it to a platform that does not support this.
}

My attempt was simply to call atomic_get_and_set64() in PutPageTableInPageDir(), but your cleanup is indeed nicer.
Unfortunately I couldn't test it yet, due to #13980, which currently makes it impossible for me to compile Haiku.

I'm trying waddlesplash's attempted fix for it, together with your patch -- hopefully I'll have a new build available in the next few hours and can confirm whether the PAE problem is fixed or not.

I've analyzed this further and checked that all locking / CPU pinning is done correctly, by comparing with the 64-bit implementation.
There is only one really interesting difference: X86VMTranslationMapPAE::QueryInterrupt().
In the 64-bit paging implementation, QueryInterrupt() simply calls Query(), whereas the PAE code has a specific implementation.
The Query() code for both 64-bit and PAE enforces thread CPU pinning, but the QueryInterrupt() code for PAE doesn't.

As I said before, I'm no expert in this area and can't judge whether this is problematic or not, but I just wanted to mention this difference.
I've observed that I always see "hda: Unsolicited response" in the syslog before the machine crashes, and that message comes from the HDA audio driver's IRQ handling.

What do the experts think? Could a missing ThreadCPUPinner in the QueryInterrupt() function cause such problems?
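For illustration, the pinning pattern that Query() uses (and that the PAE QueryInterrupt() omits) looks roughly like this. This is a hedged sketch with stubbed scheduler hooks, not Haiku's actual ThreadCPUPinner; the point is that an RAII guard keeps the thread from migrating to another CPU while it walks per-CPU-mapped page tables.

```cpp
#include <cassert>

// Hypothetical stand-ins for the kernel's scheduler hooks; a counter
// stands in for the real "is this thread pinned?" state.
static int sPinCount = 0;
static void pin_current_thread()   { ++sPinCount; }
static void unpin_current_thread() { --sPinCount; }

// RAII guard mirroring the ThreadCPUPinner pattern: pinned on
// construction, released on destruction, so the pin cannot be leaked
// on any early return.
class ScopedCPUPinner {
public:
	ScopedCPUPinner()  { pin_current_thread(); }
	~ScopedCPUPinner() { unpin_current_thread(); }
};

static bool
query_with_pinning()
{
	ScopedCPUPinner pinner;
	// ... walk the page tables; the thread cannot migrate here ...
	return sPinCount == 1;
}
```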

No idea about QueryInterrupt(). As for the HDA audio driver, you could blacklist it to check whether it is a source of problems; this can be done with a few drivers. A warning though: audio and network drivers run continuously and tend to exercise the system, which doesn't mean they are at fault (as with random crashes).

Some news on this item: the system is completely unstable (random reboots) with PAE paging enabled, and stable with 32-bit paging.
By now I've replaced the RAM and played with the memory settings in the BIOS, but none of it helps.

I went ahead and installed an x86_64 Haiku version, which uses 64-bit paging, and this is even more unstable.
The system randomly reboots after a few minutes, even when not doing heavy compilation work. Copying a file via scp or browsing with Web+ is already sufficient to trigger a reboot -- and as always, nothing in the syslog gives any hint.
Sometimes I saw a corrupted image on the display, e.g. a checkerboard pattern or just a plain color, which initially made me think it could be related to the graphics card.

Maybe a stupid question, but still: my machine uses an onboard Radeon GPU, and according to the BIOS settings part of the RAM is shared with it (e.g. 256 MB is mapped for the GPU). Does Haiku know about that? Do we have any means to detect that a certain portion of the RAM is dedicated to the GPU?

You can check the RAM size reported by the OS (for example in AboutSystem). It should either reduce the total size there or report some "inaccessible" RAM. If it doesn't, then indeed we have a problem.

Thanks for the suggestion. The inaccessible RAM stays at 1 MiB independent of the frame buffer size I choose in the BIOS, but the actual available RAM reflects the BIOS settings. The syslog also tells me that the radeon_hd driver found the correct frame buffer size, corresponding to the BIOS settings.

New theory: I'm starting to believe this is an intrinsic AMD FX instability that can be cured with microcode updates, as the system is stable under Linux, which includes microcode update facilities.
Haiku lacks the ability to update the CPU microcode.

I've started to port the Linux microcode updating facilities to Haiku, with partial success.
According to the syslog the microcode is now properly updated to the latest version available for my AMD Bulldozer (family 0x15) -- but the crashes that I observe on x86_64 remain.

There are a few possibilities:

My limited knowledge of this topic introduced a subtle bug, and the microcode is not correctly updated on all CPUs (e.g. perhaps I haven't properly protected/isolated the individual per-CPU updates from each other; I'm not sure whether that's actually needed, or whether x86_write_msr can safely be called as-is).

I update the microcode too late/early in the boot process.

Outdated microcode is not the cause of my random-reboot issue.

TODO:

Microcode updating only works for AMD family 0x15 CPUs, simply because I didn't bother extending it to anything else for now.

Not sure how to "package" the microcode blobs so that they're available early in the boot process -- for now I've hexdumped the amd-ucode.bin file and hardcoded it into the kernel; an ugly workaround for now.
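The per-CPU update step can be sketched in userspace with a stubbed MSR write. This is an assumption-laden sketch, not the actual port: x86_write_msr here is a stand-in that records writes instead of executing wrmsr, and the patch-loader MSR number 0xC0010020 is the one Linux's AMD microcode driver uses for family 0x15 (labeled here as an external assumption, not something from this ticket).

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// MSR_AMD64_PATCH_LOADER, as used by Linux's amd.c (assumption).
static const uint32_t kMsrPatchLoader = 0xc0010020;

// Stub: record MSR writes instead of executing the privileged wrmsr.
static std::map<uint32_t, uint64_t> sMsrLog;

static void
x86_write_msr(uint32_t msr, uint64_t value)
{
	sMsrLog[msr] = value;
}

// Would run once on each CPU, since every core must load the patch
// itself -- this is exactly the per-CPU isolation question raised above.
static void
update_microcode_on_this_cpu(const void* patchData)
{
	x86_write_msr(kMsrPatchLoader, (uint64_t)(uintptr_t)patchData);
}
```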

> I've started to port the Linux microcode updating facilities to Haiku, with partial success.
> According to the syslog the microcode is now properly updated to the latest version available for my AMD Bulldozer (family 0x15) -- but the crashes that I observe on x86_64 remain.

Sorry to read that. Is there maybe a pattern to when the crashes don't happen? Did you try blacklisting all your unneeded drivers (for instance USB ehci, xhci)? I know the crashes also exist on Intel.

> There are a few possibilities:
>
> My limited knowledge of this topic introduced a subtle bug, and the microcode is not correctly updated on all CPUs (e.g. perhaps I haven't properly protected/isolated the individual per-CPU updates from each other; I'm not sure whether that's actually needed, or whether x86_write_msr can safely be called as-is).

One thing missing is re-reading the CPU features after the microcode update.

> I update the microcode too late/early in the boot process.
>
> Outdated microcode is not the cause of my random-reboot issue.
>
> TODO:
>
> Microcode updating only works for AMD family 0x15 CPUs, simply because I didn't bother extending it to anything else for now.
>
> Not sure how to "package" the microcode blobs so that they're available early in the boot process -- for now I've hexdumped the amd-ucode.bin file and hardcoded it into the kernel; an ugly workaround for now.

I imagine the vendor-specific code would be better placed in src/add-ons/kernel/cpu/x86/amd.cpp (for AMD). Better to confirm that before going forward with the microcode update.

> I've started to port the Linux microcode updating facilities to Haiku, with partial success.
> According to the syslog the microcode is now properly updated to the latest version available for my AMD Bulldozer (family 0x15) -- but the crashes that I observe on x86_64 remain.

> Sorry to read that. Is there maybe a pattern to when the crashes don't happen? Did you try blacklisting all your unneeded drivers (for instance USB ehci, xhci)? I know the crashes also exist on Intel.

I already disabled all USB buses, the HDA driver, radeon_hd, etc.; it doesn't help.
The 4 GB RAM limit doesn't help either (whereas on x86-gcc2 it allows me to have a stable system).
Disabling SMP helps in both cases to get a stable system.

Both the Intel and AMD manuals state that without invalidation, even if the old entry is not cached in the TLB, the page walk may still see the old entry. That is, in the following pseudocode the second instruction (a load) can non-deterministically use either the old or the new mapping, and there must be an invalidation or TLB flush in between to guarantee the new page table entry is visible to the second instruction.
	mov [page table], new_mapping
	mov eax, [linear address using updated mapping]
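The hazard can be modeled with a toy simulation. This is not kernel code, only an illustration of the manuals' point: the "TLB" caches translations, and updating the page table alone does not evict the stale entry, so a later lookup may still return the old mapping until the entry is explicitly invalidated (invlpg on real hardware).

```cpp
#include <cassert>
#include <cstdint>
#include <map>

static std::map<uint64_t, uint64_t> sPageTable;	// virtual -> physical
static std::map<uint64_t, uint64_t> sTLB;		// cached translations

static uint64_t
lookup(uint64_t virtualAddress)
{
	auto cached = sTLB.find(virtualAddress);
	if (cached != sTLB.end())
		return cached->second;				// may be stale!

	uint64_t physical = sPageTable[virtualAddress];
	sTLB[virtualAddress] = physical;		// fill the TLB
	return physical;
}

static void
invalidate_page(uint64_t virtualAddress)
{
	sTLB.erase(virtualAddress);				// models invlpg
}
```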

I wonder if the Haiku paging code guarantees this.
Apparently we only call InvalidatePage() when the accessed bit is set:

	if ((oldEntry & X86_64_PTE_ACCESSED) != 0) {
		// Note, that we only need to invalidate the address, if the
		// accessed flags was set, since only then the entry could have been
		// in any TLB.
		InvalidatePage(address);
		...

X86VMTranslationMap::InvalidatePage() only records that a page is invalid; upon the next Flush(), the TLB is invalidated when fInvalidPagesCount > 0.
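The record-then-flush pattern just described can be sketched like this. This is a simplified stand-in for the actual Haiku classes (the class and member names here are illustrative), showing only the batching behavior: invalidations are accumulated and the TLB work happens in one go at flush time.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

class TranslationMapSketch {
public:
	// Only records the address; no TLB work happens here.
	void InvalidatePage(uint64_t address)
	{
		fInvalidPages.push_back(address);
	}

	// Returns true when a TLB invalidation was actually issued.
	bool Flush()
	{
		if (fInvalidPages.empty())
			return false;
		// Real code would invlpg each recorded address (or flush the
		// whole TLB when there are too many entries).
		fInvalidPages.clear();
		return true;
	}

private:
	std::vector<uint64_t> fInvalidPages;
};
```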

I'm currently traveling; when I'm back I will try always calling InvalidatePage(), in any case, just to see if it helps with my issue.
Obviously I'm still fishing in the dark, trying to understand what makes AMD Bulldozer special.

I looked at some other implementations, and they don't seem to work differently from ours.
This article speaks of page table mapping updates. The AMD PDF states that the flag X86_64_PTE_ACCESSED is set when speculation occurs. Interesting information; anyway, I suppose you can try whatever would make sense and prove a theory by testing.

I found a super easy way to trigger the reboots:
find / | xargs md5sum
(repeat fewer than 5 times)

Out of curiosity I went back to old hrevs. To make a long story short: hrev45681 reproducibly reboots spontaneously, and hrev45225 does not. Unfortunately there are no nightlies in between, so I will have to check all patches between hrev45225 and hrev45681 that touch the x86_64 kernel and see if I can find the culprit: fun ahead of me.

As bisecting is probably too difficult here, reverting patches individually may be of more use.

I started with the same approach, inspecting the git log between hrev45225 and hrev45681, identifying problematic patches, etc. (my list is even more complete than yours), but reverting them individually is in some cases problematic :-( I ended up with a kernel that didn't even boot, entering KDL early in the process. Reverting some patches required a lot of subtle fixup work in various places (e.g. removing B_RANDOMIZED_ANY_ADDRESS support), which took an hour, and once it compiled and booted properly, I could still easily trigger the reboot.

I decided to give up on this approach and instead started reproducing the nightly builds from five years ago on an older Linux host system, using vanilla RHEL6 images in a Singularity container to mimic a standard 2013 host system for the cross-compilation.

I now have the btrev<XY> from 2013 running, and the old hrevs as well. By now I've successfully reproduced the random reboot with a self-compiled hrev45681. My bisect is currently at hrev45560, and this one does not produce random reboots.

It would be interesting to see if you can reproduce any of these issues on KVM with AMD-V enabled. That way the page tables should mostly go through the real MMU rather than QEMU's MMU emulation, and if you can then trigger triple faults, it may be much easier to debug.

The issue in #14659 might be relevant here. The instant resets hint at a triple fault, i.e. the double fault handler failed. Due to the bug fixed in https://review.haiku-os.org/#/c/haiku/+/810 the double fault IST would be cleared for most CPUs. That bug would not explain the reason for the double fault itself, but if the fix makes the double fault handler work again, it might shed some light on the actual cause. It is therefore worth a try to retest this with change 810 applied.