gcc segmentation faults on Ryzen / Linux

As a software guy, I compile a lot of code, and occasionally gcc crashes with a segmentation fault for no obvious reason. I seem to remember that the problem also sometimes manifested as illegal instruction errors, but I'm not sure about that anymore. I have a Ryzen 1800X CPU and an Asus Prime B350-Plus mainboard with UEFI BIOS 0609 (latest). My RAM is on the QVL and running at 3200 MHz, but that shouldn't matter.

I'll summarize it: different people, different gcc versions, different optimization levels, different software compiled, different RAM clocks including very low ones, different Ryzen models and mainboard models. Some of them tried swapping several pieces of hardware to no avail.

I have little to add: I can reproduce the segfaults on Ubuntu 17.04. And nothing else crashes for me after the latest UEFI + AGESA update.

Mean time between crashes is about an hour when compiling continuously.

I think you should try hard to reproduce and fix this at AMD. Compiling anything on Linux with gcc while using all CPU threads should suffice.

It was not me who opened the service request. Somebody in that Gentoo forum thread apparently did, but it's not public and there was no news when I looked. I wanted to ensure that somebody takes care of it because it seems important.

As a workaround, try to disable either SMT or the uOP cache via the CMOS setup of your mainboard. For your workload the latter will probably give you the smaller performance hit, but I don't know whether that specific setup item is exposed by your ASUS board.

In any case, AMD is most probably already aware of potential underlying issues.

Here is another victim of this problem, with a Ryzen 1600. In Gentoo, just two parallel emerges (e.g. gcc in one shell and mesa in another) with all cores used (-j13) trigger the problem, with the compilation suddenly failing with (usually) the following text in dmesg

Here, neither SMT nor any other BIOS setting changes a thing. Today the Ryzen machine compiled 1290 packages and had 15 segfaults, which means 1 segfault every 86 compilations. So this machine is "unusable", and it should be a "working machine".
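For anyone who wants to compare their own numbers, the figure above is simple arithmetic. This is only a sketch; the function name is made up for illustration, and the inputs are the counts quoted in this post.

```python
def compilations_per_segfault(packages: int, segfaults: int) -> float:
    """Average number of package builds per observed segfault."""
    if segfaults == 0:
        raise ValueError("no segfaults observed, rate is undefined")
    return packages / segfaults


# Counts reported above: 1290 packages compiled, 15 segfaults.
rate = compilations_per_segfault(1290, 15)
failure_share = 15 / 1290

print(rate)                     # 86.0
print(f"{failure_share:.2%}")   # 1.16%
```

So roughly 1.2% of package builds failed on that day's run.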

I opened a ticket and I am waiting for an answer, but my main interest is in understanding what is causing this problem: after having tried 4 MBs and 3 RAM kits, and after having bought a brand new PSU just in case the old one was the culprit, I don't know what to do. Is it a CPU problem? Should users of Ryzen CPUs with this problem RMA their CPUs? Is it fixable with an AGESA update? Or is there no solution at all atm?

Not all BIOSes have the OPCache option, e.g. my Gigabyte K7 doesn't. Disabling SMT alleviates the problem, but it doesn't solve it. The same goes for the LLC I cited, which greatly reduces the segfaults, but they are still present.

The problem is not the time needed to solve the issue ("patience"), but whether there will ever be a solution. I am developing a sense of "this is how it is".

I am having exactly the same issues as the original poster (and the numerous others who have posted on the Gentoo forum linked in the original message). I do not have an OP Cache setting in my motherboard (MSI X370 Gaming Pro Carbon), so I am not able to disable it. Turning off SMT does not fix the problem.

I have tried various combinations of the following with little to no effect:

While the problem encountered is 'random' segmentation faults in that they do not occur in any fixed memory address or particular part of a compile, the system will very consistently crash / segfault in any highly multi-threaded process that uses a lot of RAM. To reproduce the issue, I simply loop through compiling mesa 17.0 with -j16 and the build directory mounted to tmpfs (i.e. a ramdisk location for the build files). If I make it past 10 minutes without a segfault it's a lucky run.
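A reproduction loop like the one described can be sketched in a few lines of Python. This is not the exact script used in the thread; `run_build_loop` is a hypothetical helper, and the build command you pass in is an assumption about how your source tree is built.

```python
import subprocess


def run_build_loop(command, iterations, cwd=None):
    """Run `command` repeatedly and collect every failed iteration.

    Returns a list of (iteration, returncode) tuples. Any non-zero exit
    status counts as a failure; a compiler segfault normally surfaces as
    a non-zero exit from the build tool rather than as a signal.
    """
    failures = []
    for i in range(1, iterations + 1):
        result = subprocess.run(
            command,
            cwd=cwd,
            stdout=subprocess.DEVNULL,  # build output is not needed here
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            failures.append((i, result.returncode))
    return failures
```

Assuming a configured build tree that builds with plain make, a call such as `run_build_loop(["sh", "-c", "make clean && make -j16"], iterations=20, cwd=build_dir)` would rebuild 20 times and report which runs failed, where `build_dir` stands for wherever the tree lives (for example a tmpfs mount). Checking dmesg afterwards shows whether a failure was a segfault or an ordinary build error.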

I can't monitor CPU temperatures within Linux yet, but this does not appear to be heat related - cool ambient temperatures with the case open and a room fan blowing directly into the case did not increase the stability to any noticeable degree (and the CPU temperatures in Windows running prime95 with 16 workers stay reasonable).

Note that this problem is not limited to compilation tasks in Linux - prime95 will throw errors as well. It's just much less frequent (e.g. where compiling mesa in a ramdisk will segfault in minutes, prime95 can go for a few hours before complaining.)

I would really appreciate a response as "This question is Assumed Answered." is not true. The problem exists, and even if disabling SMT "fixed" it (which I'll repeat - it doesn't) that isn't an answer.

EDIT: I forgot to mention I also tried each of my memory sticks (2x8GB) independently without any improvement. If one stick were bad, you would expect to see segfaults with that stick but not the other. Both sticks together and each independently all display the same behaviour. (Note that I haven't tried every combination of settings with every permutation of memory installed - just the default settings with the single DIMMs.)

Additionally, memtest86 will run through at least two cycles without error even with the RAM set at 3200MHz.

Since Gentoo is a source-based distribution, a significant amount of time setting up and/or updating the system involves compiling software packages, which increases the impact of this bug significantly for that community and is likely why you see the most comments from Gentoo users. I actually use Ubuntu as my primary OS, but I am able to reproduce the problem most consistently under Gentoo so that is what I have used to try out various BIOS tweaks. That said, to rule out OS-specific issues I did a test compile of gcc under Ubuntu 16.04 and was able to reproduce the problem (again, using make -j16). Under Ubuntu I didn't set the build directory up in a tmpfs mounted file system, but even running from my SSD and not from RAM I still can't consistently get through the full compile. I did once get it to compile twice in a row without a segfault, but that's the exception (and still not acceptable...)

To rule out a "Linux-specific" issue, I ran prime95 with 16 threads under Windows 10 to see if that was stable. As noted earlier it was not (although earlier I failed to mention I was running prime95 in Windows). Prime95 does run successfully for significantly longer than a multi-threaded compile, however, so I have not been using that to test BIOS settings.

To be fair to amdmatt's suggestions, disabling SMT is the one thing that makes the biggest difference for my system stability while compiling. My test compile of mesa-17.0 was able to successfully complete nearly 14 times in a row before crashing with SMT disabled and make reduced to -j8. That said, I still don't feel that this is an acceptable solution - I didn't buy a 16-thread processor to run it with half the threads disabled (and even then not be 100% sure that it's not going to crash, or corrupt my data).