29 April 2013

Update 4: I found the receipts for one pair of sticks and took them to MSY in Melbourne -- they were replaced on the spot, no questions asked. Very happy.

Update 3: The errors were all due to three bad RAM sticks. Using only the one good stick, everything works fine. That's 24 GB of bad RAM...this won't be cheap if I can't find the receipts...

Update 2:
Running memtest86 I caught lots of errors (51 in 50 minutes) before I killed the test. I'm currently testing each stick one by one. I'm hoping that what look like RAM errors are actually caused by inappropriate BIOS settings, because 32 GB of RAM is not cheap to replace...

While I'm swapping RAM sticks I'm also testing a separate set of sticks on a different box. If they are error free it will be interesting to see if they trigger errors on the troublesome node. I'm still hoping that the BIOS is the culprit...

So far three out of four tested sticks have shown errors -- they all happen during test #6. The fourth stick has passed all tests seven times.

I purchased This MB to run with the AMD FX 8150. I have built computers from high end to low end and know the ones in the middle last the longest and are the most stable.
[..]
At this point the fun of the build is gone, and I have too many hours dealing with problems.

And that's not the only negative FX8?50 + 990FX review.

The worst part of it is that I've been thinking about building another, identical node (good value for money) as well as recommending my build to a student who is about to do calcs.

Mind you, I've only ever had issues when it comes to compiling the kernel -- it's been solid when it comes to running calculations.

Original post:

NOTE: this is NOT a solution. Just observations.

My AMD FX 8150 is a great CPU -- it makes up the heart of the fastest of my computational nodes, and is eminently affordable. It does, however, cause me grief in one respect -- I can't compile the Linux kernel.

The fact that the errors keep changing might also be pointing towards there being a hardware fault with my CPU, rather than with FX 8150 in general.

3.8 built fine twice, and crashed the third time. 3.8.10 crashed twice, then built fine the third time.

It all sounds like I'm having hardware issues...but they only seem to be triggered during kernel builds. During 'normal use' (i.e. using 100% CPU for weeks at a time) it is perfectly stable. Compiling e.g. nwchem (another pretty heavy compile) also goes absolutely fine.
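Since the failures are intermittent, it can help to repeat the same build several times and count how often it breaks. The following is only a rough sketch of that idea: the function name repeat_build and the build-N.log file names are made up for illustration and are not something I used at the time.

```shell
#!/bin/sh
# repeat_build N command [args...]
# Runs the given command N times, saves each run's output to build-<i>.log
# in the current directory, and prints a tally of successes and failures.
# (Hypothetical helper; the name and the log file names are assumptions.)
repeat_build() {
    runs=$1
    shift
    ok=0
    failed=0
    i=1
    while [ "$i" -le "$runs" ]; do
        if "$@" > "build-$i.log" 2>&1; then
            ok=$((ok + 1))
        else
            failed=$((failed + 1))
        fi
        i=$((i + 1))
    done
    echo "ok=$ok failed=$failed"
}
```

Run from inside a kernel source tree, something like repeat_build 3 sh -c 'make clean && make -j8' would do three full builds (the -j8 is just an assumption matching the eight cores of the FX 8150). The logs of the failed runs can then be compared to see whether the error moves around.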

Troubleshooting something like this also wouldn't be easy. See the end of the post for a list of the various errors that I was getting during compilation of different kernel versions.

I unzipped it with 7z, giving me 990FXAD3.F8 -- I then put that file in the root of a USB stick.

I've tried with a number of USB sticks, including a blank stick formatted as W95 FAT32, and keeping the stick plugged in before rebooting.

In Q-flash, I always ended up with a prompt saying Floppy A <Drive>, and when I hit enter it says '.. <dir>'. 0 Files found. Yet it also said Total size 7.48G, Free Size: 7.44 G, which matched the size of the USB stick.

Finally I managed to get it to work:
* In fdisk I created only a 1 GB partition on the USB stick, set its type (t) to 6 (FAT16), made it bootable, and wrote the changes to disk.
* I then ran mkdosfs -F 16 /dev/sdb1 (my USB stick was /dev/sdb).
* I then copied the 990FXAD3.F8 file to the root of the USB stick (after mounting it, of course), and THAT worked.
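The steps above can be sketched as a script. Note that I did the partitioning interactively in fdisk; the sketch swaps in sfdisk so that it can be scripted, which is an assumption on my part and not what I actually ran. The device name, mount point and the run/prepare_stick helper names are also assumptions. By default the script only prints the commands, because running them against the wrong device wipes it.

```shell
#!/bin/sh
# Assumptions: the stick is /dev/sdb (check with lsblk first!), partitions are
# named /dev/sdb1 etc., and /mnt/usb is a free mount point.
DEV="${DEV:-/dev/sdb}"
BIOS_FILE="${BIOS_FILE:-990FXAD3.F8}"
MNT="${MNT:-/mnt/usb}"

# Print each command by default; set DRY_RUN=0 (and run as root) to execute.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

prepare_stick() {
    # One 1 GiB primary partition, MBR type 6 (FAT16), flagged bootable.
    run sh -c "printf 'label: dos\n,1GiB,6,*\n' | sfdisk $DEV"
    # FAT16 file system on the new partition.
    run mkdosfs -F 16 "${DEV}1"
    # Mount it, copy the BIOS image to the root of the stick, unmount.
    run mkdir -p "$MNT"
    run mount "${DEV}1" "$MNT"
    run cp "$BIOS_FILE" "$MNT/"
    run umount "$MNT"
}

prepare_stick
```

Everything on the device named in DEV is destroyed once DRY_RUN=0 is set, so it is worth reading the printed commands first.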

Memtest86
Because RAM has traditionally been a major culprit behind hardware errors (especially the random, difficult-to-diagnose type) it's always a good idea to run a memtest. To do that, install memtest86+ (sudo apt-get install memtest86+) and reboot. There should be a new menu item (scroll down) in GRUB. Memtest takes quite a while, especially if you have a lot of RAM (32 GB...).
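Before rebooting it may be worth checking that the install really added a boot entry. The following is a minimal sketch, assuming the generated GRUB config lives at /boot/grub/grub.cfg (the usual place on Debian-type systems); the function name has_memtest_entry is made up.

```shell
#!/bin/sh
# has_memtest_entry [path-to-grub.cfg]
# Prints any menu entries mentioning memtest and succeeds if there is one.
# (Hypothetical helper; the default path is an assumption and reading it
# may require root on some systems.)
has_memtest_entry() {
    cfg="${1:-/boot/grub/grub.cfg}"
    grep -i 'menuentry.*memtest' "$cfg"
}
```

If nothing is printed, running sudo update-grub and checking again is a reasonable next step, e.g. has_memtest_entry || sudo update-grub.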

I counted 51 errors before killing the test (time to identify the bad stick). Many of these occurred in a more limited address space than those shown above. Sigh...the RAM was the most expensive part of this build...

According to this there's a slight chance that the RAM might be ok, but it's still not a good sign.

I've tested each stick by itself -- so far 3 out of 4 sticks have yielded errors during test 6. I did seven passes on the fourth stick and saw no errors.

The outcome
However, even with the new BIOS the kernel compiles still fail -- it takes longer for them to fail, but they fail.
I do see the odd thing in dmesg though:

CC [M] fs/nfs/inode.o
In file included from include/net/scm.h:6:0,
from include/linux/netlink.h:8,
from /home/me/tmp/linux-3.8.10/include/uapi/linux/neighbour.h:5,
from include/linux/netdevice.h:51,
from include/linux/icmpv6.h:12,
from include/linux/ipv6.h:59,
from include/net/ipv6.h:16,
from include/linux/sunrpc/clnt.h:26,
from fs/nfs/inode.c:26:
include/linux/security.h:2581:1: internal compiler error: Segmentation fault
Please submit a full bug report,