I understand folks don't want to debug processes that are "Tainted". The log above shows one process (568) as "Not Tainted"; the other (3339) is "Tainted". Is "dd" really Tainted, or am I just interpreting this wrong?
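
For context, the "Tainted" marker in an oops describes the state of the running kernel, not the process named in the report. One of the taint flags (D) is set once the kernel has oopsed, which may be why a later report shows "Tainted" when the first did not. Below is a minimal Python sketch for decoding the bitmask found in /proc/sys/kernel/tainted. The table covers only the lower bits, and the function name decode_taint is made up for illustration.

```python
# Decode a kernel taint bitmask (the integer in /proc/sys/kernel/tainted).
# Only the lower bits are listed; newer kernels define a few more.
TAINT_FLAGS = [
    (0, "P", "proprietary module loaded"),
    (1, "F", "module force-loaded"),
    (2, "S", "kernel running on an out-of-spec system"),
    (3, "R", "module force-unloaded"),
    (4, "M", "machine check exception occurred"),
    (5, "B", "bad page referenced"),
    (6, "U", "taint requested by userspace"),
    (7, "D", "kernel died recently (OOPS or BUG)"),
    (8, "A", "ACPI table overridden"),
    (9, "W", "kernel issued a warning"),
    (10, "C", "staging driver loaded"),
    (11, "I", "workaround for a platform firmware bug applied"),
    (12, "O", "out-of-tree module loaded"),
    (13, "E", "unsigned module loaded"),
]


def decode_taint(mask):
    """Return a list of (letter, description) for every bit set in mask."""
    return [(letter, desc) for bit, letter, desc in TAINT_FLAGS if mask & (1 << bit)]
```

To use it, read the integer from /proc/sys/kernel/tainted and pass it in. A value of 0 means the kernel is not tainted.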

CONFIG_HZ_1000=y is known to cause problems on some hardware. It's not needed on a headless system either.
Try 100Hz instead.

You also have several debug options on in your kernel, I did not check them all. Debug options always cause logspam and sometimes interfere with normal operation.
Debug options should only be on if you are debugging that part of the kernel.
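
A quick way to see what is actually switched on is to scan the kernel .config. This is a rough Python sketch, not a definitive tool: the function names are invented here, and it simply treats any enabled option with DEBUG in its name as a debug option, which is an assumption that may catch a few harmless ones.

```python
def parse_kconfig(text):
    """Return {option: value} for every option that is set in .config text."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skips comments, including "# CONFIG_FOO is not set"
        name, sep, value = line.partition("=")
        if sep:
            options[name] = value
    return options


def timer_and_debug(options):
    """Return (CONFIG_HZ value or None, sorted list of enabled DEBUG options)."""
    hz = options.get("CONFIG_HZ")
    debug = sorted(
        name for name, value in options.items()
        if "DEBUG" in name and value in ("y", "m")
    )
    return hz, debug
```

Feed it the contents of /usr/src/linux/.config (or wherever your kernel source lives) and look at what comes back before rebuilding.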

While you are fixing your kernel timer, turn off all the debug stuff too.

Regards,

NeddySeagoon

Computer users fall into two groups:-
those that do backups
those that have never had a hard drive fail.

Not having swap does not stop the kernel swapping; it just robs the kernel of the ability to move dynamically allocated RAM to disk.
The kernel will still swap by discarding from RAM data or code that has a permanent home on disk, then reloading it when it's needed again.
Unless you are running a diskless node, a small swap, say 512 MB, is a good thing.
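
To confirm whether any swap is configured at the moment, the SwapTotal and SwapFree lines in /proc/meminfo are enough. A small Python sketch (the function name swap_kib is hypothetical):

```python
def swap_kib(meminfo_text):
    """Return (SwapTotal, SwapFree) in KiB parsed from /proc/meminfo text."""
    values = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and key in ("SwapTotal", "SwapFree"):
            values[key] = int(rest.split()[0])  # first field is the number, in kB
    return values.get("SwapTotal", 0), values.get("SwapFree", 0)
```

A result of (0, 0) means the system is running with no swap at all.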

You can make a swap file if you want to test your swap theory, but I agree, no swap is unlikely to be the problem.

Neddy - testing/swapping memory was one of the first things I did. No change. Plus memtest86+ ran for 36 hours with no reported issues. I've got two DIMMs and tried running with just one or the other. No change. I've been swinging back and forth over HW vs SW problems. I've eliminated just about all I can HW-wise - all that's left is the CPU and power supply.

I can try some sort of other live DVD - although I need to go through the hoops to make it work from a USB stick - no CD/DVD/etc drive installed.

Krinn - thanks for the pointer. I had -march=native before. My intent was to un-optimize it even more - just generic x86-64. The way I understand it, native could be using SSE, etc. I was trying to remove even these usages to pare down my problem.

I'll read up more to see what the appropriate march is - probably generic?
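
One thing worth checking while reading up: SSE and SSE2 are part of the x86-64 baseline, so even a generic build (for example -march=x86-64 -mtune=generic) can still use them. To see which extensions this CPU advertises, the flags line in /proc/cpuinfo can be parsed. A Python sketch, where the list of flags to report is just my own pick:

```python
def cpu_flags(cpuinfo_text):
    """Return the set of feature flags from the first 'flags' line of /proc/cpuinfo."""
    for line in cpuinfo_text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and key.strip() == "flags":
            return set(rest.split())
    return set()


# Extensions to report on; "pni" is how /proc/cpuinfo spells SSE3.
INTERESTING = ["sse", "sse2", "pni", "ssse3", "sse4a", "sse4_1", "sse4_2", "avx"]


def report(cpuinfo_text):
    """Map each interesting flag name to True/False for this CPU."""
    flags = cpu_flags(cpuinfo_text)
    return {name: name in flags for name in INTERESTING}
```

Anything reported as False here must not be enabled through -march or explicit -m flags.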

Without having examined everything in detail, a bad opcode
is almost always a result of an inappropriate CFLAG. (Bad
assembly code that uses an opcode not supported by the cpu,
a binutils bug, or a gcc bug would be possible, too, if less common.)

The kernel pretty much sets its own CFLAGS, though, so if you
have the correct architecture and use a stable gcc, inappropriate
CFLAGS would be pretty rare in kernel compiles. I would look for
something in "Processor Type and Features" in the kernel .config.
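
To see which processor family the kernel was built for, the .config can be scanned for the family options. This is only a sketch: the option names below are the x86-64 choices as I remember them from kernels of this era, so treat the list as an assumption and adjust it to your kernel version.

```python
# Assumed processor-family options for x86-64; verify against your kernel's Kconfig.
FAMILY_OPTIONS = (
    "CONFIG_GENERIC_CPU",
    "CONFIG_MK8",
    "CONFIG_MPSC",
    "CONFIG_MCORE2",
    "CONFIG_MATOM",
)


def selected_cpu_family(config_text):
    """Return the processor-family options that are set to 'y' in .config text."""
    selected = []
    for line in config_text.splitlines():
        name, sep, value = line.strip().partition("=")
        if sep and name in FAMILY_OPTIONS and value == "y":
            selected.append(name)
    return selected
```

Normally exactly one option comes back. If it names a family the CPU does not belong to, that is the first thing to change.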

A software bug should have been much more repeatable, so even though your hardware tests passed, I think it must be a hardware bug. The problem is that something complicated is triggering the hardware bug, so it is not caught by the simpler tests.

You have established that it is not a heat related issue or a RAM issue. I suspect the problem is related to the disk drive subsystem since that is being exercised during your failures but was not exercised in any of your hardware tests. It is also possible the problem is the CPU.

Unfortunately, the next level of tests involve swapping either the CPU or the motherboard. I suggest you report this as defective product and try to get a refund or replacement.

For the C/E-series CPUs, don't use -march=k8 in the kernel or make.conf.

I was wondering why someone would use CONFIG_GENERIC_CPU with a k10
architecture chip. So these AMD APUs are not k10s (perhaps some features
in common, but not drop-in replacements that will necessarily run
the same compiled code). While that module may not be the cause of the error,
one wonders if the lm_sensors k10temp module actually works with the AMD HSA
(Fusion) architectures.

I'll take the k10temp module out next time I recompile. I only added it to check the temps when I started noticing the crashes. It was crashing without it.

Quote:

A software bug should have been much more repeatable, so even though your hardware tests passed, I think it must be a hardware bug. The problem is that something complicated is triggering the hardware bug, so it is not caught by the simpler tests.

The test is very repeatable. The dd command above fails EVERY time - it never completes successfully, and I always get a kernel crash. Sometimes it takes a little longer, but it always crashes.

I'm trying to dig more to convince myself it's hardware. Think it might be trouble getting an RMA for this - I'm not even convinced myself it's HW. Gotta replace the whole motherboard and CPU - it's a combo (CPU is BGA soldered to the board). The thing works fine for just about everything else till I start hitting it hard with the backups.

Code:

CONFIG_X86_RESERVE_LOW=64

I'll try upping this next...

I'm also going to try and change my test to read from the raw drive(s) instead of the raw RAID device. Those results may be informative.
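
For comparing the raw drives against the RAID device, the read test boils down to reading each device from start to end and throwing the data away. A rough Python equivalent of that kind of dd run is sketched below. The function name read_all is made up, and the device paths you pass in (something like a member disk versus the md device) depend on your setup.

```python
def read_all(path, block_size=1024 * 1024):
    """Read path sequentially to the end and return the number of bytes read."""
    total = 0
    # buffering=0 gives plain unbuffered reads, one read call per block
    with open(path, "rb", buffering=0) as dev:
        while True:
            chunk = dev.read(block_size)
            if not chunk:  # empty result means end of device or file
                break
            total += len(chunk)
    return total
```

Run it as root against each member drive in turn, then against the RAID device. If only the RAID device triggers the crash, that narrows things down considerably.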

Quote:

The test is very repeatable. The dd command above fails EVERY time - it never completes successfully, and I always get a kernel crash. Sometimes it takes a little longer, but it always crashes.

Is the crash always in the same place in the code? When I first installed Gentoo, I had a hardware issue where I could consistently get my machine to crash when I was doing big compiles when using a ReiserFS but not ext2. But no two crashes were identical.
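
If you have several oops reports saved, one way to answer that is to pull the function name out of each RIP line and count how often each one shows up. A Python sketch follows. The regular expression is my own guess at the common "symbol+0xoffset/0xsize" layout and may need adjusting for your kernel's log format.

```python
import re
from collections import Counter

# Matches the first "symbol+0xOFF/0xSIZE" that follows "RIP" on a log line.
RIP_LINE = re.compile(r"\bRIP\b.*?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")


def crash_sites(log_text):
    """Count how many RIP lines point into each kernel function."""
    sites = Counter()
    for line in log_text.splitlines():
        match = RIP_LINE.search(line)
        if match:
            sites[match.group(1)] += 1
    return sites
```

Some kernels print the RIP line twice per oops, so compare the spread of names rather than the raw counts. One function dominating points towards software, while a scatter of unrelated functions points towards hardware.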

Quote:

I'm trying to dig more to convince myself it's hardware. Think it might be trouble getting an RMA for this - I'm not even convinced myself it's HW. Gotta replace the whole motherboard and CPU - it's a combo (CPU is BGA soldered to the board). The thing works fine for just about everything else till I start hitting it hard with the backups.

Usually there is a time limit on an RMA and the big question is who will pay for shipping. You don't have to ship it back immediately but I don't want you to miss out on options while you are trying to diagnose the problem. I think you should get the RMA process in motion. Ideally, you could discuss it with someone and they would give you more time for further testing before you have to send it back.

I really do think you have a hardware problem that is difficult to diagnose. I've run into a few of these over the years and they can suck up a tremendous amount of time and energy. At some point you need to treat it like it's a hardware problem even if you can't prove (even to yourself) that the problem is hardware. It is now extremely unlikely the problem is a bad instruction in the code. If it were, you'd be much closer to pinpointing where in the codebase the problem is.

There is no way mis-tuning CONFIG_X86_RESERVE_LOW could cause the problems you have if they are due to software. If it could, the kernel would be garbage, and I know it is not garbage. If you want to play around with things to see if you can work around the bug, then you could try turning off multi-core support. If a non-SMP kernel did work, that would be further evidence of a hardware problem, although it would not constitute proof.
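
A cheaper first step than rebuilding is to boot the existing kernel with maxcpus=1 (or nosmp) on the kernel command line, then confirm that only one CPU is online before rerunning the test. The online CPUs are listed in /sys/devices/system/cpu/online as a string like "0" or "0-3". A small Python sketch for counting them (count_online is a made-up name):

```python
def count_online(cpu_list):
    """Count CPUs in a kernel CPU list string such as '0', '0-3' or '0,2-3'."""
    count = 0
    for part in cpu_list.strip().split(","):
        if not part:
            continue
        first, sep, last = part.partition("-")
        # a range "a-b" covers b - a + 1 CPUs; a bare number is one CPU
        count += (int(last) - int(first) + 1) if sep else 1
    return count
```

If this returns 1 and the dd test still crashes, SMP is probably not the trigger.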