It will report memory, CPU, disk, or any other error. Whether the message is vague or specific depends on the driver that is running the hardware, and how much you get out of it depends on how much you know about the Linux kernel, or at least how the code is organized. You'll still have to figure out whether the problem is the hardware or the driver.

For disks, it will report a bunch of information that will help you determine if it's bad.

On CPU, for instance, I posted quite a while back about a problem where all my cores will not show up. I used dmesg to troubleshoot; the issue is still unresolved, though.

CPU: mce - available in consumer hardware
memory: ECC - available in consumer hardware
pcie has error reporting facilities - but I don't know if they are available in consumer hardware.

It depends on whether it will be vague or specific or how much you know about the Linux kernel or at-least how the code is organized and will come from the driver that is running the hardware and you'll have to figure out if it is hardware or the driver.

OK, that doesn't really help. Are you aware of any example search terms? "There might or might not be something reported" isn't exactly helpful, so I'm trying to identify what exactly it is I need to know so I may identify the errors.

notageek wrote:

On CPU for instance, I have posted quite a while back, where all my cores will not show up. Used dmesg to troubleshoot, the issue is still unresolved though.

I'll see if I can find the thread.

Thanks.

energyman76b wrote:

CPU: mce - available in consumer hardware
memory: ECC - available in consumer hardware
pcie has error reporting facilities - but I don't know if they are available in consumer hardware.

Right, but that doesn't explain how to track down or observe reported errors from Linux. Are the errors reported through the kernel to system logs, or some other inconsistent solution?

Interesting. That it doesn't occur with Fedora 11 makes it appear to be a kernel issue.

I'd have to see if I can track down an example, but I was thinking more along the lines of bit errors which could be memory or possibly a CPU. Otherwise both memory & CPU are functional, at least until the error is encountered another time.

Interesting. That it doesn't occur with Fedora 11 makes it appear to be a kernel issue.

I'd have to see if I can track down an example, but I was thinking more along the lines of bit errors which could be memory or possibly a CPU. Otherwise both memory & CPU are functional, at least until the error is encountered another time.

That was the example.

Yes, it is a kernel issue (probably) or a hardware issue. I was castigated in these forums, when I suggested Fedora has an almost proprietary kernel.

The other point is that how useful a message is depends on the message itself. dmesg gets your work done most of the time, and if you see an obscure error, it is either a hardware issue or a driver/kernel issue.

CPU: mce - available in consumer hardware
memory: ECC - available in consumer hardware
pcie has error reporting facilities - but I don't know if they are available in consumer hardware.

There are any number of interacting means (buses and protocols) by which hardware errors may be internally or externally communicated by a PC (ECC, MCE, ACPI, SMBus, PMBus, I2C, DMI, SNMP, WMI, IPMI, etc.). The bottom line from a single machine user perspective is that it all gets dumped into the logs. From a multi-machine admin perspective, the enterprise tools can collect and handle it (gathering from DMI, SNMP, WMI, IPMI, network logging, and handling by any of the various systems management suites).

I was castigated in these forums, when I suggested Fedora has an almost proprietary kernel.

IMO RH is very proprietary-like. CentOS is not RHEL, so RHEL is not truly available. Things which work on RHEL do NOT always work on CentOS, further proving the point (as far as I'm concerned). I wouldn't be shocked if they tweaked the kernel with "inside knowledge" which was still "made available" even if obscure. I'm not a fan of RH. IMO they meet the "letter of the law" but not the intent.

notageek wrote:

The other point is, it depends on the message on how useful it is. dmesg gets your work done most of the time and if you see an obscure error, it is either a hardware issue or a driver/kernel issue.

That makes a little more sense. I'll see if I can find some examples.

The bottom line from a single machine user perspective is that it all gets dumped into the logs.

If true, then it should be a matter of just identifying how the information is logged, which seems difficult to track down. Admittedly I haven't spent a long time searching, and I don't have an actual error I'm looking for, but my initial searches didn't reveal much that was useful. I'll have to spend some time guessing at keywords.
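As a starting point for the keyword guessing, here is a rough Python sketch that filters kernel log lines for a few strings that commonly show up in hardware-related messages. The pattern list and the function names are my own assumptions for illustration, not a complete or authoritative set; what actually gets printed depends on the kernel version and the driver.

```python
import re
import subprocess

# Illustrative patterns only (assumptions, not a complete list); extend
# them with whatever your own drivers actually print.
PATTERNS = {
    "machine check": re.compile(r"machine check|\bmce\b", re.IGNORECASE),
    "memory (EDAC)": re.compile(r"\bEDAC\b"),
    "disk I/O": re.compile(r"I/O error", re.IGNORECASE),
    "generic hardware": re.compile(r"hardware error", re.IGNORECASE),
}


def find_hardware_errors(lines):
    """Return (label, line) pairs for every line matching a pattern."""
    hits = []
    for line in lines:
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((label, line.rstrip("\n")))
                break  # report each line once, under the first match
    return hits


def scan_dmesg():
    """Run dmesg and filter its output (may need root on some systems)."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True)
    return find_hardware_errors(result.stdout.splitlines())
```

Calling `scan_dmesg()` and printing the pairs gives a quick first pass. The same `find_hardware_errors()` function can be pointed at the lines of a saved log file instead.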

There are error injectors you can load as a module and then use to simulate some types of hardware errors. Some log analysis programs come with examples, and people share their rules. Most commercial systems management products come preconfigured to deal with common problems.

I can think of two examples, both memory related, mainly because those are most common. One is when a random reboot occurs and memory is the culprit. The server otherwise runs fine, until a sudden reboot.

On Solaris, I can track the number of errors on a particular memory module so I know exactly which module to replace. Recurring instances could be an indication of a bad CPU. The module is also easily associated with a specific CPU.

I'm trying to identify similar means of identifying hardware problems under Linux.
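The closest Linux counterpart I know of to per-module error counts is the EDAC subsystem, which exposes corrected and uncorrected error counters under sysfs when an EDAC driver for the memory controller is loaded and the memory has ECC. The sketch below is a minimal example under those assumptions. The function names are hypothetical, and whether the `csrow` entries map cleanly onto physical modules depends on the board and driver.

```python
from pathlib import Path

# Standard EDAC sysfs location; it only exists if an EDAC driver is loaded.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def _read_int(path):
    """Read an integer counter file, returning None if it is missing."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def read_edac_counts(root=EDAC_ROOT):
    """Collect corrected (ce) and uncorrected (ue) counts per controller."""
    counts = {}
    root = Path(root)
    if not root.is_dir():
        return counts  # no EDAC support visible on this machine
    for mc in sorted(root.glob("mc[0-9]*")):
        entry = {
            "ce": _read_int(mc / "ce_count"),
            "ue": _read_int(mc / "ue_count"),
            "csrows": {},
        }
        # Per-chip-select-row corrected counts, where the driver provides them.
        for row in sorted(mc.glob("csrow[0-9]*")):
            entry["csrows"][row.name] = _read_int(row / "ce_count")
        counts[mc.name] = entry
    return counts
```

An empty result from `read_edac_counts()` means no counters were found, not that the memory is healthy. A corrected count that keeps rising on one row is the kind of signal that would point at a particular module.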

I can think of two examples, both memory related, mainly because those are most common. One is when a random reboot occurs and memory is the culprit. The server otherwise runs fine, until a sudden reboot.

On Solaris, I can track the number of errors on a particular memory module so I know exactly which module to replace. Recurring instances could be an indication of a bad CPU. The module is also easily associated with a specific CPU.

I'm trying to identify similar means of identifying hardware problems under linux.

Random reboot = triple fault. There is nothing to log because the reset happens automatically, outside of the control of the kernel.
http://en.wikipedia.org/wiki/Triple_fault

Triple faults indicate a problem with the operating system kernel or device drivers. In modern operating systems, a triple fault is typically caused by a buffer overflow or underflow in a device driver which writes over the interrupt descriptor table. When the next interrupt happens, the processor cannot call either the needed interrupt handler or the double fault handler because the descriptors in the IDT are corrupted.[citation needed]


Triple faults indicate a problem with the operating system kernel or device drivers. In modern operating systems, a triple fault is typically caused by a buffer overflow or underflow in a device driver which writes over the interrupt descriptor table. When the next interrupt happens, the processor cannot call either the needed interrupt handler or the double fault handler because the descriptors in the IDT are corrupted.[citation needed]

I don't know where you got that. But triple fault is typical for memory errors and power fluctuations.

Recent hardware has started incorporating enterprise features, but yes, these error detection hardware features used to be in the realm of only high availability/enterprise machines. And Linux being ported over to this enterprise hardware now means code is trickling down on how to deal with these errors. A lot of the time the code has to be tailored for the hardware.

But even so, not all failure modes are detected. Plus a lot of the failures are machine specific, CPU specific even, and the information needed to decode the data about what caused the problem isn't always available...

I do get some MCE logs on my AthlonXP, but I have not been able to find documentation on how to decode the bits... Grr...
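For what it's worth, the top bits of a machine check bank status value follow the x86 machine check architecture layout, so part of the decoding can be done generically. The Python sketch below pulls out those flags and the two 16-bit code fields. It assumes the logged value is a raw 64-bit MCi_STATUS register, the function name is made up for the example, and the meaning of the low code fields is model specific, so it still has to be looked up in the vendor documentation for the CPU in question.

```python
# Architectural flag bits in the upper part of an MCi_STATUS value.
STATUS_FLAGS = {
    63: "VAL",    # register contains valid error information
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # reporting was enabled for this error
    59: "MISCV",  # the matching MISC register holds extra data
    58: "ADDRV",  # the matching ADDR register holds an address
    57: "PCC",    # processor context may be corrupt
}


def decode_mci_status(status):
    """Split a 64-bit status value into flags and raw code fields."""
    flags = [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        # Bits 15:0, the MCA error code; interpretation varies by CPU.
        "mca_error_code": status & 0xFFFF,
        # Bits 31:16, the model-specific error code.
        "model_specific_code": (status >> 16) & 0xFFFF,
    }
```

This only answers "was it valid, was it corrected, is there an address", which is often enough to tell a harmless corrected event from something serious.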

I don't know where you got that. But triple fault is typical for memory errors and power fluctuations.

It came from your wikipedia link to Triple fault.

I just gave you the link, I never read it.

Seriously, what is more likely: that some driver running on millions of boxes is misbehaving just for you?
or
some hardware fault?

Well, since it is new to me, I can't say one way or the other. Given what you have indicated, it would seem like a hardware error.

What it sounded like it was describing wasn't inherently a common problem everyone using a driver would encounter. So if it can be a driver, or it can be hardware, that helps. A crash with a driver not known to have problems would indicate hardware or a rare condition bug. Obviously hardware would be easier to test in that case.

I've seen an arrangement of hardware result in discovery of a driver bug not otherwise encountered, but given that hardware arrangement, the bug was observable under repeatable conditions.

I have had my share of random reboots in the past. Every time it boiled down to:
memory.
or
power.

With my personal boxes, family, friends, at work. So, if someone tells me about random reboots, first thing I do today:
get a different psu
then
start testing the ram

Makes sense. My original hope with the thread was to identify errors rather than random replacement of parts and lengthy memory testing. Not for personal use, but business use. But it seems like the hardware and/or software features aren't yet in place.