This has happened three times during the last month, but never before (server running for 3 years).

From a quick Google search, it seems this is a serious matter.

However, the vendor support technician said:

I have seen these errors MANY times, and unless you are overclocking your CPU - or have had a fan failure or similar - it is VERY unlikely to be a processor problem. It is more likely that the kernel is misreporting the error.

So - is this a critical error that means I should order new parts (replace the CPU?), or can I ignore it?

Were they all around the same time? It's very unlikely that the processor is misreporting the error. But odd things like solar flares can also cause these errors, and those are nothing to worry about. If the processor is going bad, well, I'd worry about that.
– Chris S, Nov 28 '12 at 20:51

If your system did not change in the last month (no new kernel with reporting options set vs an old one which did not log it etc) then misreporting the issue seems.... uhm... a creative answer from the vendor support technician.
– Hennes, Nov 28 '12 at 20:53

I'd believe it is a hardware error. If it is occurring frequently, then I'd get my support out there and replace the CPU. Otherwise, I might not worry about it.
– mdpc, Nov 28 '12 at 23:35

You can try swapping two CPUs. If the error goes away, you win. If it follows the CPU, I'd be pretty convinced it was a CPU problem.
– David Schwartz, Nov 29 '12 at 1:39

2 Answers

As for machine check exceptions, these are reported by the hardware; the kernel is just passing the message on to you, so that you can take action before the hardware problem gets out of hand and results in a real disaster.

The only instance I was able to find of a kernel "misreporting" a machine check exception was the following. In this case, it was a flaw in the processor causing the problem, not the kernel.

Intel Xeon processor E7 family processors have an issue in which some c-state transitions can cause false correctable Machine Check Exception (MCE) errors to be reported from MCE bank 6 to the user. On some E7 processor family systems, this resulted in "floods" of MCE errors. This patch disables MCE error reporting for bank 6.

Bottom line: It sounds to me like the vendor is trying to avoid replacing your defective hardware.

On enterprise servers we handled it like this:
Have the vendor replace if the errors are excessive or if they repeat week after week.
Actually, the event monitoring service triggered that all by itself. No questions asked.
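
To make that rule of thumb concrete, here is a minimal Python sketch of such a check. It is not what our monitoring service actually ran: the function names and both thresholds (`excessive_per_week`, `repeat_weeks`) are made up for illustration, and it assumes you have already pulled the dates of the logged errors out of your logs.

```python
from collections import Counter
from datetime import date, timedelta


def week_start(day: date) -> date:
    """Return the Monday of the week containing `day`."""
    return day - timedelta(days=day.weekday())


def should_replace(event_days, excessive_per_week=10, repeat_weeks=3):
    """Return True if the error history suggests replacing the hardware.

    Both thresholds are illustrative assumptions, not vendor values.
    """
    # Bucket the error dates by calendar week.
    per_week = Counter(week_start(d) for d in event_days)
    if not per_week:
        return False

    # "Excessive": any single week with too many errors.
    if max(per_week.values()) >= excessive_per_week:
        return True

    # "Week after week": a run of consecutive weeks that each logged an error.
    run = 0
    previous = None
    for week in sorted(per_week):
        if previous is not None and week - previous == timedelta(weeks=1):
            run += 1
        else:
            run = 1
        if run >= repeat_weeks:
            return True
        previous = week
    return False
```

With these assumed defaults, one error in each of three consecutive weeks is enough to trigger a replacement, while two errors a fortnight apart are not.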

After moving to x86, we also heard the stories about EDAC/MCE being confused, etc.
If the errors keep coming, the hardware should be replaced.

(There's also a low chance of it being connected with big solar events.
It IS possible, but flaky PC hardware and vendors reluctant to replace parts are far more common.)