Dell R415/R515 (11G) Random Reboots

We have two identical racks of Dell servers that each include an R415 and an R515. These four servers were rebooting periodically – typically within six weeks, but sometimes within a couple of weeks or even a couple of days. No kernel panic, no operating system errors; the systems simply rebooted, as if the plug had been pulled and reinserted.

Please note: the problem described below has not been resolved on these machines. Varying the BMC/iDRAC firmware level and the specific configuration of the IPMI global enables seems to change the interval between reboots, but the underlying fault does not appear to be fixed.

For more background information, read on, but please bear in mind that the problem still exists. This page will be updated when more information is available…

None of the other servers in either rack showed this problem, despite having very similar configurations (operating system, network configuration, etc.). The racks are in different data centres, so environmental differences seemed unlikely.

The BMC/iDRAC6 system event logs variously showed faulty DIMMs (ECC failures), CPU machine checks, power supply sensor failures (communication errors), and OEM diagnostic events, sometimes with dates in the 1970s. The hardware was quite obviously fine – there was only a slim chance of genuine failures occurring across four servers bought in two batches a couple of months apart.
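Those impossible 1970s timestamps are themselves a useful diagnostic. A quick sketch (not from the original post) of how to flag them: filter `ipmitool sel list` output for entries dated before 2000, which almost certainly indicate corrupt log data rather than real events.

```shell
# Hedged sketch: flag SEL entries whose dates predate 2000 (e.g. the 1970s
# timestamps mentioned above) as likely corruption rather than real faults.
sel_date_check() {
  # `ipmitool sel list` output is pipe-separated; field 2 is the date
  # as MM/DD/YYYY.
  awk -F'|' 'NF >= 2 {
    split($2, d, "/")
    if (d[3] + 0 > 0 && d[3] + 0 < 2000) print "SUSPECT:" $0
  }'
}
# On a live system: ipmitool sel list | sel_date_check
```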

Downgrading the BMC/iDRAC6 firmware from 1.70 to 1.54 made no difference, nor did upgrading the iDRAC6 to 1.80 (no upgrade was available for the BMC). All other firmware was updated to the very latest as of February 2012, but the reboots continued.

Dell’s support was little more than a waste of my time. In spite of the obvious conclusion that the hardware components were not actually faulty, and that the errors were red herrings, they insisted on sending an engineer to replace parts – with no rationale other than to appear to be doing something. I didn’t have time to waste pointlessly messing about with production servers.

I was suspicious of the fact that the problem servers had chipsets reported by lspci as ‘Dell Device 0488’ and ‘Dell Device 0489’ – these servers were all very similar in terms of their motherboards.
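If you want to check your own machines for the same devices, a hedged sketch: Dell’s PCI vendor ID is 1028, and lspci reported the devices as 0488/0489, so `lspci -nn` output on an affected board should contain `[1028:0488]` or `[1028:0489]` (the vendor:device form is my assumption from how lspci names unknown devices).

```shell
# Hedged sketch: spot the suspect chipset devices in `lspci -nn` output.
# 1028 is Dell's PCI vendor ID; 0488/0489 are the device IDs from the text.
check_chipset() {
  # Reads `lspci -nn` output on stdin.
  grep -E '\[1028:048[89]\]'
}
# On a live system: lspci -nn | check_chipset && echo 'suspect chipset present'
```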

It seemed very much like the BMC was receiving corrupt data, either on the SMBus or else somehow reaching the Event Message Buffer.

I tried setting PCI registers to prevent any initialisation of the PCI->SMBus bridge, in case another PCI device was corrupting the bus in a way that was pushing random data onto the SMBus, but this made no difference.
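For reference, the general shape of that experiment looked something like the following – the bus address below is purely hypothetical (find the real one with lspci first), and clearing the command register stops the bridge decoding I/O and memory accesses. This needs root and can easily upset a live system.

```shell
# Sketch only -- the 00:14.0 address is a placeholder, not from the post.
lspci -nn | grep -i smbus        # locate the PCI->SMBus bridge
setpci -s 00:14.0 COMMAND=0000   # disable I/O and memory decode on it
```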

Periodically cold-restarting the BMC or iDRAC6 also made no effective difference.

My last attempt to avoid these reboots was to disable the BMC’s Event Message Buffer and System Event Log. I also took the opportunity to set the BMC time correctly (it was out by an hour on my servers), and I decided to reboot the BMC before disabling the event buffer and system log. So, in order: cold-reset the BMC, set the BMC time, disable the Event Message Buffer, then disable the System Event Log.
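The exact commands I ran aren’t reproduced here, but a sketch of the sequence using ipmitool (option names and the time string are illustrative) might look like this – pass `echo` as the argument for a dry run, or nothing/`sudo` to run it for real on a host with the IPMI drivers loaded:

```shell
# Hedged sketch of the BMC quiesce sequence described above.
bmc_quiesce() {
  # Pass "echo" for a dry run; pass nothing (or "sudo") to run for real.
  run="$1"
  $run ipmitool mc reset cold                       # 1. cold-reset the BMC
  $run ipmitool sel time set "02/25/2012 14:00:00"  # 2. set the BMC clock (example time)
  $run ipmitool mc setenables event_msg=off         # 3. disable the Event Message Buffer
  $run ipmitool mc setenables system_event_log=off  # 4. disable the System Event Log
}
# Afterwards, `ipmitool mc getenables` should report both as disabled.
```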

I had deliberately left the event_msg option enabled on one of the four servers to serve as a control for the experiment, and sure enough this was the one that rebooted (after 6 weeks). I disabled the option on this machine.

So far at least, disabling the Event Message Buffer does seem to help prevent the reboots, although the two R415 servers did each reboot after three months. The event_msg option had been re-enabled in the BMC afterwards – I suspect by the OpenManage software on startup – so I can’t tell whether it was already re-enabled before the reboot occurred.

I’ve decided to cold-reset the BMC once every 24 hours to see if this prevents the corruption and the three-monthly reboots on the R415s. Time will tell…
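Assuming a root crontab and ipmitool installed on the host (both assumptions – the post only says “once every 24 hours”, not how it was scheduled), the daily reset might look like:

```shell
# Root crontab entry: cold-reset the BMC every day at 04:00 (example time).
0 4 * * *  /usr/bin/ipmitool mc reset cold
```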

I may try re-enabling the System Event Log if and when I feel convinced that the reboot problem is solved, but for now I’d rather have a stable motherboard than system management data, especially if the management controller itself is causing the problems!

Based on my own experience, and similar experiences I’ve seen reported by others on the Internet, I suspect that there is a bug in Dell’s BMC that manifests itself in various ways according to the system configuration. This may, of course, be in response to other errors, such as the PSU firmware corrupting the SMBus, but the BMC should be resilient against external anomalies. I guess that for many people it’s benign, but under certain configurations it will have more obvious effects.

If I’m wrong, then the bus corruption must be directly affecting another component – e.g. the reset is not being mediated by the BMC, but rather the corruption is inadvertently telling the CPU to reset. The CPU communications look to be on an entirely separate bus (according to one system management diagram that I’ve seen), but I can’t find enough definitive information.

If you’ve had similar problems with Dell 11G servers that sound like they might be BMC related, please leave a comment to share your experience.

I’ve had the same problem with a Dell R415 server used for running QEMU VMs for students. About 6 weeks after installation it rebooted once, then more frequently. Under load it can reboot every few minutes, but even when idle with nobody logged in it has rebooted.

Interesting – none of my servers seem to crash more frequently under load, but otherwise the symptoms are similar. I’d be tempted to try a single CPU in each possible CPU/socket configuration – particularly since you can force a reboot simply by stressing the system (i.e. the error is easy to replicate).

I’d also suggest comparing the logs from a reboot under heavy load with those from a random reboot under normal conditions (if you haven’t already). It’s not impossible that they have different causes.

Hi Bob, since updating the iDRAC/BMC firmware to version 1.95 and BIOS to 1.9.3, I’ve had no further spurious reboots on any of the four problem servers. The update to 1.92 (as I recall) was described as an ‘Urgent security fix’, with no elaboration, which makes me think that Dell finally fixed the error but didn’t want to acknowledge it.

On the other hand, I’ve been here before, thinking it was solved, only to be greeted with a reboot notification at 3am the following morning! So, the jury’s still out as far as I’m concerned.

If you’re still getting the problems on the latest updates, please get in touch by email and we can compare notes…

Not known to be fixed, but the R415 has had all the latest firmware applied and is now running Debian/Jessie (it needed some tinkering to get the bnx2 firmware loaded after a minimal install; Ubuntu installed without a hitch). On RHEL6 there was also an error about a missing module for CPU exception handling (I forget the details) – you may want to check your own dmesg output for kernel startup messages.

Did you ever get this resolved? We have an R515 server that every so often goes down citing a ‘CPU # machine check detected’ followed by a number of ‘An OEM diagnostic event has occurred’ entries in the logs.

I had thought maybe one of the CPUs was duff, but have noticed it can mention either CPU 1 or 2, so am stumped. Do you think making your changes has solved it? Shall I just get rid of the server – it’s useless like this?

I’m not sure – the servers are now out of commission. I did discover a package (part of the RHEL6 yum repository) that specifically installed a handler for CPU machine check exceptions. Unfortunately I don’t have the machine running to check my bash history for exactly which package I installed – search around ‘cpu’, ‘exception’, and ‘machine’ with yum to try to locate it. The server never rebooted after I installed it (and also brought all the firmware up to date), but it was never left running for any significant length of time.

My memory is hazy on this; it may even simply have been a kernel module that I configured to load to handle this (perhaps grep for ‘mce’ or ‘cpu’). I’d be interested to know how you get on.
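A sketch of how you might hunt for that machine-check handling on a running system – the exact package is lost to history; on RHEL6 the usual suspects would be the mcelog package and the EDAC kernel modules, but treat those names as assumptions to verify:

```shell
# Hedged sketch: look for evidence of machine-check/EDAC handling.
mce_evidence() {
  # Reads `dmesg` output on stdin, keeping machine-check/EDAC-related lines.
  grep -Ei 'mce|machine check|edac'
}
# On a live system:
#   dmesg | mce_evidence
#   lsmod | grep -Ei 'edac|mce'
```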