Knowledge Base

As with most systems, the processor is a key component in a server, executing instructions and managing other components such as memory or PCI buses. So when a CPU seems to be having issues, it can be very worrying.

However, physical processor failures are extremely rare. In fact, in the majority of processor replacements, the CPU shows no failure once tested individually. When a CPU does fail, it's usually caused by an electrical surge to the system, cascade failure from another major component failing, or thermal issues. Therefore, it is critical to follow key troubleshooting steps when a processor failure is suspected in order to properly identify the component at fault.

The information and steps provided in this article will help you understand the possible source of the issue.

Generation 11:

Most servers of the 11th generation are equipped with Intel® Nehalem-EP processors. Nehalem-EP is the codename for the one- or two-socket server/workstation processor with up to four cores, targeted at platforms based on the Intel 5520 chipset (compatible with the Intel® Xeon® 5500 platform). Nehalem-EP is part of the family of 45 nm processors based on the Intel microarchitecture codenamed Nehalem. More information is available on the manufacturer's website www.intel.com.

The main change with this microarchitecture is that the memory controller is now embedded in the processor. This has an impact on the server's performance, but also on the errors that can be mistaken for processor errors.

Generation 12:

For this generation of servers, when equipped with Intel processors, the new platform is called Sandy Bridge EP, replacing the Nehalem microarchitecture. The integration of PCI-E lanes into this processor is a new step towards a multipurpose processing unit. More information is available online or on the manufacturer's website www.intel.com.

Generation 13:

The 13th generation of PowerEdge servers features the Intel® Haswell EP product family, offering an ideal combination of performance, power efficiency, and cost. More information is available online or on the manufacturer's website www.intel.com.

Since the processor interacts with all the components in a server, the symptoms and errors that can occur vary widely.
Here are some examples of common CPU issues, with technical articles and troubleshooting steps:

The server will not complete the Power-On Self-Test (POST). This means that a component is blocking the server from starting during the self-test. Here are some steps to follow to narrow down the list of components that could cause this:

Look for a possible error message on the LCD panel or LED lights on the front of the server. If an error message is available, it will provide some valuable information. You can review the CPU related error messages page or type this error message in a search engine to find more information.

If the processor has recently been changed or reinstalled, or might be physically damaged, you can do a visual check inside the chassis to see if anything has been damaged (the CPU or the CPU socket on the motherboard, for example).

Minimum to POST: Since a component might be causing the No POST situation, removing all unnecessary components to complete POST is an efficient technique.
The list of minimum components will vary depending on the model of server you currently have. Usually this will include: Power Supply, Motherboard, 1 CPU, 1 DIMM. For the exact list of components you can review the user manual for your Dell PowerEdge server.
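As a rough illustration of the minimum-to-POST approach, the Python sketch below compares an inventory of installed components against the typical minimum listed above and reports what could be removed for the test. The component names, the inventory format, and the `components_to_remove` helper are hypothetical. The exact minimum for a given model should come from its user manual.

```python
# Typical minimum configuration quoted in this article; the exact list
# depends on the server model (check the user manual).
MINIMUM_TO_POST = {"power supply": 1, "motherboard": 1, "cpu": 1, "dimm": 1}


def components_to_remove(installed):
    """Return {component_type: count_to_remove} for a minimum-to-POST test.

    `installed` maps a component type to the number currently installed.
    Any component type that is not part of the minimum is removed entirely.
    """
    removal = {}
    for component, count in installed.items():
        keep = MINIMUM_TO_POST.get(component, 0)
        if count > keep:
            removal[component] = count - keep
    return removal


# Hypothetical inventory used only for illustration.
installed = {
    "power supply": 2,
    "motherboard": 1,
    "cpu": 2,
    "dimm": 8,
    "raid controller": 1,
}
print(components_to_remove(installed))
```

If the server completes POST in this reduced configuration, components can be added back one at a time until the failure returns, which points to the component at fault.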

The symptoms of thermal issues vary widely: a temperature, fan, or heatsink error message on the LCD panel, the server turning off after a period of time and not turning back on right away, or system fans running at full speed all the time. Examples of error messages on a Dell PowerEdge server:

CPU0001 - CPU has a thermal trip (over-temperature) event.

CPU0010 - The CPU is throttled due to thermal or power conditions.

For more information on CPU related error messages, you can take a look at our dedicated CPU error page.
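When a log has been exported as text, the two thermal message IDs above can be searched for automatically. The sketch below is only an illustration: the message IDs and descriptions are the ones quoted in this article, while the log line format and the `find_thermal_events` helper are assumptions.

```python
import re

# Message IDs and descriptions as quoted in this article.
CPU_THERMAL_MESSAGES = {
    "CPU0001": "CPU has a thermal trip (over-temperature) event",
    "CPU0010": "The CPU is throttled due to thermal or power conditions",
}

# Matches IDs such as CPU0001: the letters "CPU" followed by exactly four digits.
MESSAGE_ID = re.compile(r"\bCPU\d{4}\b")


def find_thermal_events(log_lines):
    """Return (message_id, description) pairs for known thermal messages, in log order."""
    events = []
    for line in log_lines:
        for message_id in MESSAGE_ID.findall(line):
            if message_id in CPU_THERMAL_MESSAGES:
                events.append((message_id, CPU_THERMAL_MESSAGES[message_id]))
    return events


# Hypothetical log lines; a real export may be formatted differently.
sample_log = [
    "CPU0001 CPU 1 has a thermal trip (over-temperature) event",
    "FAN0001 Fan 3 RPM is below the threshold",
    "CPU0010 CPU 2 is throttled",
]
for message_id, description in find_thermal_events(sample_log):
    print(message_id, "-", description)
```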

Here is a list of key points to check in case of thermal issues:

Check the LCD and ESM for any additional error messages to identify the component causing issues.

Ensure the airflow to the machine is not blocked. Placing it in an enclosed area or blocking the vent holes can cause it to overheat. If it is installed in a rack, make sure the rack cooling system is working properly.

Verify the ambient temperature is within acceptable levels.

Check the internal system fans for obstructions and verify all fans are spinning properly. Swap any failing fans with a known-good fan for testing.

Another example of error messages that refer to the CPU is CPU IErr (for example "E1410 CPU IErr was asserted"). This is usually not an error with the CPU itself, but a sign that the CPU has detected an error in the system, or received an erroneous instruction from a system component. It could be the memory, PCI-E slots, etc.

Once the operating system is running, the symptoms of a possible CPU issue vary widely: slow performance, random reboots, or CPU errors in the system logs of the operating system. For PowerEdge servers, there are a few key elements to ensure optimal usage of the processor by the operating system:

Ensure the physical memory configuration of the server is correct as this will have an impact on the processor. The right DIMM must be in the right slot in the right channel for each processor and the total memory size must be balanced between the channels and the processors.

Check the memory configuration in the BIOS. Different settings are available depending on the type of behaviour you are looking for (Advanced ECC, Memory Optimized, Mirror). Each setting can require a different physical memory configuration, so it's important to verify this.

The server BIOS and iDRAC must be up to date. Any improvement or fix that could impact the processor will be done through a BIOS update so it's very important, when faced with a possible CPU issue, to update the BIOS of the server. The Embedded Server Management (also called BMC or iDRAC depending on the generation) is also an important element to have updated as it directly interacts with all the components in the server.

Important: Updating the BIOS of the server requires a reboot of the server.
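The memory balance requirement mentioned above (total memory size balanced between the channels and the processors) can be sanity-checked with a short script before opening the chassis. This is a simplified sketch: the layout format and the `check_memory_balance` helper are hypothetical, and it checks only total size per channel and per processor. The actual population rules for each model and memory mode are in the server's user manual.

```python
def check_memory_balance(layout):
    """Return a list of warnings for an unbalanced memory layout.

    `layout` maps a processor name to its channels, and each channel to the
    list of DIMM sizes (in GB) installed in it, for example:
    {"CPU1": {"A": [8], "B": [8]}, "CPU2": {"A": [8], "B": [8]}}
    """
    warnings = []
    cpu_totals = {}
    for cpu, channels in layout.items():
        channel_totals = {name: sum(dimms) for name, dimms in channels.items()}
        cpu_totals[cpu] = sum(channel_totals.values())
        # All channels of one processor should hold the same amount of memory.
        if len(set(channel_totals.values())) > 1:
            warnings.append(f"{cpu}: channels are unbalanced {channel_totals}")
    # All processors should hold the same amount of memory.
    if len(set(cpu_totals.values())) > 1:
        warnings.append(f"processors are unbalanced {cpu_totals}")
    return warnings


# Hypothetical layouts used only for illustration.
balanced = {"CPU1": {"A": [8], "B": [8]}, "CPU2": {"A": [8], "B": [8]}}
unbalanced = {"CPU1": {"A": [8, 8], "B": [8]}, "CPU2": {"A": [8], "B": [8]}}

print(check_memory_balance(balanced))
for warning in check_memory_balance(unbalanced):
    print(warning)
```

An empty list means the totals are balanced. It does not confirm that each DIMM is in the correct slot for the selected memory mode.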