Memory Errors Aren’t the End of the World (or Data)

The words “data corruption” can strike fear into the hearts of IT managers. Data corruption occurs when errors arise during the writing, reading, storage, transmission, or processing of data in your server. These errors can cause anything from a minor, temporary loss of data to a complete system crash or a permanent loss of data. Unfortunately, uncorrected memory device errors are the leading cause of server crashes.

Three IT trends seem to increase the likelihood of memory errors. The first trend is the increasing need for higher capacity memory. In the past few years, the average memory capacity per server has grown by more than 500%, which translates to trillions of memory cells.

Additionally, DRAM technology is changing to meet the demand for higher DIMM storage capacity, but newer designs increase the number of bits that may be negatively affected by an ionizing event.

Finally, 24/7 operations can also contribute to memory errors, causing one in three systems to experience one or more correctable memory errors each year.

But experiencing a memory error doesn’t necessarily mean the end of your carefully accumulated and analyzed data. If an error is detected and corrected before it becomes a critical failure, a system crash can be avoided. HPE server memory provides an increasingly comprehensive suite of memory error detection and correction features called Server Memory RAS (reliability, availability, and serviceability) that can help IT managers rest easier.

A short explanation of RAS:

Reliability: The ability of a system component to consistently perform according to its specifications while avoiding, detecting and repairing hardware faults.

Availability: The ratio of time a system or component is functional to the total time it is expected to function. In other words, the system stays operational even when faults occur.

Serviceability: The simplicity and speed with which a system can be repaired or maintained with as little disruption as possible.
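The availability ratio defined above can be made concrete with a little arithmetic. The following is a minimal sketch, not an HPE tool; the function names and the 8,760-hour year are assumptions chosen for illustration.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: a non-leap year of 24/7 operation


def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Ratio of functional time to the total time the system was expected to run."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total expected time must be positive")
    return uptime_hours / total


def downtime_per_year(avail: float) -> float:
    """Hours of downtime per year implied by a given availability ratio."""
    return (1.0 - avail) * HOURS_PER_YEAR


if __name__ == "__main__":
    # A hypothetical server that was down 1 hour out of every 100.
    print(availability(99, 1))
    # Yearly downtime implied by "three nines" availability.
    print(downtime_per_year(0.999))
```

Seen this way, each additional "nine" of availability cuts the yearly downtime budget by a factor of ten, which is why features that keep a server running through a memory fault matter.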

RAS features typically provide one or more key correcting elements, including duplication, recoverability, automatic updates, data backup and data archiving. One of the better-known RAS features is Error Correcting Code (ECC) memory, which corrects minor amounts of data corruption, identifies failing DIMMs so they can be replaced, and protects data both in the DRAM and in transit.
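To illustrate the idea behind error correcting codes, here is a minimal sketch of a textbook extended Hamming (8,4) code, which corrects any single flipped bit and detects any two flipped bits in a codeword. It is a teaching example only: server ECC works on much wider words, and nothing here is taken from HPE's implementation. The function names `encode` and `decode` are assumptions.

```python
from typing import List, Optional, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits into 8 bits: index 0 is overall parity, 1..7 are Hamming positions."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d       # data bits
    code[1] = code[3] ^ code[5] ^ code[7]        # parity over positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]        # parity over positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]        # parity over positions 4,5,6,7
    code[0] = sum(code[1:]) % 2                  # makes the total number of 1s even
    return code


def decode(code: List[int]) -> Tuple[Optional[int], str]:
    """Return (data, status), where status is 'ok', 'corrected', or 'uncorrectable'."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                      # XOR of set positions is 0 for a valid word
    overall_odd = sum(code) % 2 == 1
    fixed = list(code)
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        fixed[syndrome] ^= 1                     # single error; syndrome 0 means the parity bit itself
        status = "corrected"
    else:
        return None, "uncorrectable"             # two bits flipped: detected but not fixable
    data = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return data, status


if __name__ == "__main__":
    word = encode(0b1011)
    word[6] ^= 1                                 # simulate a single-bit memory error
    print(decode(word))
```

The "uncorrectable" result is the case that matters for uptime: the error is detected rather than silently passed on, so the system can react instead of using bad data.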

Another popular RAS feature is Single Device Data Correction (SDDC). This feature recovers data if a single DRAM device fails, allowing the system to continue running in the presence of correctable errors. The Double Device Data Correction (DDDC) feature corrects errors in two symbols and detects errors in three symbols. In other words, if one DRAM chip fails, the system will continue to run even if a second chip fails.

A fourth popular RAS feature is Memory Mirroring, which provides protection against uncorrectable memory errors that would otherwise result in system failure.
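The principle behind mirroring can be sketched in a few lines: every write goes to two copies, and a read falls back to the second copy when the first fails its integrity check. This is a simplified model, not how HPE servers implement it. Real memory mirroring is done in hardware by the memory controller; the `MirroredMemory` class and the use of a CRC-32 checksum as a stand-in for the hardware error check are assumptions for illustration.

```python
import zlib


class MirroredMemory:
    """Toy model: each address is stored twice, with a checksum to detect corruption."""

    def __init__(self):
        self.primary = {}
        self.mirror = {}

    def write(self, addr: int, data: bytes) -> None:
        record = (data, zlib.crc32(data))        # checksum stands in for hardware error detection
        self.primary[addr] = record
        self.mirror[addr] = record

    @staticmethod
    def _valid(record) -> bool:
        data, checksum = record
        return zlib.crc32(data) == checksum

    def read(self, addr: int) -> bytes:
        record = self.primary[addr]
        if self._valid(record):
            return record[0]
        record = self.mirror[addr]               # primary copy is bad: fall back to the mirror
        if self._valid(record):
            self.primary[addr] = record          # repair the primary from the good copy
            return record[0]
        raise RuntimeError("uncorrectable error in both copies")


if __name__ == "__main__":
    mem = MirroredMemory()
    mem.write(0x10, b"payload")
    # Simulate an uncorrectable error in the primary copy only.
    mem.primary[0x10] = (b"paXload", mem.primary[0x10][1])
    print(mem.read(0x10))
```

The trade-off is capacity: keeping a second copy means only half of the installed memory is available to the operating system.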

These are just a few of the many RAS features available across Hewlett Packard Enterprise’s current server portfolio. Data centers that take advantage of the full suite of memory RAS capabilities in their HPE servers can improve the prediction of critical memory error conditions, prevent unnecessary DIMM replacement due to noncritical errors, and as a result increase server uptime. In fact, HPE SmartMemory finds and fixes 99.9998% of memory errors.