Now I want to talk about another feature that Bull’s BCS Architecture is leveraging: Intel RAS

What is RAS and what is its purpose?

Today’s crucial business challenges require the handling of unrecoverable hardware errors, while delivering uninterrupted application and transaction services to end users. Modern approaches strive to handle unrecoverable errors throughout the complete application stack, from the underlying hardware to the application software itself.

RAS Flow – Courtesy of Intel

Such solutions involve three components:

reliability, how the solution preserves data integrity,

availability, how it guarantees uninterrupted operation with minimal degradation,

serviceability, how it simplifies proactively and reactively dealing with failed or potentially failed components.

This post covers only the memory management mechanisms providing reliability and availability. Next post will cover other mechanisms.

Memory Management mechanisms

Memory errors are among the most common hardware causes of machine crashes in production sites with large-scale systems.

Researchers observed more than 8 percent of DIMMS and about one-third of the machines in the study were affected by correctable errors per year.

At the same time the annual percentage of detected uncorrected errors was 1.3 percent per machine and 0.22 percent per DIMM.

Capacity of memory module has increased – following Moore’s law – over the last two decades. In the 80’s you could buy 2MB memory modules, 20 years later, 32GB memory modules hit the market. That is a 16,000x improvement.

One of the unique reliability and availability features of the bullion is its RAM memory management and memory protection. From basic ECC up to Memory Mirroring, memory protection mechanisms can guarantee up to 100% memory reliability on the bullion.

Let’s have a look at some of those memory protection mechanisms available in the bullion:

ECC memory

Over and above traditional memory correction mechanisms, such as ECC memory, which maintains a memory system effectively free from single-bit errors.

Double device Data Correction (DDDC)

DIMM & Rank Sparing

The commonly available DIMM Sparing is now being enhanced to provide Rank Sparing. With Rank Sparing of dual rank DIMM’s, only 12.5% is being used to enhance the reliability of the memory system. If the level of ECC corrected errors becomes too high, it fails over the spares. Note that DIMM and Rank Sparing does not protect against uncorrectable memory errors.

DIMM Sparing- Rank Sparing – Courtesy of Bull

MCA Recovery

In a virtualized environment, the Virtual Machine Manager (VMM) shares the silicon platform’s resources with each virtual machine (VM) running an OS and applications.

In systems without MCA recovery, an uncorrectable data error would cause the entire system and all of its virtual machines to crash, disrupting multiple applications.

With MCA recovery, when an uncorrectable data error is detected, the system can isolate the error to only the affected VM. Here the hardware notifies the VMM (Support for VMware vSphere 5.x), which then attempts to retire the failing memory page(s) and notify affected VMs and components.

If the failed page is in free memory then the page is retired and marked for replacement, and operation can return to normal. Otherwise, for each affected VM, if the VM can recover from the error it will continue operation; otherwise the VMM restarts the VM.

In all cases, once VM processing is done, the page is retired and marked, and operation returns to normal.

It is possible for the VM to notify its guest OS and have the OS take appropriate recovery actions, and even notify applications higher up in the software stack so that they take application-level recovery actions.

Here is a video demoing the MCA Recovery (MCAR) with VMware vSphere 5.0

Here is a diagram of MCA recovery process:

Software-Assisted MCA Recovery Process – Courtesy of Intel

MCA Recovery is cool but the main drawback it does not offer 100% memory reliability. The scrubbing process that goes through all memory pages to detect the unrecoverable error takes some time, and a few CPU cycles too.

If you’re fortunate enough the MCA Recovery detects the error and reports to the VMM (VMware vSphere 5.x) otherwise you end up most probably with a purple screen of death.

Mirroring Mode

For 100% memory reliability, bullion use memory lockstep. Data are written simultaneously in two different memory modules in lockstep mode. It is the best memory protection mechanism for both reliability and availability as it protects against both correctable and uncorrectable memory errors. On four memory channel systems such the bullion, you cut your available number of DIMM slots by 1/2.

Share this:

Like this:

LikeLoading...

Related

About PiroNet

Didier Pironet is an independent blogger and freelancer with +15 years of IT industry experience.
Didier is also a former VMware inc. employee where he specialised in Datacenter and Cloud Infrastructure products as well as Infrastructure, Operations and IT Business Management products.
Didier is passionate about technologies and he is found to be a creative and a visionary thinker, expressing with passion and excitement, hopefully inspiring and enrolling people to innovation and change.

3 Responses to Bull’s BCS Architecture – Deep Dive – Part 3

This is very interesting! Yes, we all know that ECC RAM is required because of random bit flips in RAM. But don’t these bit flips also occurs on hard disks, but on a wider scale? How do you resolve hard disk bit flips?

Hi Mike and thanks for your comment.
Good question indeed. Hard disk also uses ECC technologies as well to mitigate to correct bit flips. Adding on top RAID protection.
Now I don’t know for sure if – for the same capacity – there are more bit flips occurring on HD than memory. I leave that to statisticians 🙂

DISCLAIMER

Views expressed here are mine, they are not read or approved in advance by any company and don’t reflect the views of my employer, my employer’s business partners, or clients. I am solely responsible for all content produced here. No information provided here was reviewed by or endorsed by my employer or any other vendor or organization. This is my own blog. Comments are moderated!