Following on from part 1, part 2 and part 3, here is … part 4 of this deep-dive series on Bull’s BCS Architecture.

In the previous post I focused on the Intel RAS features that Bull’s BCS Architecture leverages to make memory more reliable and available.

In part 4, I will cover additional features leveraged by Bull’s specific server architecture. Some of these features directly address customers who require a level of reliability, availability and serviceability they could previously find only in expensive mainframe systems.

Reliability

Reliability addresses the ability of a system or a component to perform its required functions.

Dual Path IO HBAs

Each bullion module provides the ability to connect up to 3 HBAs per IO Hub (IOH), an Intel component that provides the interface between IO components such as PCIe buses and the Intel QPI-based processors. Those 3 HBAs can then be mirrored inside the same bullion module with the HBAs attached to a second IO Hub. This teaming gives you fault-tolerant IO connectivity and, combined with VMware’s Native Multipathing Plugin (NMP), lets you load-balance IO across the members of the team.
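As a sketch of the idea (path names and the selection policy below are illustrative; in practice ESXi’s multipathing layer does this for you, not application code):

```python
# Minimal sketch of round-robin IO path selection across mirrored HBAs.
# Path names ("ioh0-hba0", ...) are hypothetical placeholders.

from itertools import cycle

class MultipathSelector:
    """Round-robin over the healthy paths of an HBA team."""

    def __init__(self, paths):
        self.paths = list(paths)
        self.healthy = set(self.paths)
        self._rr = cycle(self.paths)

    def mark_failed(self, path):
        self.healthy.discard(path)

    def next_path(self):
        # Skip failed paths; raise only if the whole team is down.
        for _ in range(len(self.paths)):
            p = next(self._rr)
            if p in self.healthy:
                return p
        raise RuntimeError("all IO paths failed")

paths = ["ioh0-hba0", "ioh0-hba1", "ioh1-hba0", "ioh1-hba1"]
sel = MultipathSelector(paths)
first_four = [sel.next_path() for _ in range(4)]  # spread over both IOHs
sel.mark_failed("ioh0-hba0")                      # simulate one HBA dying
```

With one HBA failed, IO keeps flowing over the three remaining paths, which is the fault-tolerance the teaming provides.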

Four-Socket Two Boxboro IOH Topology – Courtesy of Intel

Availability

Availability of a system is typically measured as a factor of its reliability – as reliability increases, so does availability.

Active/Passive Power-supplies

The bullion servers are equipped with two 1600W power supplies, which are 80 PLUS Platinum certified. They provide full N+N redundancy for maximum availability.

For its mainframe systems, Bull has developed a patented solution based on an active/passive power-supply principle. This patented solution provides the highest possible efficiency rate, regardless of the requirements, while still providing the maximum uptime possible.

This technology from mainframe systems is now available on the bullion.

What is it exactly? The unique active/passive power-supply solution provides embedded fault resiliency against the most common electrical outages: micro-outages.

Rather than having to rely on heavy and expensive UPS systems, bullion servers are equipped with an ultra-capacitor that allows switching from the active to the passive power supply in case of failure, while also protecting against micro-outages.

The ultra-capacitor provides 300ms of autonomy, sufficient to switch over to the passive power supply or to ride through micro-outages without application unavailability.
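The decision logic implied here can be sketched in a few lines. The 300 ms figure comes from the text; the switchover time and the helper itself are my own illustrative assumptions, not Bull firmware logic:

```python
# Toy model: can the ultra-capacitor bridge a given power outage?
ULTRACAP_AUTONOMY_MS = 300   # autonomy quoted in the post

def survives(outage_ms, passive_psu_ok=True, switchover_ms=10):
    """True if the server rides through the outage.

    Short micro-outages are absorbed by the capacitor alone; longer
    outages require the passive PSU to take over within the autonomy
    window (switchover_ms is an assumed value)."""
    if outage_ms <= ULTRACAP_AUTONOMY_MS:
        return True   # micro-outage: ride through on the capacitor
    return passive_psu_ok and switchover_ms <= ULTRACAP_AUTONOMY_MS

survives(120)                          # 120 ms micro-outage: absorbed
survives(5000)                         # long outage: passive PSU takes over
survives(5000, passive_psu_ok=False)   # both conditions fail: downtime
```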

Bullion’ s Ultra-Capacitor – Courtesy of Bull

The passive PSU role rotates, and it is regularly tested with failover and failback runs to guarantee its availability in case the active PSU fails.

Bull announces overall power consumption 20-30% below the competition.

Serviceability

Serviceability refers to the ability of technical support personnel to install, configure, and monitor computer products; identify exceptions or faults; debug or isolate faults to root cause; and perform hardware or software maintenance to solve a problem and restore the product to service.

Empower Maintainability

The most frequently failing motorized components, such as the fans, power supplies and disk drives, are responsible for over 80% of hardware failures. On bullion servers, replacing them has no impact whatsoever on production, since there is always a redundant part available to take over from the failed one.

These components are now Customer Replaceable Units (CRUs). This program empowers you to repair your own machine; other server vendors have similar policies. In situations where a computer failure can be attributed to an easily replaceable part (a CRU), Bull sends you the new part. You simply swap the old part for the new one, no tools required. It is simple and a major advantage: really fast service for you and reduced support and maintenance fees.

Increase Availability

On the other hand, there are components replaceable only by Support. These are the Field Replaceable Units (FRUs).

To avoid downtime for the customer, and under the right conditions, some FRUs can be excluded from the system at boot time: PSUs, processors, cores, QPI links, XQPI links, PCIe boards and embedded Ethernet controllers are among the elements that can be excluded at boot time, minimizing downtime during servicing.

Now I want to talk about another set of features that Bull’s BCS Architecture is leveraging: Intel RAS.

What is RAS and what is its purpose?

Today’s crucial business challenges require the handling of unrecoverable hardware errors, while delivering uninterrupted application and transaction services to end users. Modern approaches strive to handle unrecoverable errors throughout the complete application stack, from the underlying hardware to the application software itself.

RAS Flow – Courtesy of Intel

Such solutions involve three components:

reliability, how the solution preserves data integrity,

availability, how it guarantees uninterrupted operation with minimal degradation,

serviceability, how it simplifies proactively and reactively dealing with failed or potentially failed components.

This post covers only the memory management mechanisms providing reliability and availability. Next post will cover other mechanisms.

Memory Management mechanisms

Memory errors are among the most common hardware causes of machine crashes in production sites with large-scale systems.

Researchers observed that more than 8 percent of DIMMs and about one-third of the machines in the study were affected by correctable errors per year.

At the same time the annual percentage of detected uncorrected errors was 1.3 percent per machine and 0.22 percent per DIMM.

The capacity of memory modules has increased – following Moore’s law – over the last two decades. In the ’80s you could buy 2MB memory modules; 20 years later, 32GB memory modules hit the market. That is a 16,000x improvement.
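A quick back-of-the-envelope check of that factor:

```python
# 32 GB expressed in MB, divided by the 2 MB modules of the '80s.
growth = (32 * 1024) / 2
print(growth)   # roughly the quoted 16,000x
```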

One of the unique reliability and availability features of the bullion is its memory management and memory protection. From basic ECC up to Memory Mirroring, memory protection mechanisms can guarantee up to 100% memory reliability on the bullion.

Let’s have a look at some of those memory protection mechanisms available in the bullion:

ECC memory

ECC memory is the traditional memory correction mechanism: extra check bits stored alongside each memory word keep the memory system effectively free from single-bit errors. The mechanisms below go over and above it.
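As a hedged illustration of the single-bit correction ECC provides, here is a minimal Hamming-code sketch (real ECC DIMMs implement a hardware SECDED code over 64-bit words; the function names are my own):

```python
# Minimal single-error-correcting Hamming code, for illustration only.

def hamming_encode(data):
    """Encode a list of data bits; parity bits sit at power-of-two
    positions (1-indexed), so a valid codeword has syndrome 0."""
    n_parity = 0
    while (1 << n_parity) < len(data) + n_parity + 1:
        n_parity += 1
    code = [0] * (len(data) + n_parity)
    j = 0
    for i in range(1, len(code) + 1):
        if i & (i - 1):            # not a power of two: a data position
            code[i - 1] = data[j]
            j += 1
    for p in range(n_parity):      # parity bit p covers positions with bit p set
        mask = 1 << p
        parity = 0
        for i in range(1, len(code) + 1):
            if i & mask and i != mask:
                parity ^= code[i - 1]
        code[mask - 1] = parity
    return code

def hamming_correct(code):
    """Return (corrected codeword, flipped position or 0 if clean)."""
    syndrome = 0
    for i, bit in enumerate(code, start=1):
        if bit:
            syndrome ^= i          # XOR of positions of all set bits
    if syndrome:
        code = code[:]
        code[syndrome - 1] ^= 1    # the syndrome names the bad position
    return code, syndrome
```

Flipping any single bit of a codeword yields a non-zero syndrome equal to the flipped position, so the decoder repairs it transparently; detecting double-bit errors as well requires the extra overall parity bit (the “DED” in SECDED).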

Double Device Data Correction (DDDC)

DIMM & Rank Sparing

The commonly available DIMM Sparing is now enhanced to provide Rank Sparing. With Rank Sparing on dual-rank DIMMs, only 12.5% of the memory is set aside to enhance the reliability of the memory system. If the level of ECC-corrected errors becomes too high, the memory controller fails over to the spare. Note that DIMM and Rank Sparing do not protect against uncorrectable memory errors.
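The failover policy can be sketched as follows. The threshold value and rank layout are invented for the example; the real policy lives in the memory controller firmware:

```python
# Illustrative rank-sparing policy: when corrected-error counts on a
# rank cross a threshold, fail over to the reserved spare rank.

CE_THRESHOLD = 1000  # corrected errors before failover (assumed value)

class RankSparing:
    def __init__(self, ranks=8):
        self.spare = ranks - 1              # last rank reserved: 1/8 = 12.5%
        self.active = list(range(ranks - 1))
        self.ce_count = {r: 0 for r in self.active}
        self.spare_used = False

    def record_corrected_error(self, rank):
        self.ce_count[rank] += 1
        if self.ce_count[rank] >= CE_THRESHOLD and not self.spare_used:
            # copy the failing rank's contents to the spare, retire it
            self.active[self.active.index(rank)] = self.spare
            self.spare_used = True
            return f"rank {rank} failed over to spare rank {self.spare}"
        return None
```

Note how the 1-in-8 spare rank matches the 12.5% overhead quoted above, and that nothing here handles uncorrectable errors: sparing only reacts to corrected-error trends.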

DIMM Sparing- Rank Sparing – Courtesy of Bull

MCA Recovery

In a virtualized environment, the Virtual Machine Manager (VMM) shares the silicon platform’s resources with each virtual machine (VM) running an OS and applications.

In systems without MCA recovery, an uncorrectable data error would cause the entire system and all of its virtual machines to crash, disrupting multiple applications.

With MCA recovery, when an uncorrectable data error is detected, the system can isolate the error to only the affected VMs. The hardware notifies the VMM (supported by VMware vSphere 5.x), which then attempts to retire the failing memory page(s) and notify the affected VMs and components.

If the failed page is in free memory then the page is retired and marked for replacement, and operation can return to normal. Otherwise, for each affected VM, if the VM can recover from the error it will continue operation; otherwise the VMM restarts the VM.

In all cases, once VM processing is done, the page is retired and marked, and operation returns to normal.

It is possible for the VM to notify its guest OS and have the OS take appropriate recovery actions, and even notify applications higher up in the software stack so that they take application-level recovery actions.
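The decision flow described above can be condensed into a short sketch. The function name and the VM/page objects are simplified stand-ins of my own for hypervisor-internal state:

```python
# Hedged sketch of the MCA recovery decision flow on the VMM side.

def handle_uncorrectable_error(page, vms_using_page):
    """Retire the failing page and recover or restart only the
    affected VMs; returns the list of actions taken."""
    actions = []
    if not vms_using_page:
        # error hit free memory: retire the page, nothing else to do
        actions.append(f"retire {page}")
        return actions
    for vm in vms_using_page:
        if vm["can_recover"]:
            actions.append(f"{vm['name']}: notified, continues")
        else:
            actions.append(f"{vm['name']}: restarted")
    actions.append(f"retire {page}")   # in all cases the page is retired
    return actions
```

The key point the post makes survives in the sketch: only VMs touching the bad page are impacted, instead of the whole host crashing.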

Here is a video demoing the MCA Recovery (MCAR) with VMware vSphere 5.0

Here is a diagram of MCA recovery process:

Software-Assisted MCA Recovery Process – Courtesy of Intel

MCA Recovery is cool, but its main drawback is that it does not offer 100% memory reliability. The scrubbing process that goes through all memory pages to detect unrecoverable errors takes some time, and a few CPU cycles too.

If you are fortunate, MCA Recovery detects the error and reports it to the VMM (VMware vSphere 5.x); otherwise you will most probably end up with a purple screen of death.

Mirroring Mode

For 100% memory reliability, the bullion uses memory lockstep: data is written simultaneously to two different memory modules. It is the best memory protection mechanism for both reliability and availability, as it protects against both correctable and uncorrectable memory errors. The trade-off is that on four-memory-channel systems such as the bullion, you cut your available number of DIMM slots in half.
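A minimal model of the mirroring behaviour, assuming the memory controller transparently serves reads from the mirror when the primary copy is lost (the class and its internals are illustrative, not hardware):

```python
# Toy lockstep mirroring: every write lands in two modules; a failed
# read from the primary module is served from the mirror.

class MirroredMemory:
    def __init__(self):
        self.primary = {}
        self.mirror = {}
        self.failed_primary_addrs = set()  # simulated uncorrectable errors

    def write(self, addr, value):
        # both copies are written in lockstep
        self.primary[addr] = value
        self.mirror[addr] = value

    def read(self, addr):
        if addr in self.failed_primary_addrs:
            return self.mirror[addr]       # transparent failover
        return self.primary[addr]
```

The cost the post mentions is visible in the model: every value is stored twice, so usable capacity is half of the installed DIMMs.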

Now let’s deep-dive into these two key functionalities. Bear with me, it is quite technical.

Enhanced system performance with CPU Caching

CPU caching provides significant benefits for system performance:

Minimizes inter-processor coherency communication and reduces latency to local memory. Processors in each 4-socket module have access to the smart CPU cache state stored in the eXternal Node-Controller, thus eliminating the overhead of requesting and receiving updates from all the other processors.

Dynamic routing of traffic.
When an inter-node-controller link is overused, Bull’s dynamic routing design avoids performance bottlenecks by routing traffic through the least-used path. The system uses all available lanes and maintains full bandwidth.

BCS Chip Design – Courtesy of Bull

With the Bull BCS architecture, thanks to CPU caching, coherency snoop responses consume only 5 to 10% of the Intel QPI bandwidth and that of the switch fabric. Bull’s implementation provides local memory access latency comparable to regular 4-socket systems and 44% lower latency compared to 8-socket ‘glueless’ systems.

Via the eXtended QPI (XQPI) network, a Bull 4-socket module communicates with the other 3 modules as if the whole machine were a single 16-socket system. Therefore all accesses to local memory have the bandwidth and latency of a regular 4-socket system. Each BCS has an embedded directory of 144 SRAMs of 20 Mb each, for a total memory of 72 MB.

Adding to that, the BCS provides twice as many eXtended QPI links to interconnect additional 4-socket modules, where an 8-socket ‘glueless’ system only offers 4 Intel QPI links. Those links are utilized more efficiently as well. By recording when a cache in a remote 4-socket module has a copy of a memory line, the BCS eXternal Node-Controller can respond on behalf of all remote caches to each source snoop. This removes snoop traffic that would otherwise consume bandwidth AND reduces memory latency.
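The directory idea can be made concrete with a toy model (the data structures and method names are invented for illustration; the real directory lives in the BCS SRAMs):

```python
# Toy node-controller directory: record which remote modules hold a
# copy of each memory line, so snoops only go where they are needed.

class NodeControllerDirectory:
    def __init__(self):
        self.sharers = {}  # cache line -> set of remote module ids

    def record_copy(self, line, module):
        self.sharers.setdefault(line, set()).add(module)

    def snoop(self, line):
        """Return the modules that must actually be snooped for this
        line. An empty result means the node controller can answer on
        behalf of all remote caches, keeping traffic off the links."""
        return self.sharers.get(line, set())
```

This is exactly what removes the broadcast-snoop tax: instead of asking every remote module on every access, the controller asks only the recorded sharers, and usually none at all.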

Rapid recovery is enabled by improved error logging and diagnostics information.

Bullion Multi-Modules BCS Design – Courtesy of Bull

What about RAS features?

Bull designed the BCS with RAS (Reliability, Availability and Serviceability) features consistent with Intel’s QPI RAS features.

The point-to-point links that connect the chips in the bullion system – found in QPI, the Scalable Memory Interconnect (SMI) and the BCS fabric – have many RAS features in common, including:

Cyclic Redundancy Checksum (CRC)

Link Level Retry (LLR)

Link Width Reduction (LWR)

Link Retrain

All the link resiliency features above apply to both Intel QPI/SMI and the XQPI fabric (BCS). They are transparent to the hypervisor, and the system remains operational.
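The CRC plus Link Level Retry pair can be sketched end to end. The CRC-8 polynomial and frame format below are chosen arbitrarily for illustration; QPI defines its own link-layer CRC and retry protocol:

```python
# Sketch of CRC-protected transmission with link-level retry: the
# receiver checks a CRC over each frame and a retransmit happens on
# mismatch, escalating to "link retrain" when retries are exhausted.

def crc8(data, poly=0x07):
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc

def send_with_retry(payload, channel, max_retries=3):
    """Transmit payload+CRC over a lossy channel, retrying on CRC error."""
    frame = payload + bytes([crc8(payload)])
    for attempt in range(max_retries + 1):
        received = channel(frame)
        if crc8(received[:-1]) == received[-1]:
            return received[:-1], attempt
    raise IOError("link retrain required: retries exhausted")

# A channel that corrupts the first transmission only:
attempts = {"n": 0}
def flaky(frame):
    attempts["n"] += 1
    if attempts["n"] == 1:
        return bytes([frame[0] ^ 0xFF]) + frame[1:]
    return frame

data, retries = send_with_retry(b"flit", flaky)
```

The corrupted first attempt fails the CRC check and the retry delivers the frame intact, which is why these features are transparent to the hypervisor: the error never propagates above the link layer.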

XQPI Cabling for a 16 Sockets bullion – Courtesy of Bull

In part 3 I will write about how Bull improves memory reliability by forwarding memory error detections right into the VMware hypervisor to avoid the purple screen of death. This is not science fiction! It is available in a shop near you 🙂


BCS Architecture

The BCS enables two key functionalities: CPU caching and the resilient eXternal Node-Controller fabric. These features serve to reduce communication and coordination overhead and provide availability features consistent with the Intel Xeon E7-4800 series processors.

BCS meets the most demanding requirements of today’s business-critical and mission-critical applications.

As shown in the above figure, a BCS chip sits on a SIB board that is plugged into the main board. When running in single-node mode, a DSIB (Dummy SIB) board is required.

BCS Architecture – 4 Nodes – 16 Sockets

As shown in the above figure, the BCS Architecture scales to 16 processors, supporting up to 160 processor cores and up to 320 logical processors (with Intel HT). Memory-wise, the BCS Architecture supports up to 256 DDR3 DIMM slots, for a maximum of 4TB of memory using 16GB DIMMs. IO-wise, there are up to 24 IO slots available.
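A quick sanity check of those maxima (DIMM size and socket count from the text; 10 cores per socket per the top Xeon E7-4800 series parts):

```python
# Verify the quoted configuration maximums.
dimm_slots, dimm_gb = 256, 16
total_memory_tb = dimm_slots * dimm_gb / 1024      # 4.0 TB

sockets, cores_per_socket = 16, 10
total_cores = sockets * cores_per_socket           # 160 cores
logical_cpus = total_cores * 2                     # 320 with Hyper-Threading
```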

BCS key technical characteristics:

ASIC chip of 18x18mm with 9 metal layers

90nm technology

321 million transistors

1837 (~43×43) ball connectors

6 QPI (~fibers) and 3×2 XQPI links

High speed serial interfaces up to 8GT/s

Power-conscious design with selective power-down capabilities

Aggregated data transfer rate of 230GB/s that is 9 ports x 25.6 GB/s

Up to 300Gb/s bandwidth

BCS Chip Design – Courtesy of Bull

Each BCS module groups the processor sockets into a single “QPI island” of four directly connected CPU sockets. This direct connection provides the lowest latencies. Each node controller stores information about all the data located in the processors’ caches. This key functionality is called “CPU caching”. This is just awesome!

In my two previous posts, I introduced the concepts of ‘glueless’ and ‘glued’ as the two main scale-up architectures. You can read them here and here. You may also want to read this post in the series about the need to go now for a scale-up approach to virtualize the last bit, that is, resource-hungry business- and mission-critical applications.

We’ve seen that the ‘glued’ architecture is the best architecture choice to scale-up beyond 4- and 8-socket systems. We’ve also noticed that the quality of the OEM-developed eXternal Node-Controllers is critical.

Meet the Bull Coherence Switch Architecture. BCS Architecture is Bull’s implementation of the glued eXternal Node-Controller. It is the design foundation for Bullion x86 servers that need to deliver more scalability, resiliency, and efficiency to meet requirements of the most demanding applications in the business computing.

A bit of history: the BCS technology is the foundation of the bullx Supernode series of supercomputers, designed to run HPC applications that require huge volumes of shared resources, in particular shared memory.

Bull decided to leverage that technology in its bullion series, pushing the limits of x86 enterprise-class servers to a new level.

The bullion server achieved a peak result of 4,110 on the SPECint®_rate2006 benchmark – Courtesy of Bull

SPECint®_rate2006 data, July 2012

These results show not only that Bull’s ‘glued’ architecture is the way to go for a scale-up architecture, but also that Bull engineered a masterpiece of technology in the BCS.

Remember that one of the main drawbacks of the ‘glueless’ architecture is that up to 65% of the Intel QPI link bandwidth is consumed by the QPI source-broadcast snoopy protocol, that is, by maintaining cache coherency as the socket count increases. The performance increase is not linear with the number of added resources, and you are limited to 8-socket systems!

Bull’s BCS solves these issues and shows that you can scale up beyond 8-socket systems without compromising performance. HPC technology delivered to x86 enterprise-class servers, thanks to Bull’s BCS eXternal Node-Controller!

In the next blog posts we will deep-dive into the BCS technology and uncover the secret sauce that makes the BCS sooo awesome!

In my previous article we discussed the ‘glueless’ architecture. You may want to read part 1 before proceeding.

We have seen that the ‘glueless’ architecture has some serious drawbacks. Let’s see if the second main scale-up server architecture can mitigate those issues. Meet…

The ‘glued’ architecture

We’ve seen that in the ‘glueless’ architecture, coordination and communication between the processor sockets create a bottleneck. To overcome this problem, hardware manufacturers have added ‘glue’ to the architecture. This ‘glued’ architecture uses external node-controllers to interconnect QPI islands, that is, clusters of processor sockets.

Glued Architecture

Intel QPI links offer a scalable solution based on OEM-developed eXternal Node-Controllers (referred to as XNCs). An external node-controller design using the Intel Xeon E7-4800 series with its embedded memory controller implies a Cache Coherent Non-Uniform Memory Access (ccNUMA) system. The role of ccNUMA is to ensure cache coherency by tracking where the most up-to-date data is for every cache line held in a processor cache.

Latency between processor and memory in a ccNUMA system varies depending on the location of these two components in relation to each other. Manufacturers also want to minimize the bandwidth consumed by coherency snooping (Intel QPI source-broadcast snoopy protocol).
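That latency dependence is easy to model. The nanosecond figures below are invented purely for illustration; real numbers depend on the processor generation and the node-controller design:

```python
# Toy ccNUMA latency model: cost grows with the number of interconnect
# hops between the requesting socket and the memory holding the data.

LOCAL_NS = 100   # same-socket memory access (assumed)
HOP_NS = 60      # extra cost per interconnect hop (assumed)

def access_latency(hops):
    return LOCAL_NS + hops * HOP_NS

access_latency(0)   # local access
access_latency(2)   # two hops away: noticeably slower
```

This is exactly why the quality of the node controller matters: a good design keeps the effective hop count, and the coherency traffic per hop, as low as possible.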

Therefore the quality of the OEM-developed eXternal Node-Controllers is critical, and only a few manufacturers are able to provide a server architecture that scales in pace with the resources added to the system.

In the next article in this series, I will focus on Bull’s eXternal Node-Controller, called the BCS. Stay tuned!

DISCLAIMER

Views expressed here are mine, they are not read or approved in advance by any company and don’t reflect the views of my employer, my employer’s business partners, or clients. I am solely responsible for all content produced here. No information provided here was reviewed by or endorsed by my employer or any other vendor or organization. This is my own blog. Comments are moderated!