Performance, Data Integrity, and Controller Architectures

January 21, 2016

By Nick Triantos – Storage Architect

Reliability is a crucial concern for anyone tasked with implementing enterprise data storage systems, and a storage array’s controllers – and controller architecture – are especially important considerations.

Nimble Storage and other major array manufacturers have arrived at rather different solutions to the same set of problems – issues like data integrity, performance, failover times, and more – and it’s useful to compare the technical foundations for these different approaches.

Modern storage arrays typically contain two controllers or nodes, whose function is to service application I/O (input/output) requests, in addition to safely and securely storing and protecting data. Controllers can be configured to operate in isolation (a single controller), as a redundant pair (dual controllers), or as individual nodes in a scale-out cluster. Although the two terms are often used interchangeably these days, “node” properly describes a single controller that is part of a larger scale-out cluster and provides no redundancy on its own.

Variants aside, there are two main storage array architectures:

Active-Active arrays, in which both controllers are actively processing I/O requests

Active-Standby arrays, in which the active controller services all I/O requests, while the standby controller handles write-cache mirroring and stands ready to assume responsibility for servicing all I/O

Let’s take a closer look at each architecture:

Active-Active Arrays

Many brands of storage arrays use dual active controllers, most often in an “Active-Active-Asymmetric” (A/A/A) model. The “asymmetric” part refers to the fact that not all paths to a volume or LUN (Logical Unit Number) through a target port have equal access characteristics. Some paths are shorter (a direct path to the LUN-owning controller) while other paths are longer (an indirect path through a non-owning controller).
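This asymmetry is what ALUA (Asymmetric Logical Unit Access) exposes to hosts. A minimal sketch, with invented names and data, of how a host-side multipathing driver might prefer paths by ALUA state: “active/optimized” paths lead to the LUN-owning controller, while “active/non-optimized” paths route I/O the long way through the partner:

```python
# Hypothetical sketch: prefer ALUA-optimized paths, fall back to
# non-optimized ones when the direct path to the owning controller is down.

def pick_path(paths):
    """Return the best available path for a LUN, preferring optimized ones."""
    preference = ["active/optimized", "active/non-optimized"]  # best first
    for state in preference:
        candidates = [p for p in paths if p["state"] == state and p["up"]]
        if candidates:
            return candidates[0]
    return None  # no usable path: I/O must be queued or failed

paths = [
    {"port": "ctrl-A:p1", "state": "active/optimized", "up": False},
    {"port": "ctrl-B:p1", "state": "active/non-optimized", "up": True},
]
print(pick_path(paths)["port"])  # ctrl-B:p1 -- the longer, non-owning path
```

Real multipathing stacks (e.g. Linux dm-multipath) implement this preference ordering, along with path health checking and failback, in far more detail.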

In an Active-Active Asymmetric model, there are typically two controllers or nodes with the following characteristics:

Each controller owns a subset of the total number of LUNs or even disks

Both controllers provide data services, with each controller only servicing I/O requests to the LUNs it owns

In the event of a controller failure, the surviving controller takes over operations and services I/O requests for its own LUNs and the LUNs of its failed partner

In theory, having dual active controllers translates to double the processing power in the aggregate. In reality, however, things often look quite different, and failover events can have disastrous ramifications for applications.

Typically, A/A/A arrays do not provide a way of limiting per-controller processing to a ceiling of less than 50%, or, at the very least, of monitoring controller headroom.

While an administrator is able to distribute LUNs evenly across controllers, this practice is nothing more than a simple LUN-balancing exercise; it does not translate into controller I/O load balancing. Thus, more often than not, these arrays end up with unbalanced I/O loads.
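A tiny illustration, with made-up numbers, of why LUN balancing is not load balancing: each controller owns the same number of LUNs, yet the I/O load lands almost entirely on one of them because per-LUN workloads differ:

```python
# Hypothetical example: LUN ownership is evenly split (two LUNs per
# controller), but the I/O load those LUNs generate is wildly unequal.

lun_owner = {"lun0": "A", "lun1": "A", "lun2": "B", "lun3": "B"}
lun_iops  = {"lun0": 45000, "lun1": 30000, "lun2": 4000, "lun3": 1000}

load = {"A": 0, "B": 0}
for lun, ctrl in lun_owner.items():
    load[ctrl] += lun_iops[lun]

print(load)  # {'A': 75000, 'B': 5000} -- 2 LUNs each, but a 15:1 load skew
```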

The Active-Active model was conceived in the mid-to-late 1990s, primarily to enhance performance, back when controllers were under-powered, CPU and memory were expensive resources, multi-core systems didn’t exist, and low latency existed only in the backplane.

For an A/A/A system to maintain the same performance levels during a failover event, each controller must be running at less than 50% resource utilization. So if we think about it, an Active-Active Asymmetric system has the same amount of usable system resources as an Active-Standby one.
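The headroom arithmetic above can be sketched in a few lines (a simplified model, ignoring per-request overheads): an A/A/A pair preserves performance through a failover only if the combined utilization of both controllers fits within a single controller.

```python
# Sketch of the failover headroom check: the surviving controller must
# absorb its partner's load on top of its own, so performance is preserved
# only when load_a + load_b <= 100%.

def survives_failover(load_a_pct, load_b_pct):
    return load_a_pct + load_b_pct <= 100

print(survives_failover(45, 45))  # True: surviving controller runs at 90%
print(survives_failover(60, 55))  # False: 115% cannot fit on one controller
```

Note that the 50%-per-controller rule of thumb is just the symmetric case of this condition; any split whose sum stays at or below 100% works.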

Additionally, in failover scenarios, some dual A/A/A controller arrays will disable write caching and revert to write-through caching, in which case every I/O becomes unbuffered and is acknowledged directly from media, thus causing further delays and latency.

Finally, software upgrades are scheduled events, typically on weekends or early morning hours when I/O activity is low, so as to minimize the degraded performance effect.

Summarizing the Active-Active Asymmetric model:

A mature model with a theoretical 2x processing power; complex to implement

Both controllers are active without balanced I/O load guarantees; therefore it’s easy to oversubscribe them

Performance impacts during failover unless each controller is utilized at less than 50%. (Technically, there would be no impact even if one controller were over 50%, provided the other controller was under 50% by at least the same amount, such that the combined failed-over load was less than or equal to 100%)

Failover speed is dependent upon the failed controller’s load, as well as the surviving controller’s own load and available resources, in addition to SCSI stack timeouts and external hardware components such as switches

Due to the inherent inability to control per-controller load, it’s possible to experience host disconnects and I/O errors if SCSI driver and HBA driver timeouts are exceeded during failover events

Upon a failover event, it’s possible in some arrays that write-through cache operations will further degrade I/O and latency, on top of the impact from the surviving controller’s own load and current resource utilization

Software upgrades are performed during off hours to minimize performance effects

While most of the traditional storage array architectures have leveraged the Active-Active model (A/A/A to be precise), the majority of the new generation storage architectures that have entered the market in the last five years or so have elected instead to implement an Active-Standby model.

Active-Standby Arrays

In an Active-Standby model, there are two controllers or nodes with the following characteristics:

One controller runs in active mode, owns all of the volumes, services all I/O requests and provides data services such as compression, snapshots, replication, RAID, and so on

The partner controller runs in standby mode, ready to take over should the active controller experience a failure

This is a mature, simple, and straightforward model to implement, providing deterministic failover times without any performance impact after a failover, regardless of the load of the failed active controller.

An additional benefit of the Active-Standby model is that it eliminates the risk of prolonged degraded performance due to software upgrades. For instance, more than 60 percent of Nimble Storage customers upgrade their production arrays during regular business hours.

The Active-Standby model is sometimes confused with the much older Active-Passive model, but they’re vastly different. For instance, the Active-Passive model sometimes required manual intervention for a successful failover to happen; even in auto-failover mode, it was a lengthy process.

This contrasts with the Active-Standby model, where the Standby controller does actual work:

It has access to the same media as its active partner

It exchanges and monitors heartbeats with its partner

It always has an up-to-date copy of all inbound writes

It sends write acknowledgements to its active partner, which in turn sends acknowledgements to the host
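The write path described above can be sketched as follows (a simplified model with invented class names, not Nimble’s actual implementation): the active controller mirrors each write to the standby’s NVRAM and acknowledges the host only after the mirror is confirmed, so the standby always holds a failover-ready copy.

```python
# Hypothetical sketch of Active-Standby write mirroring: the host ack is
# gated on the standby confirming it has a copy of the write in NVRAM.

class Standby:
    def __init__(self):
        self.nvram = []

    def mirror_write(self, block):
        self.nvram.append(block)   # standby keeps an up-to-date write copy
        return "ack"               # acknowledge back to the active partner

class Active:
    def __init__(self, standby):
        self.nvram = []
        self.standby = standby

    def handle_write(self, block):
        self.nvram.append(block)
        if self.standby.mirror_write(block) == "ack":
            return "host-ack"      # host sees the ack only after mirroring

standby = Standby()
active = Active(standby)
print(active.handle_write(b"data"))   # host-ack
print(standby.nvram == active.nvram)  # True: standby is failover-ready
```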

Those familiar with enterprise networking solutions will recognize that the Active-Standby model is implemented in many of these devices. For example, the Cisco Supervisor modules in the Catalyst, Nexus and MDS lines of Ethernet and Fibre Channel Enterprise Director class switches run in an Active-Standby configuration, as do many host-side clusters (WFC, HACMP, VCS, etc).

Summarizing the Active-Standby model:

Provides the same deterministic performance level after a failover event

Eliminates elongated performance risk due to software upgrades. Such upgrades can be performed even during business hours.

Provides fast failovers (the standby controller has no load of its own, so after failover it services only its partner’s)

Larger effective NVRAM or write cache size – the system uses the entire controller NVRAM or write cache without splitting it for partner writes. That means 20GB of NVRAM in the Active controller actually holds 20GB worth of data, not 10GB + 10GB to store the partner’s mirrored writes.

Given the technical superiority of the Active-Standby approach, and given Nimble’s obsession with data reliability, it’s no surprise that it’s the approach our engineering team has taken.

At Nimble Storage, we’ve elected to implement the Active-Standby model in order to provide a simple, fast, predictable, and reliable model without having our customers worry about balancing LUNs, balancing I/O load, or wondering about controller headroom and what would happen to performance should a controller failure occur.