New Generation Storage Arrays

I recently read a blog post by Josh Odgers about hardware support contracts, specifically the availability (or lack thereof) of storage controllers (link: http://www.joshodgers.com/2014/10/31/hardware-support-contracts-why-24×7-4-hour-onsite-should-no-longer-be-required/). So I wanted to share my experience with storage controller availability and how modern storage systems deliver availability as well as performance. I have used examples of a system I have worked on extensively (XtremIO) and of other vendor technology that I have read up on (EMC VMAX and SolidFire).

Nutanix has a very good “shared nothing” architecture. But not everyone uses Nutanix (no judgement here), and I have also been told that companies mix and match vendors (whoever that is :P). Josh raises a couple of very good points about the legacy storage architecture of having two storage controllers processing various workloads in today's high-performance, low-latency world.
There are a few exceptions to the rule above: EMC VMAX, EMC XtremIO and SolidFire. All of these storage systems have more than two storage controllers, and they all provide scale-up and scale-out architectures. (Note: if I have missed any other vendors with more than two storage controllers, let me know and I will include them in the post.) I think the new age storage systems should not be called SANs, because they are not similar to the age-old SAN architecture of providing just shared storage. These days storage systems do so much more than provide shared storage.
VMware acknowledges this, and hence is introducing vVols, which provide a software definition of the capabilities offered by the storage systems. Hyperconverged is easily the latest technology, and in some cases it is definitely superior to legacy SANs, but it's not going to replace everything just yet.
Let's delve deeper into the process. Let's take the legacy storage architecture and see how it behaves in scale-out/up and failure scenarios.
Let's say a new SAN has been provisioned for a project with a definite performance requirement, and so has been commissioned with a limited number of disks. This is an active-active array, so both controllers are equally used.

As you can see, the IOPS requirement is being met 100%, and CPU and memory have an average utilisation of 20-25%. This carries on for a few months, until another project starts with a bigger IOPS requirement, and so more disk is added (in traditional arrays: more IOPS = more disk).

As you can see, the average utilisation of the storage controllers in the SAN has spiked to about 45-50% on both controllers. A few months later, another project kicks off, or the current project's scope is expanded to include more workloads; you can see where this is going. Let's say the controllers are not under stress and are happily pedalling along at an average 70% utilisation.

BANG! One of the storage controllers goes down because someone, or something, has gone wrong.

So what's happened here is that, until the faulty part is replaced, the IOPS requirement can't be met by the single surviving controller; CPU and memory utilisation spike so high that processing anything more becomes impossible.
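The arithmetic behind this failure scenario is worth making explicit. Here is a quick back-of-the-envelope sketch (the 70% figure comes from the scenario above; everything else is simple proportion):

```python
# When a controller fails, its share of the load lands on the survivors.
# With only two controllers running at 70% each, the survivor is asked to
# do 140% of what it can; with eight controllers, the survivors absorb it.

def survivor_utilisation(per_controller_util, controllers, failed):
    """Total load redistributed over the surviving controllers."""
    surviving = controllers - failed
    return per_controller_util * controllers / surviving

two_ctrl = survivor_utilisation(0.70, controllers=2, failed=1)    # 140% -- overload
eight_ctrl = survivor_utilisation(0.70, controllers=8, failed=1)  # 80% -- survivable
print(two_ctrl, eight_ctrl)
```

The same 70% average that is perfectly comfortable on an eight-controller system is fatal on a two-controller one.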

This is where the new age Storage Systems (see I am not calling them SANs anymore) have the upper hand. Let me explain how.

Let's take EMC XtremIO as an example. Each XtremIO node consists of the following components (you can also read about XtremIO in my previous blog posts).

XtremIO is made up of four components: the X-Brick, Battery Backup Units, Storage Controllers and an InfiniBand switch. The InfiniBand switch is only used when there is more than one X-Brick. Each X-Brick node consists of a Disk Array Enclosure (25 eMLC SSDs) with a total of either 10 or 20 TB of usable storage. That is before the deduplication and compression algorithms kick in and take the logical capacity to close to 70 TB on the 10 TB cluster and 48 TB on the 20 TB X-Brick.
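To make the capacity inflation concrete, here is a rough sketch of the arithmetic. The reduction ratios below are illustrative assumptions, not XtremIO guarantees; real reduction depends entirely on the workload (VDI dedupes very well, pre-compressed media barely at all):

```python
# Data reduction multiplies physical capacity into logical capacity.
# Ratios here are made-up illustrations for one possible workload.

def logical_capacity_tb(physical_tb, dedupe_ratio, compression_ratio):
    return physical_tb * dedupe_ratio * compression_ratio

# A 10 TB X-Brick with an assumed 3.5:1 dedupe and 2:1 compression:
print(logical_capacity_tb(10, 3.5, 2.0))  # -> 70.0
```

Under those assumed ratios you land on the ~70 TB logical figure for a 10 TB brick.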

You can't add more disk to a node; if you want more disk you HAVE to buy another XtremIO node and add it to the XMS cluster. When you add more than one node to the cluster, you also get an InfiniBand switch through which all the storage controllers in the storage system communicate.

The picture above shows the multiple controllers in a two-node XtremIO cluster (picture from Jason Nash's blog). This can be scaled out to six-node clusters, and there is no limit on how many clusters you can deploy.

Each storage controller has dual 8-core CPUs and 256 GB of RAM, which by any measure is a beast of a controller. All of the system's metadata is held in memory, so there is never a requirement to spill metadata onto the SSDs. In traditional arrays, when storage is expanded with multiple disk trays, metadata is also written to the spinning disk; this not only makes metadata reads and writes slow, it also consumes additional backend IOs. When thousands of IOPS are required, the system digs itself deeper, consuming ever more IOPS just to read and write metadata.
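The backend IO amplification described above can be sketched with made-up but plausible numbers (the per-operation metadata IO count is an assumption for illustration, not a measured figure for any array):

```python
# If metadata lives on spinning disk, each host IO may cost extra backend
# IOs to read and update that metadata; if metadata is pinned in
# controller RAM, the backend only sees the data IO itself.

def backend_ios(host_ios, metadata_ios_per_op, metadata_in_ram):
    overhead = 0 if metadata_in_ram else metadata_ios_per_op
    return host_ios * (1 + overhead)

host_iops = 10_000
print(backend_ios(host_iops, metadata_ios_per_op=2, metadata_in_ram=False))  # 30000
print(backend_ios(host_iops, metadata_ios_per_op=2, metadata_in_ram=True))   # 10000
```

With disk-resident metadata, the backend has to deliver three times the IOPS the hosts actually asked for; with in-memory metadata it delivers exactly what was asked.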

So let's take the example from above, where during the second stage of the project lifecycle more IOPS were required. Even if space were the only constraint, an additional XtremIO node doubles the IOPS available while also providing an additional 70 TB of logical capacity.
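Because each X-Brick brings its own controllers, IOPS and capacity grow together linearly. A tiny sketch of that scale-out arithmetic (the per-brick IOPS figure is a placeholder, not a vendor spec; the 70 TB is the logical figure from the post):

```python
# Linear scale-out: every brick added contributes both performance and
# capacity, unlike a traditional array where only disk is added.

PER_BRICK_IOPS = 150_000    # assumed per-brick IOPS, illustrative only
PER_BRICK_LOGICAL_TB = 70   # logical capacity per brick

def cluster_totals(bricks):
    return bricks * PER_BRICK_IOPS, bricks * PER_BRICK_LOGICAL_TB

for n in (1, 2, 4, 6):
    iops, tb = cluster_totals(n)
    print(f"{n} brick(s): {iops} IOPS, {tb} TB logical")
```

Contrast this with the legacy array earlier in the post, where adding disk raised the IOPS demand on the same two fixed controllers.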

Even though there is still some effect on the surviving storage controllers, the IOPS requirement is still met by the new age storage systems. This is partly because of specific improvements in the way metadata is accessed in these systems. Let's look at the way metadata is accessed in traditional systems.

As you can see, metadata is not just in controller memory but is also dispersed across the spinning disks. Regardless of how fast a spinning disk is, it is always going to be slower than fetching metadata from RAM.

Now let's look at how metadata is distributed in XtremIO.

As storage requirements expand, more controllers are added, and the metadata is distributed across their memory.
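A toy sketch of the idea of spreading metadata ownership across controllers by content fingerprint (XtremIO addresses data by content hash; the placement scheme below is my simplification for illustration, not the real algorithm, and the controller names are made up):

```python
# Each block's fingerprint deterministically maps to one controller, so
# metadata (and the RAM that holds it) scales as controllers are added.
import hashlib

CONTROLLERS = ["SC1", "SC2", "SC3", "SC4"]

def owning_controller(block: bytes) -> str:
    fingerprint = hashlib.sha256(block).digest()
    # Spread fingerprints evenly over however many controllers exist.
    return CONTROLLERS[fingerprint[0] % len(CONTROLLERS)]

print(owning_controller(b"some 4KB block of data"))
```

The point of the sketch: adding a controller adds metadata capacity and lookup horsepower at the same time, which is exactly what the diagram above is showing.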

Other Storage Systems

If we move away from XtremIO and take EMC VMAX as an example, each VMAX 40K can be scaled out to eight engines. Each engine has 24 cores of processing power and 256 GB of RAM, so a fully configured system scales up to about 2 TB of RAM and 192 cores across all eight engines. There is a maximum of 124 FC front-end ports across the eight engines.

Another example of a very good storage system is SolidFire. SolidFire has a scale-out architecture across multiple nodes and scale-up options for specific workloads. Nodes start at about 64 GB of RAM and go all the way up to 256 GB.

So here we go: traditional SANs are few and far between today. There are all kinds of companies who, for various reasons, use all kinds of vendors. While #Webscale is taking off quite well, storage systems still have a place in the datacenter, and as long as storage companies keep re-inventing their storage systems, they will remain in the datacenter alongside #Webscale.

PS: Before anyone says zoning is not mentioned, I will tackle how to zone in the next blog post, or maybe after I work out how to explain zoning. I am not usually involved in zoning but will find out and blog about it as well.

9 Comments

Just curious: in an XtremIO configuration like you have above, if a controller fails, does the system mirror write cache to another controller in the cluster? (e.g. a NetApp "cluster" does not do this). Even if you are "wide striped" across multiple controller pairs with their own back end, you're only going to be as fast as the slowest part of the cluster, so if part of it is badly degraded because it can no longer cache writes (due to a failed component), that would be a bad thing.

I believe VMAX provides this level of protection (as does HDS VSP, HP 3PAR, and I assume IBM's high end too). I don't think XtremIO does, but by no means am I an expert on either platform, though in my brief talks with EMC on this they haven't tried to correct me.

As someone who has a 90%+ write workload, it is a feature I have liked for a long time (even before I had a 90% write workload) to have in my 3PAR boxes (my latest is a 4-controller 7450 which sports up to 500 TB of native flash before dedupe).

While I've never seen one, or known anyone who runs them, Fujitsu has also had an 8-controller-capable system for many years (3PAR has had one for more than 10 years too).

I don't have information on whether the write caches are mirrored across the cluster controllers. I imagine this would be a massive overhead. In any case, if a controller is that badly degraded, it won't have any access to the underlying infrastructure. I don't know the algorithm used for calculating the write cache hit percentage per controller.

There really isn't a concept of mirroring write cache per se with XtremIO. Write cache is somewhat of an artifact of hybrid disk arrays, and consequently so is the associated need to protect that cache by mirroring it across controllers in case of controller failure.

With XtremIO, “the system uses a highly available back-end InfiniBand network (supplied by EMC) that provides high speeds with ultra-low latency and Remote Direct Memory Access (RDMA) between all storage controllers in the cluster. By leveraging RDMA, the XtremIO system is in essence a single, shared memory space spanning all storage controllers.” Writes really just flow through a content engine, which is managed on metadata within the controller and persists straight on to the SSD's within the X-Brick. “Every metadata update made on an XtremIO controller is immediately journaled over RDMA to other controllers in the cluster. These journals are persisted to SSD using an efficient write amortization scheme that coalesces and batches updates to more efficiently use the flash media and avoid write amplification. Metadata is protected on flash using XDP (XtremIO Data Protection) and other techniques. This is ultra safe and tolerates any type of failure, not just power outages.”
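The journaling behaviour in the quoted passage can be sketched roughly like this. This is only an illustration of the concept (replicate the metadata update to peers before acknowledging, persist to SSD lazily in batches); it is not EMC's implementation, and all names here are made up:

```python
# Concept sketch: a metadata update is fanned out to peer controllers
# (standing in for the RDMA journal) before the write is acknowledged,
# so any single controller failure loses nothing.

class Controller:
    def __init__(self, name):
        self.name = name
        self.journal = []  # in-memory journal, later amortised to SSD

    def receive_journal_entry(self, entry):
        self.journal.append(entry)

def apply_metadata_update(local, peers, entry):
    local.journal.append(entry)
    for peer in peers:              # stands in for the RDMA fan-out
        peer.receive_journal_entry(entry)
    return "acked"                  # safe once the peers hold the entry

a, b, c = Controller("SC1"), Controller("SC2"), Controller("SC3")
print(apply_metadata_update(a, [b, c], {"lba": 42, "fingerprint": "ab12"}))
print(len(b.journal), len(c.journal))  # -> 1 1
```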

To get back to the heart of your question, XtremIO has a pretty elegant n-way design (for scale-out) to protect against controller failures which completely sidesteps the need for mirrored write cache protection schemes perpetuated in hybrid disk array architectures…

OK, thanks. Yeah, I agree; I think that is a situation where EMC would push the VMAX, since it has that ability (if the customer requires such a feature anyway). It is one of the many reasons (I'm obviously biased in my own preferences, no denying that) I have liked the 3PAR architecture for some time now (customer since 2006). Here is a picture of how it works on a 3PAR cluster:

4-node and 2-node clusters are the same, just with fewer connections (always one direct connection from each node to every other node in the cluster).

I have asked them about going higher than 8 nodes, and they said there is no technical restriction on the number of nodes; they just believe the "failure domain" is big enough that going much beyond 8 controllers is not optimal. Which I find curious. Many storage systems out there top out at the same number, whether it is VMAX with its 8 engines (their current data sheet touts dozens of engines, but the VMAX 40K goes to 8 max for now), 3PAR with 8 controllers, Fujitsu with 8, or even HDS BlueArc NAS with 8 controllers. I used to have an Exanet NAS cluster (the company went bust in 2010 and is now owned by Dell) which maxed out at 8. So many seem to "only" go to 8, which to me is a funny coincidence. I haven't noticed any "scale up" architectures that go beyond 8.

This is not exactly new. EqualLogic has been doing this for more than a decade.

I think the problem is that traditional storage vendors have invested so much in their legacy platforms that the cost of transitioning to scale-out is prohibitive. This is why all the new vendors are able to bring scale-out to market: they can see that it's a better fit for customer requirements, and they don't have to explain the cost of transition to a board or shareholders.

Can't believe you think that EMC is the only vendor that has a multi-controller disk array, when dual controllers have been the minimum standard in all enterprise arrays for the last 20+ years!
So add HP, NetApp, Dell, IBM and HDS to your list, and there are probably another 20 or so to consider as well.

Thanks Gerald. As I said in my blog post, I only have experience with EMC products. I mentioned SolidFire as well because I know some of the people who work there and have been privy to the technology. Obviously not everyone can have exposure to all the storage products. The point of this post was to explain the differences between legacy storage systems and modern (multi-controller) systems. I am sure most storage vendors have multi-controller systems; I gave examples of a few I have worked with.

The way I read his blog post, it was about systems that support more than two controllers, which is still quite rare (in a shared, scale-up architecture). NetApp and Dell most certainly do not (not even NetApp's latest FlashRay does, and their scale-out clustering is not really clustering; it's more of a workgroup, very similar to how a VMware cluster works, in that you can't span a single VM across cluster nodes, for example).

I am not well enough versed in the Fujitsu DX8700 architecture to know how tightly coupled their 2-8 node clustering is.

HDS, I don't believe, goes beyond two controllers, though in their VSP platform the architecture is different and reliable enough that it is not required (though it generally costs more and is much, much more complex to manage; similar to the EMC Symmetrix before VMAX, I believe).