Infinidat: Enterprise reliability and performance

Reliability of a system is usually expressed as a percentage of uptime. A system with an uptime of at least 99.9% should not exceed roughly 8 hours and 45 minutes of unplanned downtime each year. ‘Five nines’, or 99.999% availability, is often used in IT: this equates to roughly 5 minutes of downtime per year. For Infinidat this wasn’t good enough, so they built the InfiniBox with a reliability of 99.99999%. That’s only 3.2 seconds of downtime per year. Yikes!
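For anyone who wants to check the arithmetic, here is a minimal sketch that converts an availability percentage into the downtime it allows per year. The function name and the choice of a 365-day year are my own assumptions, not anything from Infinidat.

```python
# Assumption: a non-leap year of 365 days.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def downtime_seconds_per_year(availability_percent: float) -> float:
    """Return the yearly downtime (in seconds) allowed by an availability percentage."""
    return (1 - availability_percent / 100) * SECONDS_PER_YEAR


# The three availability levels mentioned in this post.
for label, pct in [
    ("three nines", 99.9),
    ("five nines", 99.999),
    ("seven nines", 99.99999),
]:
    secs = downtime_seconds_per_year(pct)
    print(f"{label} ({pct}%): {secs:,.1f} s = {secs / 60:,.2f} min")
```

The results land close to the figures quoted above: a bit under 9 hours, a bit over 5 minutes, and just over 3 seconds.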

The basis for this incredible reliability lies in an extremely resilient, high-performance base architecture. All the hardware in the box is designed in an N+2 fashion: two consecutive failures should not cause an outage of the system. This applies not only to the internal power in the system, but also to the SAS interconnects between the internal servers and the disk enclosures, and to the disk groups, which are 14+2 sets. The system does not use hot spares, but instead reserves enough capacity to guarantee rebuilds of up to 12 failed drives (of course, only 2 drives in a set can fail at the same time).

The above results in a resilient system by itself, but components can still fail. I’ve seen it with plenty of newly installed systems: in the first couple of months after a system goes live, there’s an increased hard drive failure rate. Once the ‘weak’ drives have failed, the failure rate stabilizes, only to increase again once components start wearing out. There’s an interesting paper on hard drive failures and MTTF over here.

To filter out as many early failures as possible, Infinidat puts every system through a three-week burn-in before shipping the InfiniBox to the customer.

There are currently two InfiniBox models: the F6000 and its smaller sibling, the F2000. The bigger system lists at about $1M, with the F2000 coming in at roughly $300k. So who buys an InfiniBox? Mostly enterprise-level customers that want to consolidate their existing high-performance systems into one big, super reliable and fast array that can handle mixed workloads. Banks, for example.

Both systems place DRAM and NAND SSD cache in front of a big pile of spinning disks to reach 900K+ IOPS and a throughput of over 12 GB/s. Capacity-wise, the system will scale to roughly 2 PB in a single rack. Currently the arrays support NFS and FC, with support for iSCSI, SMB, Swift and FICON (yes, mainframes!) next on the roadmap.

My thoughts on Infinidat’s InfiniBox

The cost of an array always has to be weighed against the cost of downtime. And the cost of downtime is made up of many components: loss of image/reputation, loss of direct revenue, claims, failing to meet legislation, etc. Sure, there will be a lot of companies that do not want to spend $1M+ on an F6000 array. But there will also be plenty of companies that can’t live with an average storage downtime of 5 minutes each year. For them, 5 minutes of downtime might equate to $1M lost in revenue. So they need something that’s more resilient than a ‘5 nines’ system. Infinidat aims at those customers, and it does so with one hell of an InfiniBox.
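To put a rough number on that trade-off, here is a small back-of-the-envelope sketch. It takes the example above ($1M lost per 5 minutes of downtime) and assumes, purely for illustration, that the loss scales linearly with downtime and that a system uses its full downtime allowance every year. The names and the linear cost model are my assumptions; real downtime costs are rarely this tidy.

```python
# Assumption: a non-leap year of 365 days.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

# Hypothetical cost model taken from the example in the text:
# $1M lost for every 5 minutes of downtime, scaled linearly.
COST_PER_SECOND = 1_000_000 / (5 * 60)


def expected_annual_loss(availability_percent: float,
                         cost_per_second: float = COST_PER_SECOND) -> float:
    """Return the yearly loss if the full downtime allowance is used up."""
    downtime_seconds = (1 - availability_percent / 100) * SECONDS_PER_YEAR
    return downtime_seconds * cost_per_second


# Compare a 'five nines' array with a 'seven nines' array.
for pct in (99.999, 99.99999):
    print(f"{pct}% availability: ${expected_annual_loss(pct):,.0f} per year")
```

Under these assumptions the gap between five and seven nines is about two orders of magnitude in yearly loss, which is the kind of difference that can make a $1M array look cheap.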

Check out the presentations over here; the solution deep dive is pretty long but totally worth it if you want to get a better idea of the internals of the InfiniBox. Or read Vipin’s post over here.

Disclaimer: GestaltIT paid for the flight, hotel and various other expenses to make it possible for me to attend SFD8. I was however not compensated for my time and there is no requirement to blog or tweet about any of the presentations. Everything I post is of my own accord.