ScaleIO Architecture and failure units

I had the opportunity to play with a product that was new to me last week: EMC ScaleIO. It’s definitely not a new product (I troubleshot a system running version 1.31, and EMC released 2.0 at EMC World 2016), but I just hadn’t had the honor of working with one of those systems yet. ScaleIO is a software-defined storage solution that takes the local disks in your commodity servers and shares them out as block LUNs across an Ethernet network. This means the architecture can scale pretty well, both in capacity and performance, using hundreds (if not thousands) of servers and disks.

ScaleIO Architecture

The ScaleIO architecture consists of the following components:

MDM or Meta Data Manager. A ScaleIO system contains three or five servers with this role, which can coincide with the…

SDS or ScaleIO Data Server role. This is the component that takes the raw disk capacity from a server and presents it out to the…

SDC or ScaleIO Data Client. This is a device driver installed on a server that lets that server connect to the ScaleIO block devices.

In other words: the SDS servers make up the software-defined storage array, the SDCs are the clients that use this storage, and the MDM servers make sure everything runs smoothly and is manageable. Roles can be combined; e.g. a number of SDS servers can also have the MDM role. And if you want to utilize the remaining CPU & RAM capacity of that server, you can install the SDC and run a couple of applications on that server.

Data is stored on the ScaleIO system in a RAID1 mesh-mirrored layout, with each piece of data stored on two different (randomly selected) servers. Volumes can be created either thick or thin provisioned.
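
To make the mesh-mirror idea a bit more concrete, here is a toy Python sketch. It is not ScaleIO's actual placement algorithm; the function names (place_chunks, chunks_needing_rebuild) and the server names are made up for illustration. It simply picks two different servers at random for every chunk, which is enough to show that a single node failure never takes out both copies of a chunk.

```python
import random

def place_chunks(num_chunks, servers, seed=None):
    """Toy model: give each chunk two copies on two different servers."""
    if len(servers) < 2:
        raise ValueError("mirroring needs at least two servers")
    rng = random.Random(seed)
    # random.sample returns distinct elements, so the copies never share a server
    return [tuple(rng.sample(servers, 2)) for _ in range(num_chunks)]

def chunks_needing_rebuild(layout, failed_server):
    """Return the indices of chunks that lost one of their two copies."""
    return [i for i, pair in enumerate(layout) if failed_server in pair]

# Hypothetical four-node system
servers = ["sds1", "sds2", "sds3", "sds4"]
layout = place_chunks(1000, servers, seed=42)
degraded = chunks_needing_rebuild(layout, "sds1")
print(f"{len(degraded)} of {len(layout)} chunks need a new second copy")
```

Because the copies are spread over all servers, every surviving node holds part of the data that has to be re-mirrored, so in this simplified model a rebuild is a many-to-many operation rather than a single disk-to-disk copy.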

Connectivity between the nodes is always IP based: most customers will use traditional 1 Gbit or 10 Gbit Ethernet (the latter is preferred for performance), but it’s also possible to use an IP-over-InfiniBand (IPoIB) network. IPoIB is probably a couple of times more expensive, so if you use it, let us know if it’s worth it!

For management there’s a GUI and a powerful CLI that could really benefit from auto-completing commands when you press Tab a couple of times. There’s also a REST API, so feel free to automate to your heart’s content.

ScaleIO Failure unit

The system I had to troubleshoot had suffered a node failure a while ago and wasn’t rebuilding correctly. In a normal situation a rebuild (either for a disk or a node) should progress rapidly. In this case, however, the rebuild had been running for weeks without getting anywhere: you could see the rebuild processes running and shifting data between nodes, but there was no real progress.

It turned out that this was due to an incorrectly set spare percentage combined with a lack of free space. The ScaleIO system simply didn’t have enough space to rebuild all the data. This was fixed by moving some data off the ScaleIO system, which gave it some breathing room to finish the rebuild and balancing processes.

Best practice is to reserve enough space to cope with the loss of the largest fault unit in the failure domain. In this case, the largest unit is one complete node out of the four-node cluster. With equally sized nodes, this means that we need to reserve 25% of capacity for rebuilds. This might sound shocking, but by the time you’ve got a 20-node cluster, you suddenly only need 5%.
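
The arithmetic behind those percentages is simple enough to sketch in a few lines of Python. This is only a back-of-the-envelope helper, assuming the fault unit is a whole node; the function name spare_percentage and the capacities used are hypothetical.

```python
def spare_percentage(node_capacities_tb):
    """Percentage of raw capacity to reserve so the largest node can be rebuilt."""
    if not node_capacities_tb:
        raise ValueError("need at least one node")
    total = sum(node_capacities_tb)
    # The largest node is the largest fault unit we must be able to absorb
    return 100.0 * max(node_capacities_tb) / total

print(spare_percentage([10] * 4))           # four equal nodes   -> 25.0
print(spare_percentage([10] * 20))          # twenty equal nodes -> 5.0
print(spare_percentage([10, 10, 10, 20]))   # one bigger node    -> 40.0
```

Note the last case: with nodes of unequal size, it is the biggest node that determines the reservation, so one oversized node in a small cluster gets expensive quickly. The same calculation can be rerun whenever a node is added or removed.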

After you PuTTY into the server and log in to the ScaleIO management environment (SCLI), you can use the following command to adjust the spare policy:

The GUI will update instantly and show the newly configured spare capacity. Or you could run:

scli --query_all

Since the configured spare capacity is a percentage of the total array capacity, this means that you will have to recalculate it after changing the ScaleIO grid configuration: if you add an SDS, the percentage should go down and if you remove an SDS, the percentage should go up.

My thoughts on ScaleIO…

The system I troubleshot consisted of four Linux servers, each containing 6 solid-state drives. The migration of a single Hyper-V VM kicked off at close to 300 MB/s, which I think is pretty impressive from a performance perspective. And that’s just one thread; I wonder what the system will do as soon as a multi-threaded workload starts hammering the disks.

It’s a shame that you currently lose 50% of capacity due to the RAID1 mirroring. Some sort of parity-based protection would be welcome to increase efficiency, although this would most likely increase the load on the SDC and thus steal some CPU power away from the application.
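
To put a number on that, here is a rough Python estimate of what is left after the spare reservation and the mirroring. It is a simplification that ignores any other overhead, and both the function name usable_capacity_tb and the 1 TB drive size are assumptions made for the example (the post doesn't state the actual drive size).

```python
def usable_capacity_tb(raw_tb, spare_pct):
    """Rough usable capacity: subtract the spare reservation, then halve for mirroring."""
    if not 0 <= spare_pct < 100:
        raise ValueError("spare_pct must be in [0, 100)")
    after_spare = raw_tb * (1 - spare_pct / 100)
    return after_spare / 2  # every piece of data is stored twice

# Four nodes with six SSDs each; 1 TB per drive is an assumed figure
raw_tb = 4 * 6 * 1
print(usable_capacity_tb(raw_tb, 25))  # 24 TB raw, 25% spare -> 9.0
```

So in a small four-node cluster, the spare reservation and the mirroring together leave a bit over a third of the raw capacity; the picture improves as the cluster grows and the spare percentage drops.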

ScaleIO is quite flexible: you can add and remove capacity on the fly, using SDS nodes that don’t have to be of equal size. With the right hardware (high-speed disks, 10GbE) the array is surprisingly quick. I’m not sure (yet) how the ScaleIO licensing and full solution costs compare to traditional arrays, but this looks interesting. You can test it yourself if you like by downloading a trial version!