Observing Failover of Busy Pools

While watching failover tests under load, we can easily see the system-level effects of a failover in a single chart.

But first, some background. At InterModal Data, we've built an easy-to-manage system of many nodes that can provide scalable NFS and iSCSI shares in the petabyte range. This software-defined storage system scales nicely with great hardware, such as the HPE systems shown here. An important part of the system is the site-wide analytics, where we measure many aspects of performance, usage, and environmental data. This data from both clients and servers is stored in an InfluxDB time-series database for analysis.
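As a rough sketch of how such a pipeline can be fed, the snippet below writes one per-pool VOPS sample to InfluxDB with the influxdb Python client. The measurement, tag, and field names (vfs_ops, pool, node, vops) and the host name are illustrative assumptions, not the actual schema of our analytics service.

# Minimal sketch: push one per-pool VOPS sample into InfluxDB.
# Measurement, tag, and field names are assumptions for illustration only.
from influxdb import InfluxDBClient

client = InfluxDBClient(host='analytics.example.com', port=8086,
                        database='storage_metrics')

point = {
    "measurement": "vfs_ops",                  # hypothetical measurement name
    "tags": {"pool": "p115", "node": "s116"},  # which pool, which Data Node
    "fields": {"vops": 11500},                 # VFS-layer operations per second
    "time": "2016-01-01T01:33:00Z",            # illustrative timestamp
}
client.write_points([point])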

For this test, the NFS clients throw a sustained, mixed read/write, mixed-size, mixed-randomness workload at multiple shares on two pools (p115 and p116) served by two Data Nodes (s115 and s116). At the start of the sampled interval, both pools are served by s116. At 01:34, pool p115 is migrated (failed over, in cluster terminology) to s115. The samples are taken every minute, but the actual failover time for pool p115 is 11 seconds under an initial load of 11.5k VOPS (VFS-layer operations per second). After the migration, both Data Nodes are serving the workload, so per-pool performance increases to 16.5k VOPS. The system goes from serving an aggregate of 23k VOPS to 33k VOPS -- a valid reason for re-balancing the load.
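A chart like the one described above falls out of a simple per-pool query against the same data. The following is a minimal sketch, reusing the hypothetical vfs_ops measurement from the example above; summing the two per-pool series at each timestamp gives the aggregate, which climbs from roughly 23k to 33k VOPS once both Data Nodes share the work.

# Minimal sketch: per-pool VOPS at one-minute resolution across the failover.
# The vfs_ops measurement, vops field, and pool tag are assumptions carried
# over from the previous example; the time window is illustrative.
from influxdb import InfluxDBClient

client = InfluxDBClient(host='analytics.example.com', port=8086,
                        database='storage_metrics')

query = ('SELECT mean("vops") FROM "vfs_ops" '
         "WHERE time >= '2016-01-01T01:00:00Z' AND time < '2016-01-01T02:00:00Z' "
         'GROUP BY time(1m), "pool"')
result = client.query(query)

# One series per pool; plot them, or sum them per timestamp for the aggregate.
for pool in ("p115", "p116"):
    points = result.get_points(tags={"pool": pool})
    print(pool, [p["mean"] for p in points])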
