The Use of Erasure Coding in VMware Virtual SAN 6.2

On February 10, 2016, VMware announced the release of Virtual SAN (VSAN) v.6.2. In terms of features, this is the biggest release of the product since the initial release (v.5.5) in March 2014. One of the new features is data protection by means of erasure coding. There are two specific configurations supported in this version: RAID-5 for protection against one failure and RAID-6 for protection against up to two concurrent failures.

In the product and marketing material, Erasure Coding and RAID-5/RAID-6 are used pretty much interchangeably. A number of people have asked about the difference between RAID and Erasure Coding and what is actually implemented in Virtual SAN.

Erasure Codes or RAID?

So, let me set the terminology straight and clarify what we do in Virtual SAN 6.2.

Erasure Coding is a general term that refers to *any* scheme of encoding and partitioning data into fragments in a way that allows you to recover the original data even if some fragments are missing. Any such scheme is referred to as an “erasure code”. For a great primer, see this paper by J. Plank: “Erasure Codes for Storage Systems: A Brief Primer”.

Reed-Solomon is a group of erasure codes based on the idea of augmenting N data values with K new values that are derived from the original values using polynomials. This is a fairly general idea with many possible incarnations.

RAID-5 is an erasure code that is typically described and understood in terms of bit parity. But even this simple code falls under the Reed-Solomon umbrella: we are augmenting N bit values with a new bit value, which is computed using a trivial polynomial under binary arithmetic (XOR).

Figure 1: RAID-5 striping with 3 data + 1 parity fragment per stripe.
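To make the parity idea concrete, here is a minimal sketch of a 3+1 stripe in Python. The fragment contents and helper names (`xor_bytes`, `parity`) are made up for illustration and say nothing about how Virtual SAN lays out data internally.

```python
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length fragments."""
    return bytes(x ^ y for x, y in zip(a, b))

def parity(fragments) -> bytes:
    """XOR all fragments together to get the RAID-5 parity fragment."""
    return reduce(xor_bytes, fragments)

# One stripe: 3 data fragments + 1 parity fragment (illustrative contents).
data = [b"AAAA", b"BBBB", b"CCCC"]
p = parity(data)

# Lose data fragment 1: XOR of the three survivors rebuilds it.
recovered = parity([data[0], data[2], p])
assert recovered == data[1]
```

The same function both encodes and recovers, because XOR is its own inverse.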

RAID-6 refers to various codes that are similar in function: they augment the data values with two new values and allow recovery if any one or two values are missing. The “classical” RAID-6 implementation is a Reed-Solomon code, which augments the parity in RAID-5 with a second “syndrome” which requires more complex calculations.
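As an illustration of the Reed-Solomon flavor, the sketch below uses the common textbook P+Q construction over GF(2^8). The field polynomial (0x11d), the generator (2), and all function names are assumptions chosen for the example; this post does not state which parameters Virtual SAN uses, and the code is plain scalar Python rather than the SIMD approach discussed later.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8+x^4+x^3+x^2+1 (0x11d)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
        b >>= 1
    return r

def gf_pow(a: int, n: int) -> int:
    r = 1
    for _ in range(n):
        r = gf_mul(r, a)
    return r

def gf_inv(a: int) -> int:
    """In GF(2^8), a^254 is the multiplicative inverse of a (a != 0)."""
    return gf_pow(a, 254)

def accumulate(data, p, q):
    """XOR each present fragment into P and its weighted form into Q."""
    for i, frag in enumerate(data):
        if frag is None:
            continue
        coeff = gf_pow(2, i)          # weight g^i for fragment i
        for j, byte in enumerate(frag):
            p[j] ^= byte
            q[j] ^= gf_mul(coeff, byte)

def encode(data):
    """Return (P, Q): P is the RAID-5 parity, Q is the second syndrome."""
    n = len(data[0])
    p, q = bytearray(n), bytearray(n)
    accumulate(data, p, q)
    return bytes(p), bytes(q)

def recover_two(data, x, y, p, q):
    """Rebuild data fragments x and y (marked None) from survivors, P and Q."""
    pp, qq = bytearray(p), bytearray(q)
    accumulate(data, pp, qq)          # pp = Dx^Dy, qq = g^x*Dx ^ g^y*Dy
    gx, gy = gf_pow(2, x), gf_pow(2, y)
    inv = gf_inv(gx ^ gy)
    dx = bytes(gf_mul(inv, qq[j] ^ gf_mul(gy, pp[j])) for j in range(len(pp)))
    dy = bytes(pp[j] ^ dx[j] for j in range(len(pp)))
    return dx, dy

# 4+2 stripe with illustrative contents; lose data fragments 1 and 3.
data = [b"vSAN", b"6.2!", b"RAID", b"six6"]
p, q = encode(data)
damaged = [data[0], None, data[2], None]
assert recover_two(damaged, 1, 3, p, q) == (data[1], data[3])
```

Losing one data fragment plus P, or one data fragment plus Q, is handled by simpler special cases that are omitted here for brevity.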

Traditionally, the latter calculations were slower, which led to variations designed to avoid them, like Diagonal Parity. See the original paper by Corbett et al., “Row-Diagonal Parity for Double Disk Failure Correction”. Today, however, the more complex calculations used by Reed-Solomon-based RAID-6 are no longer a problem. Modern CPU instruction sets (specifically SSSE3 and AVX2) can be leveraged in a way that makes these calculations almost as efficient as the simple XOR operations. For a reference on this, see the paper by Plank et al., “Screaming Fast Galois Field Arithmetic Using Intel SIMD Instructions”.

In fact, we observed that performing Reed-Solomon calculations (Galois Field Arithmetic) using AVX2 is *faster* than performing simple XOR calculations without using AVX2! When we leverage AVX2 for both XOR and Reed-Solomon, the difference in cost (CPU cycles) between the two is under 10%. Virtual SAN 6.2 implements RAID-5 and Reed-Solomon-based RAID-6. It leverages SSSE3, which is present in all CPUs supported by vSphere, and AVX2 (present in Intel Haswell or newer).

Note that a Virtual SAN cluster needs at least 4 hosts for RAID-5 and at least 6 hosts for RAID-6. Of course, it may be larger (much larger) than that. Without making any commitments, I should state that if valid customer use cases emerge that justify additional RAID-5/6 configurations (or perhaps even other erasure codes), the Virtual SAN product team will consider those requirements. The Virtual SAN code base is generic and may support other configurations, if needed.

Space Efficiency vs. Performance

I would also like to highlight the key features and trade-offs of Erasure Coding and how it compares to replication, from a customer’s point of view. Obviously, the main benefit of Erasure Codes is better space efficiency than Replication for the same level of data protection. For example, when the goal is to tolerate one failure, the space overhead of a 3+1 RAID-5 configuration is 33% as opposed to 100% overhead with 2x replication (RAID-1). The overhead difference is even bigger between 4+2 RAID-6 (50%) and 3x replication (200%), when the goal is to tolerate up to 2 concurrent failures.
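The overhead figures above follow from a simple ratio of redundant fragments to data fragments. A small sketch (the function name `space_overhead` is just for illustration):

```python
def space_overhead(data_fragments: int, redundant_fragments: int) -> float:
    """Extra capacity consumed, as a fraction of the usable data size."""
    return redundant_fragments / data_fragments

# Tolerate 1 failure
print(f"RAID-5 (3+1):    {space_overhead(3, 1):.0%}")   # 33%
print(f"RAID-1 (2x):     {space_overhead(1, 1):.0%}")   # 100%
# Tolerate 2 failures
print(f"RAID-6 (4+2):    {space_overhead(4, 2):.0%}")   # 50%
print(f"3x replication:  {space_overhead(1, 2):.0%}")   # 200%
```

Replication is modeled here as one data copy plus one or two extra copies, which is why its overhead is 100% or 200%.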

However, the space efficiency benefits come at the price of the amplification of I/O operations.

First, in the failure-free case, read performance is not affected. However, write operations are amplified, because the parity fragments need to be updated every time data is written. In the general case, a write operation is smaller than the size of a RAID stripe. So, one way to do this is to:

read the part of the data fragment that will be modified by the write operation;

read the corresponding parts of the old parity/syndrome fragments (both the old and the new data values are needed to recalculate them);

combine the old values with the new data from the write operation to calculate the new parity/syndrome values;

write the new data;

write the new parity/syndrome values.

With 3+1 RAID-5, for a typical logical write operation, one needs to perform 2 reads and 2 writes on storage. For 4+2 RAID-6, the numbers are 3 reads and 3 writes, respectively. When Erasure Codes are implemented over the network, as is the case with distributed storage products like Virtual SAN, the amplification also means additional network traffic for write operations.
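The read-modify-write sequence for a small RAID-5 write can be sketched as follows. `CountingStore` is a hypothetical in-memory stand-in for the storage devices, used only to count I/Os; it is not a Virtual SAN interface.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

class CountingStore:
    """Hypothetical in-memory stripe that counts reads and writes."""
    def __init__(self, fragments):
        self.fragments = list(fragments)
        self.reads = 0
        self.writes = 0

    def read(self, index: int) -> bytes:
        self.reads += 1
        return self.fragments[index]

    def write(self, index: int, value: bytes) -> None:
        self.writes += 1
        self.fragments[index] = value

def small_write(store, data_index, parity_index, new_data):
    """Update one data fragment and the parity without touching the rest."""
    old_data = store.read(data_index)
    old_parity = store.read(parity_index)
    # Remove the old data's contribution from the parity, add the new one.
    new_parity = xor_bytes(xor_bytes(old_parity, old_data), new_data)
    store.write(data_index, new_data)
    store.write(parity_index, new_parity)

# 3+1 stripe: fragments 0..2 are data, fragment 3 is parity.
d = [b"AAAA", b"BBBB", b"CCCC"]
parity0 = xor_bytes(xor_bytes(d[0], d[1]), d[2])
store = CountingStore(d + [parity0])

small_write(store, data_index=1, parity_index=3, new_data=b"ZZZZ")
assert (store.reads, store.writes) == (2, 2)
```

For RAID-6 the same pattern applies with one more fragment: the second syndrome is also read, adjusted by the weighted difference between old and new data, and written back, which gives the 3 reads and 3 writes mentioned above.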

Moreover, in the presence of failures and while the storage system is in “degraded” mode (some data and/or parity fragments missing), even read operations may result in I/O and network amplification. The reason? If the fragment of data the application needs to read is missing, it needs to be reconstructed from the surviving fragments.

In other words, Erasure Coding does not come for free. It has a substantial overhead in operations per second (IOPS) and networking. For traditional storage systems that used magnetic disks (which can deliver very few IOPS), big caches, often using battery-backed NVRAM, were a prerequisite to achieve reasonable performance. And they often needed to use very large numbers of spindles – not necessarily for capacity, but to meet the requirements for IOPS.

With Flash devices, RAID-5/6 is viable with entirely commodity components. Flash devices offer a large number of (cheap) IOPS, so I/O amplification is less of a concern in that case. With Virtual SAN 6.2 and the new data reduction features it offers, All-Flash clusters may even result in more cost-effective hardware configurations than Hybrid clusters (Flash and magnetic disks), depending on workload and data properties.

In conclusion, customers must evaluate their options based on their requirements and the use cases at hand. RAID-5/6 may be applicable for some workloads on All-Flash Virtual SAN clusters, especially when capacity efficiency is the top priority. Replication may be the better option, especially when performance is the top priority (IOPS and latency). As always, there is no such thing as one size fits all. Virtual SAN allows all those properties to be specified by policies per VM or even per VMDK.

VMware offers design and sizing tools to help our customers determine the best hardware configurations and VM policies for their workload needs. But that’s a topic for another blog post.

Comments

You mention the ability of Intel CPUs to accelerate matrix operations for Erasure Coding. Do you specifically use ISA-L for EC, or have you built your own EC algorithm for VSAN 6.2? Lastly, have you benchmarked EC on non-NVMe configs vs. NVMe configs, and if so, can you give us an indication of the impact of EC on performance for both types of config? Regards.

Christos Karamanolis

February 18th, 2016

Hi Christophe,
Thank you for your comment and questions.
We are not using ISA-L in the current implementation. We have our own implementation that utilizes SSSE3 and AVX2. I do not have comprehensive comparison performance numbers for EC on NVMe vs other devices that I could share at this point. NVMe devices exhibit lower latencies, which is goodness, in general.