Inline dedupe for performance as well as capacity

I’ve spent a lot of time talking about the results of a deduplicated array for storing VMs. I think it’s time to look at some differences between deduplicated arrays. Deduplication of storage has been around for a few years. Naturally there are different ways of deduplicating storage, each with positives and negatives. One of the first questions to ask about deduplicated arrays is: inline or post-processed? Another is: what is the scope of the deduplication? A third is: where in the data path is the deduplication done?

The last two are fairly easy for arrays that are designed as primary storage for VMs. The deduplication is scoped for the entire array or at least a datastore. And the work of deduplication for live VMs is always done in the array. If the deduplicated store is designed for backups then it’s not unusual to scope deduplication to a single VM backup. This allows the backup file to be self-contained and transportable. Of course, this small scope doesn’t allow as much deduplication saving as a wider scope like a datastore or array. For backups, the deduplication can also be done by an agent inside the VM or by a proxy that sits between the VM to be backed up and its final backup destination. These approaches allow the network traffic for backups to be minimised, but they add latency to the process so aren’t good for running VMs. Dedupe arrays that are designed for running VMs have a wide deduplication scope for maximum efficiency and do all the deduplication in the array.

Post-Process Dedupe

Some deduplicated storage does post-processing for dedupe. The data is first stored without deduplication, then later it is examined and deduplicated. One of the primary reasons is that deduplication is hard work, so it can slow down writing data. To avoid poor performance, data is quickly written to disk at full size and later deduplicated to recover capacity. One of the results is that during periods of high write activity the array’s free space gets used up fast. Then, some time later when there is less activity, the deduplication frees up space. The fun part is that to analyse and deduplicate the data that has been written, it must be read, meaning that the post-processing impacts disk performance. So post-process deduplication suits arrays that already have ample performance or have inherent idle periods, like backup targets. Post-process dedupe does suit all-flash arrays, which have vast amounts of performance to handle the processing without impacting VM performance. Arrays with limited performance that use post-process deduplication are not suitable for running VMs; they are far more likely to suit backups.
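To make the trade-off concrete, here is a minimal Python sketch of a post-process store. It is only an illustration, not how any particular array works: the class name PostProcessStore, its methods, the 4 KB block size and the use of SHA-256 fingerprints are all assumptions made for the example.

```python
import hashlib


class PostProcessStore:
    """Toy post-process dedupe store (hypothetical names, illustration only)."""

    def __init__(self):
        self.raw = {}        # address -> data, landed at full size
        self.unique = {}     # fingerprint -> data, after a dedupe pass
        self.block_map = {}  # address -> fingerprint, after a dedupe pass
        self.reads_for_dedupe = 0

    def write(self, address, data):
        # Fast path: no hashing and no lookup, just land the block.
        self.raw[address] = data

    def read(self, address):
        # Blocks not yet processed are served from the full-size area.
        if address in self.raw:
            return self.raw[address]
        return self.unique[self.block_map[address]]

    def blocks_used(self):
        return len(self.raw) + len(self.unique)

    def dedupe_pass(self):
        # Every landed block must be read back before it can be fingerprinted.
        for address, data in self.raw.items():
            self.reads_for_dedupe += 1
            fingerprint = hashlib.sha256(data).digest()
            self.unique.setdefault(fingerprint, data)
            self.block_map[address] = fingerprint
        self.raw.clear()
        # Drop unique blocks that no address refers to any more.
        live = set(self.block_map.values())
        self.unique = {fp: d for fp, d in self.unique.items() if fp in live}


store = PostProcessStore()
for address in range(100):
    # 100 writes, but only 10 distinct block contents (assumed 4 KB blocks).
    store.write(address, bytes([address % 10]) * 4096)

peak = store.blocks_used()
store.dedupe_pass()
print(peak, store.blocks_used(), store.reads_for_dedupe)  # 100 10 100
```

In this toy run the store holds 100 blocks until the pass completes, and the pass itself costs 100 extra reads to get down to 10 blocks. That is the capacity spike and the extra disk work described above.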

Inline Dedupe

Inline deduplication for running VMs is tough. VMs are latency sensitive, so adding tens of milliseconds of latency to their write IO is not acceptable. To do inline deduplication you need dedicated resources: either a lot of CPU time or some custom hardware that offloads deduplication from the CPU. A big benefit of inline deduplication is that it reduces the IO to the underlying storage. Whenever a VM writes data that matches an existing unique block, there is no need to store the new copy. The most obvious case is when a dozen Windows VMs apply the same updates and all write the same blocks. Inline dedupe reduces IO inside the array as well as space consumption. Bear in mind that duplicate writes are acknowledged back to the VM as soon as they are recognised as duplicates, so duplicate writes are fast.
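The write path described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not any vendor’s implementation: the class InlineDedupeStore, its fields, the 4 KB block size and SHA-256 as the fingerprint are all hypothetical choices for illustration.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for the example


class InlineDedupeStore:
    """Toy inline dedupe store (hypothetical names, illustration only)."""

    def __init__(self):
        self.blocks = {}     # fingerprint -> data actually stored
        self.refcount = {}   # fingerprint -> number of references
        self.block_map = {}  # (vm, lba) -> fingerprint
        self.backend_writes = 0

    def write(self, vm, lba, data):
        """Return True if the block had to be written to the backend."""
        fingerprint = hashlib.sha256(data).digest()
        is_new = fingerprint not in self.blocks
        if is_new:
            # Only never-seen content costs a backend write.
            self.blocks[fingerprint] = data
            self.backend_writes += 1
        self.refcount[fingerprint] = self.refcount.get(fingerprint, 0) + 1
        old = self.block_map.get((vm, lba))
        self.block_map[(vm, lba)] = fingerprint
        if old is not None:
            self._release(old)  # the overwritten block loses one reference
        return is_new

    def read(self, vm, lba):
        return self.blocks[self.block_map[(vm, lba)]]

    def _release(self, fingerprint):
        self.refcount[fingerprint] -= 1
        if self.refcount[fingerprint] == 0:
            del self.refcount[fingerprint]
            del self.blocks[fingerprint]


store = InlineDedupeStore()
update = [bytes([i]) * BLOCK_SIZE for i in range(3)]  # three "update" blocks
for vm in range(12):                                  # a dozen VMs apply them
    for lba, block in enumerate(update):
        store.write(f"vm{vm}", lba, block)

print(store.backend_writes)  # 3
```

Here 36 logical writes from the dozen VMs turn into 3 backend writes. The other 33 only update the map and a reference count, which is why duplicate writes can be acknowledged quickly. The price is a fingerprint calculation and lookup on every write, which is where the dedicated CPU or hardware goes.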

Global Dedupe

One of the fun things about deduplication is that it gets more efficient with larger amounts of data. The more chunks of data that are stored, the more likely it is that chunks will repeat, and so deduplication will save more capacity. If you think about deduplication of a single VM’s data, you can imagine that some blocks will be repeated, for example where files have backup copies inside the VM. But if we dedupe across multiple VMs, then data that is unique within each VM but repeated across VMs will also provide dedupe savings. Often this includes business data, such as when a VM contains a data warehouse that is a duplicate of a transactional database. Also, when we use clones of production VMs for development and testing, the clones start out as entirely duplicate data. Deduplication across an entire array is going to provide the best efficiency.
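A small Python sketch shows how much the scope matters. The numbers are made up for illustration: four hypothetical VMs, each with 50 operating system blocks that are identical across VMs and 10 data blocks of its own. The function names and the SHA-256 fingerprint are likewise assumptions for the example.

```python
import hashlib


def fingerprint(block):
    return hashlib.sha256(block).digest()


def blocks_stored_per_vm(vms):
    # Scope = one VM: duplicates are only found inside each VM.
    return sum(len({fingerprint(b) for b in blocks}) for blocks in vms.values())


def blocks_stored_global(vms):
    # Scope = whole array: duplicates are found across all VMs.
    return len({fingerprint(b) for blocks in vms.values() for b in blocks})


# Made-up workload: shared OS blocks plus a little data unique to each VM.
os_blocks = [f"os-{i}".encode() for i in range(50)]
vms = {
    name: os_blocks + [f"{name}-data-{i}".encode() for i in range(10)]
    for name in ("vm1", "vm2", "vm3", "vm4")
}

logical = sum(len(blocks) for blocks in vms.values())
print(logical, blocks_stored_per_vm(vms), blocks_stored_global(vms))  # 240 240 90
```

With a per-VM scope nothing is saved in this example, because no block repeats inside a single VM. With an array-wide scope the 240 logical blocks are stored as 90, since the operating system blocks are kept once for all four VMs.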