Data Deduplication And SSDs: Two Great Tastes That Taste Great Together

It should come as no surprise to even the most casual observer that data deduplication and solid state disks have been--along, of course, with everything cloud--the most significant technologies in storage during the past few years. Until recently, however, they have been applied to very different problems. The solid state storage story has been about performance, but data deduplication, while it has begun to sneak into primary storage, was about efficiency and was mostly relegated to secondary storage systems.

It turns out that, like chocolate and peanut butter, data deduplication and SSDs combine to create a whole greater than the sum of its parts. Most steely-eyed storage guys have been reluctant to implement data deduplication--or compression, for that matter--on their primary storage systems for fear that these data reduction techniques would rob them of needed performance. Solid state disks are expensive on a dollar-per-gigabyte basis but cheap on a dollar-per-IOPS basis. And if data reduction can squeeze more data on the same SSDs while only moderately reducing performance, that might be a trade-off worth making.

The truth is that inline deduplication will add some small amount of latency to disk writes as the system chunks the incoming data, hashes each chunk and looks up the hash to see whether that chunk is a duplicate. As anyone who has ever restored data from a deduplicated data store can tell you, it can also have an impact on read performance, because data that was logically written to the system sequentially has to be reassembled or, as it’s sometimes misleadingly termed, "rehydrated" from chunks scattered across the data store.
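To make the chunk, hash and look-up sequence concrete, here is a minimal Python sketch. It is not any vendor's implementation: the DedupStore class, the fixed 4KB chunk size and the choice of SHA-256 are all assumptions made for illustration, and real arrays often use variable-size chunks and on-disk indexes.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size; real systems vary


class DedupStore:
    """Toy in-memory inline deduplication store (hypothetical, for illustration)."""

    def __init__(self):
        self.chunks = {}  # fingerprint -> chunk bytes, each unique chunk stored once
        self.files = {}   # file name -> ordered list of fingerprints (the "recipe")

    def write(self, name, data):
        recipe = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]       # 1. chunk
            fp = hashlib.sha256(chunk).hexdigest()         # 2. hash
            if fp not in self.chunks:                      # 3. look up
                self.chunks[fp] = chunk                    # new data: store it
            recipe.append(fp)                              # duplicate: reference only
        self.files[name] = recipe

    def read(self, name):
        # Reassembly ("rehydration"): one lookup per chunk, in logical order.
        return b"".join(self.chunks[fp] for fp in self.files[name])

    def stored_bytes(self):
        return sum(len(c) for c in self.chunks.values())


store = DedupStore()
payload = b"A" * CHUNK_SIZE * 3 + b"B" * CHUNK_SIZE  # 16KB logical, 2 unique chunks
store.write("example", payload)
```

The hash and lookup in write() are where the extra write latency comes from. The read() path fetches chunks from wherever they happened to land, which is why a sequential restore turns into random I/O on the back end.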

If we stored the deduplicated data across an array of solid state, rather than spinning, disks, the read performance problem would go away. This is because SSDs can respond to random read I/O requests just as fast as they respond to sequential I/Os. True, deduplicating and/or compressing the data might introduce 500µs to 1ms of latency on writes, but since a typical commercial- or enterprise-grade SSD has write latency of less than 3ms, the total of 4ms or less is still under the 5ms or so typical of a 15K RPM drive.

A mid-range commercial-grade multilevel cell (MLC)-based SSD, like Micron’s P400e or Intel’s 510, can deliver 8,000 4K IOPS. An array of 20 of these drives has 160,000 IOPS of raw capacity; using RAID 10, which writes every block to two drives, it would deliver about 80,000 IOPS. If our deduplication engine slowed them down by even 15%, the array would still deliver 68,000 IOPS, or the equivalent of 340 15K RPM spinning disks at roughly 200 IOPS apiece.
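For anyone who wants to plug in their own numbers, the back-of-the-envelope math can be sketched in a few lines of Python. The function name and parameters are hypothetical. Two figures are assumptions rather than anything stated outright: that the 80,000 number reflects RAID 10 mirroring every write, and that a 15K RPM disk delivers about 200 IOPS (which is what 68,000 divided by 340 implies).

```python
def dedup_array_iops(drives, iops_per_drive, mirror_copies=2, dedup_overhead_pct=15):
    """Estimate array IOPS after RAID 10 mirroring and deduplication overhead.

    Integer arithmetic keeps the result exact for whole-number inputs.
    """
    raw = drives * iops_per_drive          # aggregate IOPS of all drives
    after_raid = raw // mirror_copies      # assumed: each write lands on two drives
    return after_raid * (100 - dedup_overhead_pct) // 100


HDD_15K_IOPS = 200  # assumed per-disk figure for a 15K RPM drive

array_iops = dedup_array_iops(drives=20, iops_per_drive=8000)
hdd_equivalent = array_iops // HDD_15K_IOPS

print(array_iops, hdd_equivalent)
```

Doubling the assumed deduplication overhead to 30% still leaves 56,000 IOPS, so the conclusion is not very sensitive to the exact penalty.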

Then there’s the matter of write endurance. Most administrators’ major concern about MLC flash SSDs is that each block of flash can only be erased and rewritten 3,000 to 5,000 times. Deduplicating data reduces the amount of data that has to be written to flash in the first place. Combined with array-wide wear leveling and log-based data structures that limit writes to full pages, deduplication cuts the number of erase-rewrite cycles each drive sees, thus extending SSD life.
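A rough way to see the endurance effect is to treat an SSD as having a fixed write budget (capacity times erase cycles) and to let deduplication shrink what gets charged against it. The Python sketch below is a simplified model, not a vendor formula: the function, the 200GB drive and the 1TB-per-day workload are all hypothetical, and it assumes perfect wear leveling.

```python
def ssd_lifetime_days(capacity_gb, pe_cycles, host_writes_gb_per_day,
                      dedup_ratio=1.0, write_amplification=1.0):
    """Estimate days until the flash write budget is exhausted.

    dedup_ratio: logical data written / data actually sent to flash
                 (2.0 means half of the incoming data is duplicate).
    write_amplification: flash writes per byte the controller accepts.
    Assumes wear is spread evenly across all blocks.
    """
    total_writable_gb = capacity_gb * pe_cycles
    flash_writes_per_day = host_writes_gb_per_day / dedup_ratio * write_amplification
    return total_writable_gb / flash_writes_per_day


# Hypothetical 200GB MLC drive rated for 3,000 cycles, absorbing 1TB of writes a day.
baseline = ssd_lifetime_days(200, 3000, 1000)
with_dedup = ssd_lifetime_days(200, 3000, 1000, dedup_ratio=2.0)
```

In this model a 2:1 deduplication ratio simply doubles the drive's life. Lowering write amplification, which is what full-page, log-structured writes are meant to do, stretches it further.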

Vendors including Nimbus Data, Pure Storage and SolidFire have included deduplication in all their solid state arrays. Some of these vendors advertise the cost of their systems in dollars per gigabyte, assuming some level of deduplication, while others insist that practice is misleading. Either way, combining deduplication with solid state storage makes sense to me.

Disclaimer: SolidFire is a client of DeepStorage.net, Micron has provided SSDs for use in the DeepStorage lab, and Tom of Nimbus Data let me sit in the Lamborghini the company had in its booth at SNW. No other companies mentioned here have any commercial relationships with the author.

Howard Marks is founder and chief scientist at Deepstorage LLC, a storage consultancy and independent test lab based in Santa Fe, N.M. and concentrating on storage and data center networking. In more than 25 years of consulting, Marks has designed and implemented storage ...