High Performance Deduplication

Enterprise customers, especially those migrating from competing solutions, regularly ask us about the scalability of Druva inSync. Since the launch of v4.0, inSync has scaled exceptionally well, especially for large deployments. The software has succeeded where the majority of competing solutions have either failed or turned off deduplication.

About a week ago, at the request of a large customer, we started testing one of the competing solutions. We tested the software on 1 million files totaling 2TB, of which 48% was duplicate data. inSync finished the backup in about 22 hours; the competing software is still backing up.

inSync does not offer deduplication as an “integration”; the whole product was designed around deduplication and continuous data protection (CDP). There is NO flag to turn off dedupe and there never will be.

This article focuses on my thoughts on how Druva succeeds where the majority of others fail.

Why Source Deduplication Fails to Scale for the Majority of Vendors

The biggest bottleneck for deduplication scalability is random disk I/O performance. Almost all dedupe systems include a database that stores the block-hash index, and that index must be consulted for every hash check. A server-class magnetic disk usually has a latency of 8-12ms, which limits hash lookups to about 100/sec and throttles dedupe performance drastically.

When the data set is small, the entire index can reside in memory, so hash checks are much faster. As the index grows, I/O congestion brings down the software’s capacity to perform inline deduplication. Consider this: just 1,000 users can create over 10 billion blocks for backup. Checking them at a rate of 100/sec could take about 3.2 years.
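The arithmetic behind that estimate can be checked in a few lines. This is only a back-of-the-envelope sketch: it assumes one random disk read per hash lookup and takes 10ms as the midpoint of the 8-12ms range; the function names are illustrative.

```python
# Back-of-the-envelope estimate for disk-bound hash lookups.
# Assumption: every lookup costs exactly one random disk read.
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def lookups_per_second(seek_latency_ms: float) -> float:
    """Throughput when each lookup waits for one random read."""
    return 1000.0 / seek_latency_ms


def years_to_check(total_blocks: int, rate_per_sec: float) -> float:
    """Time needed to check every block hash at the given rate."""
    return total_blocks / rate_per_sec / SECONDS_PER_YEAR


if __name__ == "__main__":
    rate = lookups_per_second(10.0)  # midpoint of the 8-12ms range
    print(f"{rate:.0f} lookups/sec")
    print(f"{years_to_check(10_000_000_000, rate):.2f} years")
```

At 100 lookups/sec, 10 billion hashes need 100 million seconds, which is a little over three years of nothing but index reads.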

Learnings from the Storage Guys

Data Domain had an interesting approach. They optimized inline dedupe performance for backup streams. Since the backups were mostly of servers with a few large files, and the data arrived as long streams in tar format, Data Domain used a simple index read-ahead algorithm to load the relevant parts of the index before the stream's block hashes reached the server. Since the streams changed less than 10% between two successive backups, the algorithm helped deduplicate them at a very fast pace.
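The read-ahead idea can be illustrated with a toy model. This is not Data Domain's implementation, only a sketch under stated assumptions: the index is grouped into fixed-size segments in the order the previous backup wrote its hashes, and the first miss on a segment pulls the whole segment into memory. The class name, `segment_size`, and the in-memory `segment_of` map (standing in for the on-disk lookup) are all hypothetical.

```python
from typing import Dict, List, Set


class ReadAheadIndex:
    """Toy stream-ordered index that prefetches neighbouring hashes."""

    def __init__(self, previous_stream: List[str], segment_size: int = 4):
        # "Disk": hashes grouped in the order the previous backup wrote them.
        self.segments: List[List[str]] = [
            previous_stream[i:i + segment_size]
            for i in range(0, len(previous_stream), segment_size)
        ]
        # Stand-in for the on-disk lookup: hash -> segment number.
        self.segment_of: Dict[str, int] = {
            h: n for n, seg in enumerate(self.segments) for h in seg
        }
        self.cache: Set[str] = set()  # hashes currently held in memory
        self.disk_reads = 0           # simulated random reads

    def is_duplicate(self, block_hash: str) -> bool:
        if block_hash in self.cache:
            return True               # served from memory, no disk read
        self.disk_reads += 1          # cache miss costs one random read
        seg_no = self.segment_of.get(block_hash)
        if seg_no is None:
            return False              # genuinely new block
        # Read-ahead: load the whole segment, expecting its neighbours next.
        self.cache.update(self.segments[seg_no])
        return True


if __name__ == "__main__":
    previous = [f"h{i}" for i in range(8)]
    current = previous.copy()
    current[5] = "new"                # one changed block in the new stream
    index = ReadAheadIndex(previous, segment_size=4)
    duplicates = sum(index.is_duplicate(h) for h in current)
    print(duplicates, "duplicates,", index.disk_reads, "disk reads")
```

Because the new stream mostly repeats the old one in the same order, eight lookups cost only three simulated disk reads here. With randomly ordered hashes the prefetched neighbours would rarely be the ones needed next.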

Solid State Disks

A simple solution to the random-I/O problem is to store the index on SSDs. Although we changed certain features to support SSDs, the solution wasn’t complete because of the size limitation they impose.

Two Step Approach for Druva: No-SQL + HyperCache

The “Data Domain approach” did not work for us, as our data was much more random and came from many different sources. On the flip side, we had much more knowledge of the data formats we were backing up. The first step towards scalability was to get rid of the built-in SQL database, which imposed a lot of latency because of SQL query serialization and execution. We replaced PostgreSQL with Oracle's NoSQL BDB (Berkeley DB) as an embedded database, which improved performance and is much simpler to maintain.
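To show what an embedded key-value hash index looks like compared with going through a SQL layer, here is a minimal sketch. It uses Python's standard-library `dbm` module purely as a stand-in for an embedded store such as BDB; it is not inSync's code. The SHA-256 block hash, the `store_block_ref` helper, and the `chunk-…` location strings are assumptions made for illustration.

```python
import dbm
import hashlib
import os
import tempfile
from typing import List


def block_key(data: bytes) -> bytes:
    """Hash a block; SHA-256 is an assumption, not inSync's documented choice."""
    return hashlib.sha256(data).digest()


def store_block_ref(index, data: bytes, location: str) -> bool:
    """Record a block; return True if it was already known (a duplicate)."""
    key = block_key(data)
    if key in index:          # direct key lookup, no query parsing or planning
        return True
    index[key] = location.encode("utf-8")
    return False


def demo() -> List[bool]:
    with tempfile.TemporaryDirectory() as tmp:
        # "c" opens the database for read/write, creating it if needed.
        with dbm.open(os.path.join(tmp, "hash_index"), "c") as index:
            return [
                store_block_ref(index, b"block A", "chunk-0001"),
                store_block_ref(index, b"block B", "chunk-0002"),
                store_block_ref(index, b"block A", "chunk-0003"),  # duplicate
            ]


if __name__ == "__main__":
    print(demo())
```

The point of the design is that a hash check becomes a single in-process get or put on a byte key, with no SQL statement to serialize, parse, and execute for each block.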

The second major innovation was HyperCache, a selective in-memory cache of the index. HyperCache consists of both a positive and a negative cache, which remember the most probable and the least probable hashes for the ongoing backup. HyperCache uses a continuously learning algorithm that weighs parameters such as time, frequency and probability of a hash to decide whether to cache it.
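HyperCache's internals are not described here, so the following is only a generic sketch of the positive/negative cache idea, not Druva's algorithm. It assumes a positive cache of hashes known to be in the index and a negative cache of hashes known to be absent, and it evicts by recency alone; frequency and probability signals are left out. The `TwoSidedCache` name and its methods are hypothetical.

```python
from collections import OrderedDict
from typing import Set


class TwoSidedCache:
    """Generic positive/negative lookup cache in front of a slow index."""

    def __init__(self, index: Set[str], capacity: int = 1024):
        self.index = index                  # stand-in for the on-disk index
        self.positive: OrderedDict = OrderedDict()  # hashes known present
        self.negative: OrderedDict = OrderedDict()  # hashes known absent
        self.capacity = capacity            # per-cache entry limit
        self.disk_lookups = 0               # simulated random reads

    def _remember(self, cache: OrderedDict, block_hash: str) -> None:
        cache[block_hash] = None
        cache.move_to_end(block_hash)       # mark as most recently used
        if len(cache) > self.capacity:
            cache.popitem(last=False)       # evict least recently used

    def contains(self, block_hash: str) -> bool:
        if block_hash in self.positive:
            self._remember(self.positive, block_hash)
            return True                     # answered from memory
        if block_hash in self.negative:
            self._remember(self.negative, block_hash)
            return False                    # answered from memory
        self.disk_lookups += 1              # only now touch the slow index
        found = block_hash in self.index
        self._remember(self.positive if found else self.negative, block_hash)
        return found

    def add(self, block_hash: str) -> None:
        """A new block was stored, so its hash is no longer absent."""
        self.index.add(block_hash)
        self.negative.pop(block_hash, None)
        self._remember(self.positive, block_hash)


if __name__ == "__main__":
    cache = TwoSidedCache({"a", "b"}, capacity=2)
    print(cache.contains("a"), cache.contains("a"), cache.disk_lookups)
    print(cache.contains("x"), cache.contains("x"), cache.disk_lookups)
```

Repeated checks for the same hash, whether it is a known duplicate or a known new block, are answered from memory and add nothing to `disk_lookups`. A production cache would need a smarter admission and eviction policy than plain least-recently-used.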

The Result

The result was an 85% reduction in disk I/O, using 4GB of RAM for every 1TB of data stored. The reduction in I/O translates to 4X better scalability, and the solution can easily scale to thousands of users, with performance scaling linearly.

Using SSDs improves performance by a further 6X. The inSync core has been modified to keep only the most concurrently accessed part of the database index on SSDs and to optimize it for solid state drives.