Data Deduplication: A Tongue Twister Worth the Effort

Data deduplication sounds like an excellent candidate for an HGTV reality show: companies drowning in a sea of redundant data receive a visit from two perky IT people who descend on their files like vultures on roadkill, promising to cure the company's duplicate-data ills with a weekend and a few thousand dollars' worth of software.
Frugal Server Admin: From the Department of Redundancy Department, 'clean up your copies.' It's an easy way to save some cash.

I agree that it doesn't sound like the best premise for a new show, but if you're looking for a new money-saving topic to discuss at the conference table next week, toss out the concept of data deduplication. Yes, it's a mouthful to say, but that mouthful might save you a handful: a handful of dollars, that is.

The data deduplication process involves removing copies of files and replacing those duplicates with pointers back to the original copy. Removing multiple copies frees up valuable storage space, makes backups smaller and faster, and reduces network traffic for over-the-network backups. Add the three together and you have significant money savings.
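
To make the pointer idea concrete, here is a minimal sketch in Python. It assumes whole-file deduplication keyed on a SHA-256 digest and uses in-memory dictionaries in place of a real storage system; the function names and sample paths are hypothetical, and commercial products typically work on sub-file blocks with far more bookkeeping.

```python
import hashlib

def deduplicate(files):
    """files maps path -> bytes content.

    Returns (store, pointers): store keeps one copy of each distinct
    content keyed by its digest; pointers maps every path to a digest.
    """
    store = {}
    pointers = {}
    for path, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        store.setdefault(digest, content)  # keep only the first copy
        pointers[path] = digest            # every path points at that copy
    return store, pointers

def restore(path, store, pointers):
    """Follow the pointer back to the single stored copy."""
    return store[pointers[path]]

# Hypothetical sample data: two identical reports and one unrelated file.
files = {
    "/home/a/report.doc": b"quarterly numbers",
    "/home/b/report-copy.doc": b"quarterly numbers",
    "/home/c/notes.txt": b"meeting notes",
}
store, pointers = deduplicate(files)
raw = sum(len(c) for c in files.values())
kept = sum(len(c) for c in store.values())
print(f"{raw} bytes before, {kept} bytes after deduplication")
```

Both report paths resolve to the same stored copy, so the second report costs only a pointer rather than a second set of bytes.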

Usually the term "deduplication" refers to enterprise storage systems that house huge amounts of data harboring perhaps tens of thousands of duplicated files. The sheer number of files and possible copies of those files makes the task seem overwhelming, but fortunately, there is hope in the form of sophisticated software designed for this purpose.

A Tale of Two Methods

There are two types of data deduplication: source and target. Source-based deduplication takes place as the backup software processes the files prior to transfer to media. This means the deduplication software replaces your current backup software and strategy with one that examines file contents on the fly. As you might expect, source deduplication speeds aren't stellar (though still better than tape), but savings come in the form of less network bandwidth being consumed, due to fewer files being transferred, and reduced space on backup media.
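
As a rough illustration of the source-side idea, the Python sketch below has the backup client hash each file and transfer content only when the backup target does not already hold it. The BackupServer class and its methods are hypothetical stand-ins invented for this example, not any vendor's API, and whole-file hashing is an assumption.

```python
import hashlib

class BackupServer:
    """Hypothetical backup target that remembers what it already holds."""

    def __init__(self):
        self.blobs = {}    # digest -> content, one copy each
        self.catalog = {}  # path -> digest, so every file can be restored

    def has(self, digest):
        return digest in self.blobs

    def put(self, digest, content):
        self.blobs[digest] = content

    def record(self, path, digest):
        self.catalog[path] = digest

def source_backup(files, server):
    """Back up files (path -> bytes); return content bytes actually sent."""
    sent = 0
    for path, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        if not server.has(digest):       # only new content crosses the wire
            server.put(digest, content)
            sent += len(content)
        server.record(path, digest)      # duplicates cost only a catalog entry
    return sent

# Hypothetical remote-office data with one duplicated file.
office = {
    "/srv/share/handbook.pdf": b"employee handbook",
    "/srv/users/kim/handbook.pdf": b"employee handbook",
    "/srv/users/lee/budget.xls": b"budget figures",
}
server = BackupServer()
first_run = source_backup(office, server)
second_run = source_backup(office, server)
print(first_run, second_run)
```

The second run sends no file content at all because nothing changed, which is where the bandwidth savings come from. The hashing work on the client is also why source-side speeds tend to lag.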

Source deduplication can be used to back up remote office data without additional hardware at the remote site. Restore speeds outperform tape for equivalent amounts of data.

Target-based deduplication uses your current backup strategy to deliver data to a virtual tape library (disk). Deduplication occurs in the virtual tape library, and you can make tape backups from it for offsite storage.

Note: With a target-based implementation, you don't have to wait for an entire backup or deduplication process to complete before transferring files to tape.

Target implementations transfer data at a very high rate (100s of MB/s to 1GB/s+), but backup hardware and software at remote sites must remain in place. This method reduces the amount of storage required for backups but does not decrease network bandwidth during transfer, since all of the original files first travel unchanged to the VTL for processing.
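
The bandwidth and storage tradeoff between the two methods can be put in back-of-the-envelope terms. The Python sketch below is a simplified model, assuming whole-file deduplication and ignoring the overhead of hashes, catalogs and protocol traffic; the function names and the sample data are hypothetical.

```python
import hashlib

def unique_bytes(files):
    """Total size of the distinct file contents in files (path -> bytes)."""
    sizes = {}
    for content in files.values():
        sizes[hashlib.sha256(content).hexdigest()] = len(content)
    return sum(sizes.values())

def compare(files):
    """Estimate bytes on the network and on backup media for each method."""
    total = sum(len(c) for c in files.values())
    unique = unique_bytes(files)
    return {
        # Source: duplicates are dropped before transfer.
        "source": {"network": unique, "stored": unique},
        # Target: everything travels to the VTL, then duplicates are dropped.
        "target": {"network": total, "stored": unique},
    }

# Hypothetical data set: two identical 1,000-byte files and one 500-byte file.
files = {
    "a.dat": b"x" * 1000,
    "b.dat": b"x" * 1000,
    "c.dat": b"y" * 500,
}
result = compare(files)
print(result)
```

In this model both methods end up storing the same 1,500 bytes, but the target method moves all 2,500 bytes across the network first.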

Deduplication implementations, like other technologies, have tradeoffs. The superiority of one method over another is one of those hotly debated IT topics, but the "best" method for one firm may not be optimal for another.

Which Method Is Best for You?

My personal preference is source deduplication. My reason is simple: Source deduplication removes redundant data at the source (the original filesystem), which decreases the original storage volume and the volume of data that's backed up. This method provides significant savings on both ends of the data flow with the added bonus of fewer gigabytes flowing across network wires.

However, each situation requires special consideration before making a decision on one method over another. Source deduplication might serve one company well, while the target method provides the perfect backup solution for another. If your company has remote locations, source deduplication will likely make you happy, while a company with unlimited available bandwidth for backups will fall in love with target deduplication.

My best advice comes from the vendors themselves (see the Vendor Listing in the next section): Let the technology speak for itself. If you're considering moving to a deduplication backup strategy, pick two or three vendors for each type of deduplication, and have them come into your company and back up your data. It's your data, your location, your staff and your time to discover the best method. Don't rely on hearsay, marketing or opinion for your decision. Let the technology speak for itself.

Who's Who in Deduplication

The following products deliver a wide range of solutions for data deduplication, from software applications to appliances to full software and hardware backup and recovery systems.

If you're looking for backup solutions that work with virtualization products, such as those from VMware, most of these solutions fully support that technology. Select your backup strategy based on the other products with which you must work in your environment. Ask each vendor whether your particular list of applications is certified for use with its product. If not, seek another solution.

Data deduplication saves money by sparing valuable storage capacity, lowering network traffic, decreasing backup and restore times and easing those backup and restore efforts at remote sites. Your data ranks at the top of your most valuable business assets list and its protection is paramount. Choose a solution that's both safe and cost-effective.

Ken Hess is a freelance writer who writes on a variety of open source topics including Linux, databases, and virtualization. He is also the coauthor of Practical Virtualization Solutions, which is scheduled for publication in October 2009. You may reach him through his web site at http://www.kenhess.com.