Data deduplication and backup and recovery: Jon Toigo's take on dedupe today

Weeks of amusement were provided early this summer to those of us watching from the sidelines as EMC Corp. and NetApp Inc. courted Data Domain. A bidding war, smallish by Wall Street mergers-and-acquisitions standards, ensued that resembled more than anything else an antiques auction. Creative minds could write a back story borrowed from an old 1930s-era black-and-white film in which one bidder deliberately drove up the price for the other, but without any real intention of buying the goods -- the goal being to saddle the opponent with an exorbitant purchase price.

Truth be told, while Data Domain enjoyed the early lead in data deduplication products, it is old news now. Today, CA ARCserve 12.5 and IBM Corp. Tivoli supply data deduplication as a feature of their backup wares, and Symantec Corp. will shortly do the same, rather than offering it as a standalone product that stovepipes the dedupe function into an array of overpriced disk drives. Data Domain's claims notwithstanding, there is nothing inherently superior about its technology compared to other dedupe schemes available in the market. Price and cost of ownership will ultimately be the determinants of product success, as they always are -- and, of course, flexibility will also factor in there somewhere.

One consumer I recently interviewed who had already bought Data Domain decided to buy a competitor's software-only solution on the next go-round because of difficulties inherent in scaling and managing multiple stovepipe arrays. Deduplication, like so many of the contemporary set of array "value-add" features, is increasingly understood to be a software service that needs to scale independently of hardware.

Data deduplication and disaster recovery

As it pertains to disaster recovery and data protection, the deduplication service is an enabling technology rather than an end in itself. To paraphrase FalconStor Software's CEO ReiJane Huai, it is essentially "waste management" technology. Huai observes that backup software (which he helped invent in the open-systems world) is extraordinarily wasteful. Many companies would like to store backups, not only on tape for removal and off-site storage, but also on disk, to facilitate the quick restoration of individual files that have become corrupted or been deleted altogether due to machine or human fault. These kinds of file-level disasters make up the preponderance of crises today, and a local, restorable copy of backups could make the difference between the small crisis remaining small and its blossoming into a full-out disaster scenario.

The problem is that backups consume disk space like my kids eat Pop Tarts. Say you have a terabyte of data to back up and you back it up in full four times a month (once a week, in other words): You are looking at 4 TB of disk to hold all the data, despite the fact that less than 10% of the data has actually changed from one backup set to the next.

Putting dedupe to the test

With deduplication, you can shrink that load down. We did some testing recently of CA ARCserve in our labs and came up with some great numbers. Assume the same scenario as above: a terabyte backup produced four times per month. For testing, we changed about 7% of the data in each backup set. The first full backup set was reduced by a modest percentage to 910 GB. (That "low" number was apparently due to the nature of the data in our backup set, some of which could not be reduced by very much. CA claims that, in their testing, they see between 50% and 70% reduction in the first backup.) Subsequent full backups were significantly reduced given their overlap with data in the initial full: to about 50 GB each. The total disk requirement for storing 4 TB of raw backup data was a mere 1.06 TB.
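For readers who want to check the arithmetic, here is a back-of-the-envelope sketch in Python. It uses only the figures quoted above and assumes 1 TB is counted as 1,000 GB; the variable names are mine and are not part of any vendor tool.

```python
# Figures quoted in the lab test above, in GB.
# Assumption: 1 TB is counted as 1,000 GB.
RAW_FULL_GB = 1000           # size of each full backup before dedupe
BACKUPS_PER_MONTH = 4        # one full backup per week
FIRST_FULL_STORED_GB = 910   # first full after dedupe
LATER_FULL_STORED_GB = 50    # each later full after dedupe (approximate)

# Disk needed with no deduplication at all.
raw_gb = RAW_FULL_GB * BACKUPS_PER_MONTH

# Disk needed with dedupe: one large first full plus three small later fulls.
stored_gb = FIRST_FULL_STORED_GB + LATER_FULL_STORED_GB * (BACKUPS_PER_MONTH - 1)

# Overall reduction ratio and the saving on the first full alone.
reduction_ratio = raw_gb / stored_gb
first_full_saving = 1 - FIRST_FULL_STORED_GB / RAW_FULL_GB

print(f"raw: {raw_gb} GB, stored: {stored_gb} GB")
print(f"overall reduction: {reduction_ratio:.2f} to 1")
print(f"first full backup saving: {first_full_saving:.0%}")
```

The stored total works out to 1,060 GB, which matches the 1.06 TB figure, and the first full backup on its own shrinks by only about 9%. Nearly all of the saving comes from the three later fulls.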

Removing similarities in the data sets is the waste management function to which Huai refers, and it is fast becoming a fixture in contemporary organizations that want restorable data close to their production environments. Fielding this functionality on a specialty array, however, defeats the purpose of doing it at all from a cost standpoint. As capacity needs scale, as they inevitably do, hardware should scale independently of software so you can leverage commodity disk cost/capacity trends (bigger disks, lower prices per GB) instead of paying a premium for "value-add" array controllers -- thereby keeping the price of the data protection solution reasonable.
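To make "removing similarities" concrete, here is a minimal Python sketch of one common approach: split each backup into fixed-size chunks, hash each chunk, and keep only one copy per hash. This is an illustration under my own assumptions, not a description of how CA ARCserve, Data Domain or any other product works. The `ChunkStore` class, the 4 KB chunk size and the toy data are all hypothetical.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed chunk size; real products choose their own


class ChunkStore:
    """Toy dedupe store: keeps one copy of each unique chunk, keyed by SHA-256."""

    def __init__(self):
        self.chunks = {}  # digest -> chunk bytes

    def put(self, data):
        """Store a backup and return its recipe (ordered list of chunk digests)."""
        recipe = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # keep only the first copy
            recipe.append(digest)
        return recipe

    def get(self, recipe):
        """Rebuild a backup from its recipe."""
        return b"".join(self.chunks[digest] for digest in recipe)

    def stored_bytes(self):
        return sum(len(chunk) for chunk in self.chunks.values())


def make_backup(block_ids):
    """Build toy backup data: one distinct 4 KB chunk per block id."""
    return b"".join(n.to_bytes(4, "big") * (CHUNK_SIZE // 4) for n in block_ids)


# Week 1: 100 distinct chunks. Week 2: the same data with 7 chunks changed.
week1_ids = list(range(100))
week2_ids = [1000 + n if n < 7 else n for n in week1_ids]
week1 = make_backup(week1_ids)
week2 = make_backup(week2_ids)

store = ChunkStore()
recipe1 = store.put(week1)
recipe2 = store.put(week2)

print("raw bytes:", len(week1) + len(week2))
print("stored bytes:", store.stored_bytes())
print("restores match:", store.get(recipe1) == week1 and store.get(recipe2) == week2)
```

In this toy run the second backup adds only its seven changed chunks to the store, which is the same effect, in miniature, as the small later fulls in the lab numbers. Note that the store here is plain software sitting on whatever disk you give it, so capacity can grow without touching the dedupe logic.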

The other benefit of keeping the dedupe service separate from the hardware is flexibility. I have clients in the financial industry who want to segregate certain data from deduplication processes altogether for fear that the resulting data will not pass muster with regulators who want a "full and unaltered copy" of data that must be submitted to them by law.

The legal acceptability of deduplicated data was first questioned by NetApp a few months ago in response to a questionnaire I issued to all dedupe vendors, which asked about exactly this issue. The company claimed that "in-line" deduplication had the potential to change data, unlike NetApp's preferred "dedupe after write." This statement was widely condemned by other deduplication vendors as self-serving FUD -- the nascent dedupe vendor community didn't want anyone trash-talking their preferred technology.

Bottom line: Data deduplication vendors can rant all they want about their regulatory compliance, but the simple truth is that deduplication has never been tested in a court of law or administrative hearing. My clients don't want to be the first test case.

The nice thing about software-based dedupe is that it can be managed from the backup server and applied selectively to different data sets targeted for backup. That is what I mean by flexibility, and that is what a software service should provide.
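As a rough illustration of selective application, the Python sketch below tags each backup job with a dedupe flag and splits the jobs accordingly. The job names and the `BackupJob` structure are hypothetical and do not reflect the configuration format of any real backup product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupJob:
    name: str          # hypothetical data set name
    deduplicate: bool  # False keeps the data out of dedupe processing


# Hypothetical policy: regulated data is stored as written, the rest is deduplicated.
JOBS = [
    BackupJob("file-shares", deduplicate=True),
    BackupJob("mail-archive", deduplicate=True),
    BackupJob("regulated-records", deduplicate=False),
]


def plan(jobs):
    """Group job names by whether they go through the dedupe service."""
    return {
        "deduplicated": [job.name for job in jobs if job.deduplicate],
        "stored_as_written": [job.name for job in jobs if not job.deduplicate],
    }


backup_plan = plan(JOBS)
print(backup_plan)
```

The point of the sketch is that the decision lives in policy on the backup server, per data set, rather than being fixed by whichever array the data happens to land on.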

About this author: Jon Toigo is a veteran planner who has assisted more than 100 companies in the development and testing of their continuity plans. He is the author of 15 books, including five on the subject of business continuity planning, and is presently writing the 4th Edition of his "Disaster Recovery Planning" title for distribution via the Web. Readers can read chapters as they are completed, and contribute their own thoughts and stories, at http://book.drplanning.org commencing in August 2009.
