Source and target de-duplication are very different, both in how they work and how you use them. Let's take a look at them.

Target de-duplication is what's found in Intelligent Disk Targets (IDTs), most of which are virtual tape libraries (VTLs). You continue using whatever backup software floats your boat (as long as the VTL supports it), and send your backups to your de-dupe IDT/VTL and it will de-dupe them for you. This reduces the amount of disk needed to store your data, but it does not change the amount of bandwidth needed to get the backups to the backup server. The de-dupe can reduce bandwidth usage if the de-dupe IDT/VTL can then replicate the de-duped data to another IDT/VTL in another location. Now you have an on-site and an off-site copy without making an actual tape. (If you want to make an actual tape, you can make it from the onsite or offsite IDT/VTL.)

Source de-duplication requires you to use different backup software on the client(s) where you want to use it. They may be in your data center and they may be in a remote datacenter, or they can even be a laptop. This client software talks to the backup server (that is also running the de-dupe backup software) and says "hey, I've got this piece of data here with this hash. Have you seen that hash before?" (This piece of data is a piece of a file, not the whole file.) If the server has seen that piece of data before, it doesn't send the data again; it just notes that there's another copy of that block of data at that client. That way, if a file has already been backed up by the backup server before (such as the same file being stored by multiple people), then it won't transfer that file across the LAN/WAN. In addition, if a previous version of a file has been backed up before, de-dupe will notice the parts of the file it has seen (and not back them up again) and the parts of the file it hasn't seen (and back them up). This reduces both the amount of disk required to store your data AND the amount of bandwidth necessary to send the data.
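That hash conversation can be sketched in a few lines of Python. This is a toy in-memory model, not any vendor's actual protocol: the `DedupeServer` class, `backup` function, and fixed 8-byte block size are all illustrative assumptions.

```python
import hashlib

# Toy sketch of the source de-dupe conversation (not any vendor's protocol).
# The "server" keeps a store of blocks keyed by their hash and counts how
# many bytes actually crossed the wire.
class DedupeServer:
    def __init__(self):
        self.store = {}          # hash -> block data
        self.bytes_received = 0  # bandwidth actually consumed

    def has_hash(self, digest):
        return digest in self.store

    def put_block(self, digest, block):
        self.bytes_received += len(block)
        self.store[digest] = block

def backup(server, data, block_size=8):
    """Client side: hash each block, only send blocks the server lacks."""
    refs = []
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        digest = hashlib.sha256(block).hexdigest()
        if not server.has_hash(digest):      # "have you seen this hash?"
            server.put_block(digest, block)  # no -> send the data
        refs.append(digest)                  # yes -> just record a reference
    return refs

server = DedupeServer()
# 24-byte "file" whose first and third 8-byte blocks are identical
refs = backup(server, b"block-01block-02block-01")
print(server.bytes_received)  # 16 -- the duplicate block never travels
```

The point of the sketch is the bandwidth accounting: the third block matches a hash the server already has, so only a reference is recorded and only 16 of the 24 bytes are transferred.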

Source De-duplication

Advantages

Reduces bandwidth usage all over

Can protect a remote office without any hardware installed there (up to a certain amount of data)

Written by W. Curtis Preston (@wcpreston). For those of you unfamiliar with my work, I've specialized in backup & recovery since 1993. I've written the O'Reilly books on backup and have worked with a number of native and commercial tools. I am now Chief Technical Architect at Druva, the leading provider of cloud-based data protection and data management tools for endpoints, infrastructure, and cloud applications. These posts reflect my own opinion and are not necessarily the opinion of my employer.


11 thoughts on “Two different types of de-duplication”

ddierickx says:

I think NBU 6.5 will use source de-duplication, as it will have PureDisk technology built in. I wonder how much more resources this will require from master and/or media servers.

It is also the only way to do efficient remote backups, but nobody ever talks about (remote) restores. Prepping a server at the main site and then shipping it somewhere on the other side of the globe is not always a good/preferred way to do it.

For me, the restore part is still one piece of the puzzle that needs to be solved.

You seem to suggest, rather explicitly, that dedupes are functioning on the file level. Not only am I pretty sure that’s not true (including not of Puredisk, at least, that’s not what they said at the EDPF in Minnesota in 2005), but I think it’d be pretty boneheaded to hash files rather than blocks.

I just re-read the article, and I could see where you think I’m saying that de-dupe is file-level, but that’s not what I’m saying. I just gave a duplicated file as an example. I’ll re-edit the blog entry and give another example.

BTW, there is file-level de-dupe and sub-file-level de-dupe. File-level de-dupe is also called CAS, or content-addressable storage, and yes — they do the hash at the file level.

But what we’re talking about here is sub-file-level de-dupe. This catches not only duplicated files, but also duplicated pieces of files (i.e. blocks) that have already been seen. So when you back up a spreadsheet every day because it gets updated every day, each day you should back up only the new blocks in that spreadsheet.
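The spreadsheet example above can be made concrete with a short sketch. Assume (for illustration only) fixed 8-byte blocks and SHA-256 hashes; the two byte strings stand in for Monday's and Tuesday's versions of the file:

```python
import hashlib

def block_hashes(data, block_size=8):
    """Split data into fixed-size blocks and hash each one (sub-file de-dupe)."""
    return [hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]

day1 = b"ROW1-abcROW2-defROW3-ghi"   # Monday's "spreadsheet"
day2 = b"ROW1-abcROW2-XYZROW3-ghi"   # Tuesday: only row 2 changed

seen = set(block_hashes(day1))       # everything backed up on day 1
new_blocks = [h for h in block_hashes(day2) if h not in seen]
print(len(new_blocks))  # 1 -- only the changed block travels on day 2
```

Even though the whole file "changed" on Tuesday, only one of its three blocks has a hash the backup server hasn't seen, so only that block gets backed up again.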

In a recent podcast (sorry, I forget which one), Curtis said “inline or delayed de-dupe doesn’t matter when selecting a backup appliance” (my words and estimation of Curtis’ intent). Although he went on at length to discuss some effects, I thought he left out a few important considerations.

First, some or all delayed de-dupe systems seem to require scheduling a de-dupe batch process, during which the data or data location may not have full functionality.

Second, the de-dupe process is extremely resource (especially CPU) hungry. With the process running, the appliance may not perform reasonably with other functions. Of course, this will change for the better over time. Inline de-dupe appliances are built to withstand high resource consumption during backup … there really are no significant back-end processes.

Third, if the de-duped data is sent “off site”, such as with a proprietary system sending the de-dupe information to a similar box, … the movement off site is necessarily delayed until de-dupe can be accomplished. No such delay is required with inline de-dupe.

Fourth, if de-dupe is done “inline”, the process of getting data from its source to off site is simpler. Simple is good.

For these reasons, I see the vendors with “inline” de-dupe as having a significant advantage … one that shouldn’t be waved off as unimportant.

Of course, it’s possible that I didn’t realize my hearing aid batteries needed changing at the time I listened to Curtis’ podcast. 😉 … and my generalizations may well be worse than I think Curtis’ was. Cheers, Wayne

I did say that none of the features that people typically debate matter (inline vs post process, MD5 vs SHA-1 vs custom, reverse vs forward referencing, etc). What matters is:

1. How big is it? (i.e. how much disk do you give me and what de-dupe ratio do I get with my data?)

2. How fast is it? (i.e. how fast are backups, restores, and the overall de-dupe process?)

3. How much does it cost?

All of your arguments against post-process above are aimed at #2. My opinion is that neither in-line nor post-process de-dupe systems can claim any kind of victory. Both have advantages and disadvantages that have to be tested out with your data and your servers. Then, when all that testing is done, you get to compare how big, fast, and expensive the systems are. THAT’s all that matters.

Having said that, I’d like to comment on some of your statements, as I think they represent common misunderstandings about the process. Instead of doing it here, I’m going to do it in another blog post.

Target/inline de-duplication doesn’t need a custom/modified backup agent; i.e., the backup agent sends the incremental data without bothering about duplicates. This saves storage but not bandwidth.

Source de-duplication has a modified agent that sends hashes before sending the data. This saves time, bandwidth, and storage.

And BTW, the simple checksum-matching technique used in PureDisk is of no good use. A simple byte insertion can shift all the blocks.

I said "Source de-duplication requires you to use different backup software on the client(s) where you want to use it."

Also, you cannot completely dismiss Symantec like that. I realize you’re a competitor and you need to position against them, but they are absolutely NOT "worthless." I know of several very large installations that are very happy. Also, how you slam a very successful product with a beta product is beyond me.