Dedupe anyone?

I think deduplication (block level, file level, object level, whatever) should rapidly become standard (and free) on ALL storage. It's just such a win on capacity as well as throughput over the network. Dedupe it all everywhere, and then we have less to sling around and less to store permanently. Then make mirrored/RAID/Cleversafe-style copies of the deduped data as needed to protect against failures.
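
Just to illustrate the core trick, here's my own toy sketch in Python (not any particular vendor's design - real products use variable-length chunking, collision handling, metadata journals and so on): store each unique block exactly once, keyed by a fingerprint of its contents, and keep per-file "recipes" that point back into that pool.

import hashlib

BLOCK_SIZE = 4096  # made-up fixed block size; real systems often chunk variably

class DedupeStore:
    """Toy block-level dedupe: each unique block is stored once, keyed by its hash."""
    def __init__(self):
        self.blocks = {}   # fingerprint -> block bytes (stored exactly once)
        self.files = {}    # file name -> list of fingerprints (the "recipe")

    def write(self, name, data):
        recipe = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)   # a repeated block costs no extra space
            recipe.append(digest)
        self.files[name] = recipe

    def read(self, name):
        return b"".join(self.blocks[d] for d in self.files[name])

    def stored_bytes(self):
        return sum(len(b) for b in self.blocks.values())

store = DedupeStore()
payload = b"x" * 1_000_000
store.write("a.img", payload)
store.write("b.img", payload)                   # identical copy: new recipe, no new blocks
print(store.stored_bytes(), 2 * len(payload))   # physical bytes vs. logical bytes

The mirroring/RAID then only has to protect that single physical block pool (plus the recipes), which is where the capacity win comes from.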

CPU cycles are cheap and constantly getting cheaper thanks to Moore's Law, so whether dedupe is software- or hardware-based and where you do it (initial creation, first-tier storage, pre-backup, post-backup, archive, all of the above) will matter less and less. Even though storage capacity is still staying close to that curve as well, IMHO we will probably hit the ceiling on storage density sooner than on MIPS/watt or MIPS/mm², just due to the way disk and even flash are architected. Network bandwidth (especially WAN) is nowhere near those kinds of growth curves. Until we get multi-GbE wireless everywhere, it will be painful to download, back up and so forth all of these multi-GB files (think Blu-ray/HD DVD, etc.). So we will have to minimize data movement more and more relative to the distance of transmission... and dedupe is a great way to do that.
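
Here's a rough sketch of that WAN angle (again my own illustration, with an invented in-memory "remote index" standing in for whatever the far end really keeps): ship fingerprints for every block, but ship the block itself only when the other side has never seen it. Repeat backups of mostly-unchanged data then cost almost nothing on the wire.

import hashlib, os

BLOCK_SIZE = 4096  # made-up block size for illustration

def dedupe_send(data, remote_index):
    """Ship only the blocks whose fingerprints the far side doesn't already have."""
    bytes_on_wire = 0
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        bytes_on_wire += len(digest)           # the fingerprint always crosses the WAN
        if digest not in remote_index:
            remote_index[digest] = block       # new block: send it and remember it
            bytes_on_wire += len(block)
    return bytes_on_wire

remote = {}
backup = os.urandom(2_000_000)       # stand-in for one night's backup image
print(dedupe_send(backup, remote))   # first run: roughly the full 2 MB crosses the wire
print(dedupe_send(backup, remote))   # unchanged repeat: only fingerprints cross the wire

Real WAN-optimization and backup dedupe products do essentially this negotiation, just with smarter chunking and persistent indexes on both ends.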

Of course I could be wrong, but it still plays into my previous posts about setting up an "all possible worlds" type of storage and then just indexing into that array - it would basically be 100% read and no write after the (admittedly huge & lengthy) initialization. Imagine every single possible 4 GB (or 16, or 64...) file cached on a huge RAM or flash array... accessible at GbE or 10GbE latency at worst. Sweet! And if that were somehow done with qubits... all bets are off. Talk about your rainbow tables! So whoever invents the first 4 GqB (giga-quantum-bit) storage is going to have the NSA and DHS beating down their doors...
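
For scale (my own back-of-the-envelope arithmetic, nothing more): the number of distinct 4 GB files is 2 raised to the number of bits in the file, and an index wide enough to name any one of them is itself 4 GB wide - which is why the qubit hand-waving is where all the fun is.

import math

bits_per_file = 4 * 2**30 * 8                     # a 4 GB file is 2**35 bits
digits = int(bits_per_file * math.log10(2)) + 1   # decimal digits in 2**(2**35)
print(f"distinct 4 GB files: a number roughly {digits:,} digits long")
print(f"bits needed to index any one of them: {bits_per_file:,} (i.e. 4 GB)")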

--
Disclaimer: I do not speak for my employer and all the opinions expressed herein are solely my own.

4 Comments

Dedup is another one of those buzzwords, like virtualization, CDP or WAN optimization. Every vendor says they've got it, and each one has their own interpretation of it. All those new technologies are good as individual pieces, but with all the emerging and overlapping new technologies taken as a whole, I am not quite sure...

Recently I looked into vendors for CDP for Exchange. The few vendors I looked at basically bundled in all those new technologies, kitchen sink included, whether you want them or not. Many solutions come with dedupe, compression, encryption, WAN optimization and on and on...

In this case, my shop already has half of those technologies. It does not make sense to buy them twice. If my data is already encrypted on disk, the WAN optimization, the dedupe, etc., will just add overhead to my existing infrastructure.

I am begging the industry not to give us point solutions any more. Cut the buzzwords and make all those new technologies work well together, and we will all benefit from it.

I would think that if the data is encrypted, it won't make a difference to dedupe, as long as you don't use different encryption keys for data in the same dedupe domain. In other words, the encrypted duplicate blocks will still hash down identically to each other, just as the unencrypted ones would, unless the encryption is a chaining mode like CBC (cipher block chaining), which I believe is mostly used for comms encryption. I think we would have to look closer at the various "at rest" data encryption methods to get a definite answer. I do agree the overhead could be a killer if you have to decrypt - dedupe - re-encrypt to see any dedupe savings. It's all a matter of storage bits vs. processor ops vs. network bandwidth - how do you balance intelligently? Processors and storage still seem to be on that Moore's Law exponential curve, while network (especially WAN) has lagged way behind. So using storage and processing to crunch the data before it hits the network pipes makes more and more sense as time goes on...
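
To make that hashing point concrete, here is a toy illustration (the "cipher" below is a fake keystream XOR I made up purely so the example runs with the standard library - it is not real encryption): with a deterministic, identically-keyed transform, duplicate blocks still fingerprint identically and so still dedupe; give each write its own IV, the way chaining/randomized modes effectively do, and the duplicates vanish as far as the deduper can tell.

import hashlib

def toy_encrypt(block, key, iv):
    """Stand-in for a cipher: XOR with a keystream derived from key + iv.
    Purely illustrative - not actual encryption."""
    stream = b""
    counter = 0
    while len(stream) < len(block):
        stream += hashlib.sha256(key + iv + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(b ^ s for b, s in zip(block, stream))

block = b"identical plaintext block" * 100
key = b"shared-dedupe-domain-key"

# Same key, fixed IV (deterministic): identical plaintext -> identical ciphertext -> dedupes.
c1 = toy_encrypt(block, key, b"fixed-iv")
c2 = toy_encrypt(block, key, b"fixed-iv")
print(hashlib.sha256(c1).digest() == hashlib.sha256(c2).digest())   # True

# Per-write IVs: same plaintext, different ciphertext, so the dedupe engine sees nothing in common.
c3 = toy_encrypt(block, key, b"iv-one")
c4 = toy_encrypt(block, key, b"iv-two")
print(hashlib.sha256(c3).digest() == hashlib.sha256(c4).digest())   # False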

Hi All,
Chirag, the tutorial is very nice.
We have dedupe for free in our storage products and the customers want to use it. In many cases, for example in a VMware environment, the advantages are large: about 80% space savings with NFS datastores and 70% with VMFS LUNs. In other cases the advantage can be closer to 30% space savings, depending on the kind of data. Not all data can be deduped; for example, movies are very difficult to shrink.
I think it is a good new feature against the data explosion, but it must be joined with other tools like ILM and archiving.

