Posted
by
timothy
on Saturday March 27, 2010 @11:31PM
from the its-missing-apostrophes dept.

tazzbit writes "The storage vendors have been crowing about data deduplication technology for some time now, but a new open source project, Opendedup, brings it to Linux and its hypervisors — KVM, Xen and VMware. The new deduplication-based file system called SDFS (GPL v2) is scalable to eight petabytes of capacity with 256 storage engines, which can each store up to 32TB of deduplicated data. Each volume can be up to 8 exabytes and the number of files is limited by the underlying file system. Opendedup runs in user space, making it platform independent, easier to scale and cluster, and it can integrate with other user space services like Amazon S3."

From what I can tell, this is NOT a way to deduplicate existing filesystems, or even a layer on top of existing data, but a new filesystem operating perhaps like eCryptfs, storing backend data on an existing filesystem in some FS-specific format.

So, having said that, does anyone know if there is a good way to resolve EXISTING duplicate files on Linux using hard links? For every identical pair (a, b) found, b would be deleted and replaced with a hard link to a. I know there are plenty of duplicate file finders (fdupes, some Windows programs, etc.), but they're all focused on deleting things rather than simply recovering space using hard links.
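Absent a ready-made tool, a rough sketch of the idea is below. It assumes GNU coreutils and that everything lives on one filesystem (hard links can't cross filesystems); the function name is made up, it trusts SHA-1 to mean "identical", and it will choke on filenames containing newlines:

```shell
#!/bin/sh
# Hypothetical sketch, not a polished tool: replace duplicate regular
# files under a directory with hard links to a single copy.
dedup_hardlink() {
    dir=$1
    # Hash every file, then sort so identical hashes end up adjacent.
    find "$dir" -type f -exec sha1sum {} + | sort |
    while read -r hash file; do
        if [ "$hash" = "$prev_hash" ]; then
            # Same content as the previous file: drop this copy and
            # hard-link it to the first one we saw with this hash.
            ln -f "$prev_file" "$file"
        else
            prev_hash=$hash
            prev_file=$file
        fi
    done
}
```

A real tool would also want to check file sizes first (cheap pre-filter), skip zero-length files, and preserve the newer mtime, but the core loop is about this simple.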

Actually, just the title did it. I've historically had a bad habit of backing things up by taking tar/gz archives of directory structures, giving them obscure names, and putting them onto network storage. Or sometimes just copying directory structures without zipping first. Needless to say, this makes for a huge mess.

It just occurred to me that it would not be difficult to write a quick script to extract everything into its own tree, run sha1sum on all files, and identify duplicate files automatically, probably in just one or two lines.
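For the already-extracted case, it really is about one line (assuming GNU coreutils; the 40 in `-w40` is the length of a hex SHA-1):

```shell
# List groups of identical files under the current directory.
# uniq -w40 compares only the 40-character hash column;
# --all-repeated=separate prints each duplicate group, blank-line separated.
find . -type f -exec sha1sum {} + | sort | uniq -w40 --all-repeated=separate
```

Anything that prints in a group is a duplicate; files with unique hashes are suppressed entirely.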

So in other words -- thanks Slashdot! The otherwise unintelligible summary did me a world of good -- mostly because there was no context as to what the hell it was talking about, so I had to supply my own definition...

I appreciate any deduplication solution for Linux, for sure, but isn't any deduplication creating a lot of shared resources which could possibly be exploited for attacks (e.g. on the privacy of other users)?

Most likely in the implementation itself, not the de-duplication process.

Let's say users A and B have some file in common. Without de-duplication, the file exists in both home directories. With de-duplication, one copy of the file exists for both users. Now, if there is an exploit such that you could find out whether this has happened, then user A or B will know that the other has a copy of the same file. That knowledge could be useful.

Ditto for critical system files: if you could generate a file and have it match a protected system file, this might be useful for exploiting the system. E.g., /etc/shadow (which isn't normally world-readable). If you can find a way to tell that deduplication has happened, you can effectively learn the contents of these critical files for other purposes.

Note that you can't *change* the file (because that would just split the copies apart again), but being able to read the file (when you couldn't before), or knowing that another copy exists elsewhere, can be very useful knowledge. That all depends, though, on the de-duplication mechanism inadvertently revealing when this happens.

Claim 1(a) requires "dividing an information item into a common portion and a unique portion".

It may be that the patent covers the case where the unique portion is empty, but then again maybe not, especially if the computer never takes the step to find out! In other words, if you treat every item as a common item (even if there is only one copy), there is a good chance the patent might not apply.

(There is also a good chance that the patent is written the way it is specifically because it doesn't apply to that case -- it may be that there is prior art in one of the referenced patents.)

It is one of those things that once you start using it, the benefits become apparent.

Here are some:

1) One application on one machine. No more wondering if application X has somehow messed up application Y. The writers of the software probably developed the application in a clean environment, and this lets you run it in a clean environment. Gets rid of vendor finger-pointing, too.

2) One application on one machine. If application X fouls the nest, you can reboot it and know that you are not also terminating applications Y, Z, A, and B.

3) Machine portability. The drivers in a VM guest are generic -and- uniform. Nothing inside the (guest) machine changes if you move the machine from a host with an Intel NIC to a host with a Broadcom NIC. The benefit here is that when hardware fails (and it will), it is pretty quick and easy to assign the boot disk to a different host, and boot the machine up. Think 10 - 30 minutes (per machine) to recover from a burned up power supply*.

4) Machine portability. There are some solutions that let you auto-fail-over to a new host when the guest stops responding. That burned up power supply could now be a two minute outage and NO emergency notification call.

5) Machine portability. Platespin lets you auto-migrate machines on a schedule to a few blades at night, power down those blades for power savings, and then power them up a little before business hours and migrate back. In a large data center, the electricity savings are enough to make it worth it.

6) Machine flexibility. Does application X not need much in the way of processing power? With the VM manager software, assign it one CPU and 256 MB RAM. Later find out that wasn't enough? Up the specs and reboot.
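With KVM/libvirt, for instance, that resize is a couple of commands (the domain name "appX" here is hypothetical, and `virsh setmem` takes a value in KiB):

```shell
virsh setvcpus appX 1 --config            # one virtual CPU
virsh setmem appX 262144 --config         # 256 MB of RAM
virsh shutdown appX && virsh start appX   # cold restart so --config changes apply
```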

7) Reboot speed. In paravirtualized environments, the OS is already loaded in the host VM, so the guest VM just links and loads. I've seen entire machine reboots that take 16 seconds.

Almost every mission critical system these days is running in either a clustered or virtualized environment. I work in the financial services industry and there are many reasons we virtualize pretty much everything these days. These, however, are probably the biggies:

- Redundancy: If a physical machine dies, its virtual machines can be moved over to a spare, often with no interruption in service.
- Isolation: Just because you can run multiple services on a box doesn't mean you should. It poses potential security problems (one compromised app can open the door to compromising another), makes managing users and resources more difficult, and the applications can interact or conflict in unexpected ways. Many vendors demand that their application be the only one running on a machine or they won't support it.
- Portability: An OS configured for use on a virtual machine can be run on any platform which runs the virtual machine, without modification.

ZFS has also offered block-level dedupe support since ZFS v21. You can run it via FUSE on Linux, or natively on OpenSolaris. Hopefully it'll also be available in FreeBSD 9.0, if not sooner (FreeBSD 7.3/8.0 have ZFS v14).

Since ZFS already checksums every block that hits the disk, dedupe is almost free, as those checksums are re-used for finding/tracking duplicate blocks.
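For reference, turning it on is just a property flip (the pool and device names below are made up; `dedup` is a per-dataset property):

```shell
zpool create tank mirror c1t0d0 c1t1d0
zfs set dedup=on tank
zfs get dedup tank        # verify the property took
zpool list tank           # the DEDUP column shows the running dedup ratio
```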