Data Deduplication with Linux

After resolving all the more obscure dependencies, you're ready
to build and install the lessfs package. Build and install it using
the same configure, make and sudo make install commands
from earlier.

Before you can use the filesystem, some preparation is needed. The
lessfs source directory contains a subdirectory called etc/, and in
it is a configuration file. Copy the configuration file to the
system's /etc directory:

$ sudo cp etc/lessfs.cfg /etc/

This file defines the location of the databases, among a few other
details (which I discuss later in this article; for now, let's
concentrate on getting the filesystem up and running). You need
to create the directories for the file data (default /data/dta) and
the metadata (default /data/mta) used by all file I/O operations
sent to and from the lessfs filesystem. Create the directory paths:

$ sudo mkdir -p /data/{dta,mta}

Initialize the databases in the directory paths with the mklessfs command:

$ sudo mklessfs -c /etc/lessfs.cfg

The -c option specifies the path and name of the configuration
file. A man page does not exist for the command, but you still
can display a usage summary with the -h option.

Now that the databases have been
initialized, you're ready to mount a lessfs-enabled filesystem. In
the following example, let's mount it to the /mnt path:

$ sudo lessfs /etc/lessfs.cfg /mnt

When mounted, the lessfs filesystem assumes the total capacity of the
filesystem that hosts its databases. In my case, that is the filesystem
on /dev/sda1.
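You can confirm the reported capacity with df (a sketch; the device
name and sizes will reflect your own backing filesystem):

```shell
# A mounted lessfs volume reports the capacity of the filesystem
# backing its databases (here, whatever holds /data).
df -h /mnt
```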

Currently, you should see nothing but a hidden .lessfs subdirectory
when listing the contents of the newly mounted lessfs volume:

$ ls -a /mnt/
. .. .lessfs

Once mounted, the lessfs volume can be unmounted like any other volume:

$ sudo umount /mnt

Let's put the volume to the test. Writing
file data to a lessfs volume is no different from writing to any
other filesystem. In this example, I use the dd command to
write approximately 100MB of zeros to /mnt/test.dat.
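Such a write might look like the following sketch; the bs and count
values are assumptions, and TARGET defaults to a temporary path so
you can try it without a lessfs mount:

```shell
# Write ~100MB of zeros. Point TARGET at /mnt/test.dat on a mounted
# lessfs volume to reproduce the example; it defaults to /tmp here.
TARGET=${TARGET:-/tmp/test.dat}
dd if=/dev/zero of="$TARGET" bs=1M count=100
```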

Because the filesystem is designed to eliminate redundant copies
of data, and a file filled with nothing but zeros is a prime example
of such redundancy, only 48KB of capacity was consumed, and that
may be nothing more than the necessary data synchronized to the
databases.

If you take a detailed listing of that same file in the lessfs-enabled
directory, it appears that all 100MB have been written. Using
its embedded logic, lessfs reconstructs all data on the fly when
additional read and write operations are issued to the file(s).
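A long listing shows the full logical size. This sketch recreates the
zero-filled file first, with /tmp standing in for the lessfs mount:

```shell
# ls reports the file's logical size (~100MB), not the much smaller
# deduplicated footprint that lessfs actually stores on disk.
F=${F:-/tmp/test.dat}            # stand-in for /mnt/test.dat
[ -f "$F" ] || dd if=/dev/zero of="$F" bs=1M count=100 2>/dev/null
ls -lh "$F"
```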

Now, let's work with something a bit more complex—something
containing
a lot of random data. For this example, I decided to download the latest
stable release candidate of the Linux kernel source from http://www.kernel.org,
but before I did, I listed the total capacity consumed on
the lessfs volume as a reference point.
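Reproducing the kernel download requires network access, so here is a
hedged stand-in: /dev/urandom supplies the random payload, and the
paths, sizes and MNT default are assumptions, not the article's
original listing:

```shell
# Random data dedupes poorly within itself, but identical copies of it
# still collapse to one stored instance on a deduplicating filesystem.
MNT=${MNT:-/tmp}                 # point at your lessfs mount, e.g. /mnt
dd if=/dev/urandom of=/tmp/random.dat bs=1M count=10 2>/dev/null
df -k "$MNT"                     # reference point before the copies
cp /tmp/random.dat "$MNT/copy1.dat"
cp /tmp/random.dat "$MNT/copy2.dat"
df -k "$MNT"                     # on lessfs, grows by about one copy, not two
```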

And, because the databases contain the actual file data and metadata,
the physical data will not be lost if an accidental or intentional
system reboot occurs, or if for whatever reason you need to unmount
the filesystem. All you need to do is invoke the same mount command,
and everything is restored.

Petros Koutoupis is a software developer at IBM for its Cloud Object Storage division (formerly Cleversafe). He is also the creator and maintainer of the RapidDisk Project. Petros has worked in the data storage industry for more than a decade.

Comments

Nice article. I work in an academic lab where we crunch massive amounts of data, and storage is always a huge headache for us. In the past we've had access to HSM storage management solutions, but the slowest tier has always been tape. It turns out that getting your data back from tape takes longer in some cases than just recomputing it, which already takes weeks on HPCs. It seems to me that if you could create an HSM-type solution with a fast parallel filesystem, like Lustre, as the fastest storage tier and a compressed, deduplicated filesystem on slower, cheaper magnetic disks, you might have a more reasonable, cost-effective storage system for HPC. (I have not run any numbers though, and I'm not sure whether you could build a system like this with OTS software/hardware.)

If you want to take advantage of de-duplication in your basement or development lab for your virtual machines you could consider using SmartOS as the underlying hypervisor platform. It comes with KVM as the hypervisor and ZFS as the filesystem. To enable de-dupe in ZFS it is simply: "zfs set dedup=on pool/filesystem", plus all the other awesome features of ZFS. Instant snapshots, clones, compression, etc. Then you can run your favorite GNU/Linux platform on top of it with de-duplication happening under the hypervisor. This ZFS de-duplication is all open-source and hails from the Illumos kernel.