Although Microsoft provides a tool (ddpeval.exe) that gives a generic preview of the results a system administrator can expect from this new mechanism, I think that sharing real-world statistics on a per-filetype basis can greatly help in understanding its usefulness.

So I decided to use a bunch of disks, each dedicated to a single type of file, and enable deduplication at the filesystem level.

I had four 50 GB SATA disks in stock, which I formatted with NTFS (remember that the new ReFS in Windows 2012 doesn't support deduplication for the moment).
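For reference, the whole setup can be done from PowerShell; here's a minimal sketch, assuming the volumes are already formatted (the drive letters are the four test disks described below, and I lower the minimum file age so freshly copied files are eligible for optimization right away):

```powershell
# Install the Data Deduplication role service
Install-WindowsFeature -Name FS-Data-Deduplication

# Enable deduplication on each test volume
Enable-DedupVolume -Volume F:, H:, L:, M:

# By default only files older than a few days get optimized; a minimum
# file age of 0 days makes the test files eligible immediately
Set-DedupVolume -Volume F:, H:, L:, M: -MinimumFileAgeDays 0
```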

I ran three series of tests.

In the first series of tests, I copied a single file to each disk and made enough copies of it to fill 10 gigabytes of space. The file extensions I chose for this test are quite common in everyone's data: .avi for a movie, .mp3 for an audio file, .doc for a Microsoft Word document and .iso for a disk image. These extensions cover the types of files you find in media libraries, document libraries and software libraries, so having an idea of what Windows 2012 deduplication can do for them is important.

On my first disk, named F:, I copied the .avi file, whose size is 700 MB. I then made 15 copies of it in order to use 10 GB.

On the second disk, H:, I put the .mp3 file, whose size is 4.50 MB. I made 2291 copies of it in order to use 10 GB.

On the third disk, L:, I stored a 1.75 MB Microsoft Word document, then made 5277 copies of it to take about 10 GB.

On the last disk, M:, I put a pretty big 3.09 GB ISO image and copied it three times, taking a little less than 10 GB.

On these four disks I enabled Data Deduplication and waited for it to occur. A few days later I came back to see the results, and they were as good as I expected: since those hundreds and thousands of files were just replicas of the same original file, the probability of finding identical blocks was 100%.
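Rather than waiting for the scheduled run, an optimization job can also be kicked off by hand and monitored; a sketch for one of the volumes:

```powershell
# Start an optimization job immediately and watch its progress
Start-DedupJob -Volume F: -Type Optimization
Get-DedupJob

# Once the job has finished, inspect the savings
Get-DedupStatus -Volume F: | Format-List
```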

As you can see, I got very good deduplication results for documents and music.

Here's a table where the theoretical size of all the files without deduplication (5th column) and the real size on disk (6th column) can easily be compared. The last column shows the overhead of the deduplication mechanism, calculated as: [Real used drive space MB] - [Size on disk MB].

Data type  | File extension | Original file size MB | Number of copies | Theoretical total size MB | Size on disk MB | Real used drive space MB | Dedup overhead MB
-----------|----------------|-----------------------|------------------|---------------------------|-----------------|--------------------------|------------------
Music      | .mp3           | 4.5                   | 2,291            | 10,322.1                  | 8.9             | 245.0                    | 236.1
Video      | .avi           | 700.3                 | 15               | 10,504.8                  | 0.1             | 937.0                    | 936.9
WinWord    | .doc           | 1.8                   | 5,277            | 9,237.3                   | 20.6            | 274.0                    | 253.4
Disc Image | .iso           | 3,167.1               | 3                | 9,501.2                   | 3,167.1         | 6,021.1                  | 2,854.0

Here are the same results as they are reported by the Server Manager interface:
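The same figures can also be pulled from PowerShell instead of Server Manager:

```powershell
# Per-volume savings, roughly what the Server Manager tiles display
Get-DedupVolume | Select-Object Volume, Capacity, FreeSpace, SavedSpace, SavingsRate
```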

In the second series of tests, I used the same volumes as above (which I reformatted first) to store many different files of the same type on each of them. Here's what I did.

On volume F: I stored 52 different .avi test files taking 20.6 GB of space. Their sizes vary between 100 MB and 1.5 GB.

On volume H: I put 3714 different .mp3 test files. Their total size is 17.6 GB.

On volume L: I put 773 Word documents. Total size: 745 MB.

On volume M: I copied 9 .iso images taking 8.87 GB of disk space.

Once all the files were in their respective partitions, I enabled Data Deduplication and waited for all the optimization, scrubbing and garbage collection jobs to finish.
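For the record, those three maintenance jobs can also be run by hand instead of waiting for the schedule; -Wait blocks until each one completes:

```powershell
# Run the three dedup maintenance jobs in sequence on one volume
Start-DedupJob -Volume H: -Type Optimization -Wait
Start-DedupJob -Volume H: -Type GarbageCollection -Wait
Start-DedupJob -Volume H: -Type Scrubbing -Wait
```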

When I came back a few days later to see the situation, here's what I got:

On the other hand, as you could expect, the optimization gain on .avi files is close to zero, and the same goes for .mp3 files. That's because the applications that write these kinds of files already eliminate redundant information, so identical blocks are highly unlikely. In theory I should get no better results with pictures (.jpg, .jpeg) and other kinds of compressed audio files. Nonetheless, having Windows 2012 deduplicate your media library or picture library will allow you to keep duplicate pictures, films or other kinds of files on the same volume without necessarily wasting more space.

Let's imagine, for instance, that you have a set of personal photos and you want to copy some of them to a folder on the same partition and share them through some kind of web service. The amount of used space would stay the same, because the deduplication engine would see that some blocks are replicated and replace them with a pointer.

In the third series of tests, this last statement is exactly what I aim to demonstrate.

On volume H:, where the .mp3 files are stored, I created a subfolder named 'copy of music library' and copied 1000 mp3 files from the root folder into it.

Unsurprisingly, the disk space used did not increase at all. We went from:

As you can see, the number of files increased by 1000, but the free space stayed the same.
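A sketch of the same experiment in PowerShell (folder name as above; note that the freshly written copies only stop consuming extra space once an optimization job has processed them):

```powershell
# Copy 1000 mp3 files into a subfolder of the same volume
New-Item -ItemType Directory -Path 'H:\copy of music library'
Get-ChildItem -Path H:\ -Filter *.mp3 |
    Select-Object -First 1000 |
    Copy-Item -Destination 'H:\copy of music library'

# After optimization, free space should be back near its previous value
Start-DedupJob -Volume H: -Type Optimization -Wait
Get-DedupVolume -Volume H: | Select-Object Volume, FreeSpace, SavedSpace
```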

So, in the end, my opinion is that block-level deduplication is a nice improvement in the world of storage management, both for home and professional use. Windows 2012 does a really good background job of seeking out duplicate blocks, and I have encountered no errors at all so far. I have been running this for at least three months now and I am definitely happy with it.

Feel free to contribute to this post by sharing your deduplication experience. I think an interesting debate can be had on this subject if many people pop in and share their thoughts.

I haven't heard of any limit for the moment. What do you mean by 'it refuses to dedupe'? Is there any event in the logs? Can you check and see if the fsdmhost.exe process is running? Maybe it's taking a while due to the size of the data to analyze. Also try ddpeval.exe and see what it reports.

Regards,
Carlo

It just sits at a 0 rate and savings. Fsdmhost.exe is not running currently, and ddpeval fails with: "ERROR: Evaluation not supported on system, boot or Data Deduplication enabled volumes", regardless of whether dedupe is on or off.

There is a VSS warning in the dedupe logs, which I am looking into.

Funny, it had no problem with a 60 TB volume I had created prior to the bigger one...

Log Name:      Microsoft-Windows-Deduplication/Operational
Source:        Microsoft-Windows-Deduplication
Date:          2/5/2013 9:06:49 AM
Event ID:      4110
Task Category: None
Level:         Warning
Keywords:
User:          SYSTEM
Computer:      dc-mgmt-02.vdc.local
Description:
Data Deduplication was unable to create or access the shadow copy for volumes mounted at "K:" ("0x80042306"). Possible causes include an improper Shadow Copy configuration, insufficient disk space, or extreme memory, I/O or CPU load of the system. To find out more information about the root cause for this error please consult the Application/System event log for other Deduplication service, VSS or VOLSNAP errors related with these volumes. Also, you might want to make sure that you can create shadow copies on these volumes by using the VSSADMIN command like this: VSSADMIN CREATE SHADOW /For=C:

Hi Josh, I imagine you tried to issue the suggested VSSADMIN CREATE SHADOW /For=K: and it failed, right? If you issue the very same command on your 60 TB volume, what do you get? It would be interesting to understand whether there is something wrong with your filesystem or whether you are lacking the disk space for fsdmhost to run.

As a side question, how long did it take to dedupe your 60 TB volume? And how much space were you able to save (also considering the file types)?

I found this: http://technet.microsoft.com/en-us/library/cc755419%28v=ws.10%29.aspx

Maybe with a little math you can figure out which component is limiting VSS. It could be related to paged pool or non-paged pool exhaustion... which is what I would try to rule out first. How much RAM do you have on that server?

I did try creating a shadow copy via the GUI, which failed. I currently have 32 GB of RAM on this server, more than enough to handle the dedupe and VSS requirements. Thanks for that article, it may provide some good clues.

Dedupe on the 60 TB volume was very quick when it was just a bunch of duplicate ISOs.

Thanks very much for sharing this information! I was aware of the dependence upon the VSS writer, but I am surprised that deduplication is limited to 64 TB and that Microsoft didn't tell us! Maybe they didn't bother testing their configuration maximums... waiting for someone else to do it on their behalf...

Hello, I'm curious about read performance on the de-duplicated volume. I'm familiar with DataDomain appliances, which are great with write performance but terrible at random reads. Did you happen to do any performance testing on the de-duplicated volume?

For the use case I'm considering, which is a backup repository for VM images, random read i/o is fairly important. We use Veeam Backup, and we want to use their Surebackup feature; this spins up the VM directly from the backup image in an isolated network. Our experience with DataDomain and this feature hasn't been very good, so we're looking into alternatives...

For the moment I haven't tested read performance, but in theory there should be a 5-10% (some say 3%) overhead on seldom-read files, while files in the cache see a performance improvement (the dedupe engine has a sort of caching system).

Such good read performance in the Windows dedupe engine is due to the algorithm behind the Master File Table, which is a B-tree.

In a few words, your Windows disk has a list of files and folders organized hierarchically, so that a search takes only a few steps (say, three) to find a given filename; from there a first data chunk is read, and for the rest of the data a pointer tells you where it sits on the disk.

Deduplication adds a new type of pointer (a reparse point specific to deduplication) which points one or more files to the same sectors on disk.
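You can actually see that reparse point on any optimized file with fsutil (the file path here is just an example):

```powershell
# An optimized file reports reparse tag 0x80000013 (IO_REPARSE_TAG_DEDUP)
fsutil reparsepoint query "H:\track0001.mp3"
```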

So you see, there were pointers before and there are pointers now, and perf stays roughly the same.

For more, check this: http://www.happysysadm.com/2012/10/data-deduplication-in-windows-server.html or look on Wikipedia for a good explanation of how a B-tree works.

Thanks for the interesting article. Have you by any chance had a look at the 2012R2 deduplication feature? I've read reports that it has improved quite a bit, especially in the amount of data it can process in 24h (so it is now able to handle larger files).

Hopefully stability wise as well - I also read of some people getting metadata corruption on the deduped volumes after a while.

Nice article. I've had similar numbers in my tests. The only thing that confuses me is the folder size I see in the folder properties. For example, a folder with 22 MKV files shows Size 6.87 GB, Size on disk 0 bytes?! On that disk there are more than 3000 AVI and MKV files. I really don't expect any savings on these types of files, so why is Windows reporting numbers like this? I can't find anything about this anywhere.

Since dedupe uses reparse points, what you're seeing is that all of the "files" in that folder are just pointers into the chunk store. In your case, there might be no savings at all from deduplication or compression, but the chunks are still in the store, so all the files are reparse points and size on disk shows zero. In general, even when size on disk is not zero, it's not a useful measure of how much space is used by the files, nor is it a useful measure of how much space was saved by dedupe.

With dedupe, you can't really determine how much disk space is "used" by a particular set of files if those files are in policy, since the chunks that make up that file will (hopefully!) belong to lots of files. But you can determine how much space you would reclaim if you deleted the files and then ran a cleanup job. To do that, use the Measure-DedupFileMetadata cmdlet in powershell: http://technet.microsoft.com/en-us/library/jj659278(v=wps.620).aspx
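For instance (the folder path is hypothetical):

```powershell
# Space that would actually be reclaimed by deleting this folder
# and then running a garbage collection job
Measure-DedupFileMetadata -Path 'M:\iso-archive'
```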