When you’re investigating and planning large repositories for data (backups, archives, file servers, ISO/VHD stores, …) and you’d like to leverage Windows Data Deduplication, you have to keep in mind that the maximum supported size for an NTFS volume is 64TB. Volumes can be a lot bigger, but that’s the maximum supported. Why? Well, Microsoft guarantees that everything will perform & scale up to that size and that all NTFS functionality will be available. Functionality like volume shadow copies or snapshots. NTFS volumes cannot be larger than 64TB or you cannot create a snapshot. And guess what data deduplication seems to depend on?

Here’s the output of Get-DedupStatus for a > 150TB volume:

Note “LastOptimizationResultMessage : A volume shadow copy could not be created or was unexpectedly deleted”.

Looking in the Deduplication event log we find more evidence of this.

Data Deduplication was unable to create or access the shadow copy for volumes mounted at "T:" ("0x80042306"). Possible causes include an improper Shadow Copy configuration, insufficient disk space, or extreme memory, I/O or CPU load of the system. To find out more information about the root cause for this error please consult the Application/System event log for other Deduplication service, VSS or VOLSNAP errors related with these volumes. Also, you might want to make sure that you can create shadow copies on these volumes by using the VSSADMIN command like this: VSSADMIN CREATE SHADOW /For=C:

Now there are multiple possible issues that might cause this, but if you’ve got a serious amount of data to back up, please check the size of your LUN, especially if it’s larger than 64TB or flirting with that size. It’s tempting, I know, especially when you only focus on dedup efficiencies. But you’ll never get any dedupe results on a > 64TB volume. You don’t get any warning for this when you configure deduplication, so if you don’t know this you can easily run into this issue. So next to making sure you have enough free space, CPU cycles and memory, keep the partitions you want to dedupe a reasonable size. I’m sticking to +/- 50TB max.
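As a quick sanity check while planning LUNs, the sizing rule above can be sketched in a few lines of Python. This is only an illustration: the function name and the messages are made up for this post, the 64TB limit is the supported maximum discussed above, and the 50TB threshold is just my own rule of thumb, not an official number.

```python
# Sketch only: classify a planned volume size against the limits discussed above.
VSS_MAX_TB = 64         # supported NTFS/VSS maximum mentioned in the text
RULE_OF_THUMB_TB = 50   # personal comfort limit ("+/- 50TB max"), an assumption

def check_volume_size(size_tb: float) -> str:
    """Return a short verdict for a volume you plan to deduplicate."""
    if size_tb > VSS_MAX_TB:
        # No shadow copy can be created, so optimization never produces results.
        return "too large: no shadow copy, so no dedup optimization"
    if size_tb > RULE_OF_THUMB_TB:
        return "supported, but above the 50TB rule of thumb"
    return "ok"

for size in (50, 60, 150):
    print(size, "TB ->", check_volume_size(size))
```

Running this for the > 150TB volume from the example gives the "too large" verdict, which is exactly the situation where the optimization job fails on the shadow copy.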

I have blogged before on the maximum supported LUN size and the fact that VSS can’t handle anything bigger than 64TB in the post “Windows Server 2012 64TB Volumes And The New Check Disk Approach”. So while you can create volumes of many hundreds of TB, you’ll need a hardware provider that supports bigger LUNs if you need snapshots, and the software needing these snapshots must be able to leverage that hardware VSS provider. For backups and data protection this is a common scenario. In case you ask, I’ve done a quick crazy test where I tried to leverage a hardware VSS provider in combination with Windows Server data deduplication. A LUN of 50TB worked just fine but I saw no usage of any hardware VSS provider. Even if you have a hardware VSS provider, it’s not being used for data deduplication (not that I could establish with a quick test anyway) and to the best of my knowledge I don’t think it’s possible, as these providers have not exactly been written with this use case in mind. Comments on this are welcome, as I had no more time to dig in deeper.

To wave goodbye to 2012 I’m posting the latest screenshot of the easiest and most effective money saving feature you got in Windows Server 2012, which went RTM in August. Below you’ll find the status report of a backup LUN in a small environment. Yes, those are real numbers in a production environment.

If you are not using it, you’re really throwing away vast amounts of money on storage right this moment. If you’re in the market for a practical, economical and effective backup solution, my advice to you is the following. Scrap any backup vendor or product that prevents its files or LUNs from being deduplicated by Windows Server 2012. They might as well be robbing you at gun point.

You can pay for a very nice company New Year’s party with these savings.

I wish you all a great end of 2012 and a magnificent 2013 ahead. In 2013 we’ll push Windows Server 2012 into service where we couldn’t before (waiting for 3rd party vendor support, and if they keep straggling they are out of the door) and work at making our infrastructure ever more resilient and protected. With System Center SP1 some products of that suite will make a comeback in our environment. 10Gbps is bound to become the standard all over our little data center network, and not just for our most important workloads.

There is a small environment that provides web presence and services. In total there are about 20 production virtual machines. These are all backed up to a Transparent Failover File Share on a Windows Server 2012 cluster that is used to host all the infrastructure and management services.

The LUN/volume for the backups has about 5.5TB of storage available. The folder layout is shown in the screenshot below. The backups are run “in guest” using native Windows Backup, which has the WindowsImageBackup subfolder as its target. Those backups are archived to an “Archives” folder. That archive folder is the one that gets deduplicated, as the WindowsImageBackup folder is excluded.

This means that basically the most recent version is not deduplicated, guaranteeing the fastest possible restore times at the cost of some disk space. All older (> 1 day) backup files are deduplicated. We achieve the following with this approach:

It provides us with enough disk space savings to keep archived backups around for longer in case we need them.

It also provides for enough storage to backup more virtual machines while still being able to maintain a satisfactory number of archived backups.

Any combination of the above two benefits can be balanced against the business needs.

It’s a free, zero cost solution
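To make the policy above a bit more tangible, here is a minimal Python sketch of the selection logic: skip anything under WindowsImageBackup and only consider files older than one day. The function name and the example paths are hypothetical and purely for illustration. In reality this is configured on the deduplication feature itself, not scripted like this.

```python
from pathlib import PureWindowsPath

# Sketch only: mirrors the policy described above, not how Windows implements it.
EXCLUDED_FOLDERS = {"windowsimagebackup"}  # most recent backups stay untouched
MIN_AGE_DAYS = 1                           # "All older (> 1 day) backup files"

def is_dedup_candidate(path: str, age_days: float) -> bool:
    """True if a file would be picked up for deduplication under this policy."""
    parts = {p.lower() for p in PureWindowsPath(path).parts}
    if parts & EXCLUDED_FOLDERS:
        return False
    return age_days > MIN_AGE_DAYS

# Hypothetical paths, purely for illustration.
print(is_dedup_candidate(r"T:\Backups\VM01\WindowsImageBackup\VM01\disk.vhd", 7))
print(is_dedup_candidate(r"T:\Backups\VM01\Archives\2012-12-01\disk.vhd", 7))
print(is_dedup_candidate(r"T:\Backups\VM01\Archives\2012-12-30\disk.vhd", 0.5))
```

Only the second file qualifies: the first sits in the excluded folder and the third is too young.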

The Results

About 20 virtual machines are backed up every week (small delta and lots of stateless applications). As the optimization runs we see the savings grow. That’s perfectly logical. The more backups we make of virtual machines with a small delta, the more deduplication can shine. So let’s look at the results using Get-DedupStatus | fl

A couple of weeks later it looks like this.

Give it some more months, with more retained backups, and I think we’ll keep this around 88%-90%. From tests we have done (ddpeval.exe) we think we’ll max out at around 80% savings rate. But it’s a bit less here overall because we excluded the most recent backups. Guess what, that’s good enough for us. It beats buying extra storage or paying a wad of money for disk deduplication licenses from some backup vendor or appliance. Just using the built-in deduplication mechanisms in Windows Server 2012 saved us a bunch of money.

The next step is to also convert the production Hyper-V cluster to Windows Server 2012 so we can do host based backups with the native Windows Backup that now supports Cluster Shared Volumes (another place where that 64TB VHDX size can come in handy as Windows backup now writes to VHDX).

Some interesting screen shots

The volume reports we’re using 3TB in data. So 2.4TB is free.

Looking at the backup folder you see 10.9TB of data stored on 1.99TB of disk.
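If you want to turn those two folder numbers into a savings percentage yourself, the arithmetic is simple. The Python sketch below uses the plain ratio of size on disk to logical size. That is my assumption about how to read these numbers, and it is not necessarily how the SavingsRate in Get-DedupStatus is calculated, since that one is reported per volume.

```python
def savings_rate(logical_tb: float, on_disk_tb: float) -> float:
    """Fraction of the logical data that does not take up disk space."""
    if logical_tb <= 0:
        raise ValueError("logical size must be positive")
    return 1.0 - on_disk_tb / logical_tb

# Numbers from the folder properties above: 10.9TB of data on 1.99TB of disk.
rate = savings_rate(10.9, 1.99)
print(f"{rate:.1%}")  # prints 81.7%
```

So for this folder alone we’re at a bit under 82% savings.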

So the properties of the volume report more disk space used than the actual folder containing the data. Let’s use WinDirStat to have a look.

So the above agrees with the volume properties. In the details of this volume we again see about 2TB of consumed space.

Could it be that the volume is reserving some space to ensure proper functioning?

When you dive deeper you get a cool view of the storage space used. Where Windows Explorer is aware of deduplication and shows the non-deduplicated size for the vhd file, WinDirStat is not. It always shows the size on disk, which is a whole lot less.

This is the same as when you ask for the properties of a file in Windows Explorer.

Discussion

Is it the best solution for everyone? Not always, no. The deduplication is done on the target after the data is copied there. So in environments where bandwidth is seriously constrained and there is absolutely no technical and/or economical way to provide the needed throughput, this might not be a viable solution. But don’t dismiss this option too fast. In a lot of scenarios it is a very good and cost effective feature. Technically & functionally it might be wiser to do it on the target, as you don’t consume too much memory (deduplication is a memory hog) and CPU cycles on the source hosts. Also nice is that these deduped files are portable across systems. VEEAM has demonstrated some nice examples of combining their deduplication with Windows dedupe by the way. So this might also be an interesting scenario.

Financially, the cost of deduplication functionality with hardware appliances or backup software hurts like the kick of a horse straight onto the head. So even if you have to invest a little in bandwidth and cabling you might be a lot better off. Perhaps, as you’re replacing older switches with new 1Gbps or 10Gbps gear, you might be able to recuperate the old ones as dedicated backup switches. We’re using mostly recuperated switch ports and native Windows NIC teaming, and it works brilliantly. I’ve said this before: saving money whilst improving operations rarely gets you fired. The sweet thing about this is that it is achieved by building good & reliable solutions, which means they are efficient even if it costs some money to achieve. Some managers focus way too much on efficiency from the start, as to them it means nothing more than a euphemism for saving every € possible. Penny wise and pound foolish. Bad move. Efficiency, unless it is the goal itself, is a side effect of a well designed and optimized solution. A very nice and welcome one for that matter, but it’s not the end all be all of a solution or you’ll have the wrong outcome.

Backing Up 100 Plus Terabyte of Data Cheaply

When dealing with large amounts of data to back up you’re going to start bleeding money. Sure, people will try to sell you great solutions with deduplication, but in a lot of scenarios this is not a very cost effective solution. Dedupe in either backup hardware or software is very expensive and in some scenarios the cost cannot be justified. It’s also not very portable by the way, except in scenarios where you stick with certain vendors. Once you get into backing up > 100TB you need to forget about overly expensive hard & software. Just build your own solutions. Now depending on your needs you might want to buy backup software anyway, but forget about dedupe licenses. Some of the more profitable hosting companies & cloud providers are not buying appliances or dedupe software either. They make real good money but they’d rather spend it on SUVs and swimming pools.

What Can You Do?

You can build your own solution. Really. You can put together some building blocks that scale up and out. You’ll need a dual socket server with two 8 core CPUs and 24GB of RAM, perhaps 32GB. Plug in some 6Gbps SAS controllers, hook those up to a bunch of 3.5” disk bays with 12 * 2TB or 3TB disks each and you’re good to go. You can scale out to about 8 disk bays if you don’t cluster. Plug in a dual port 10Gbps card. You’ll need that as you’ll be hammering that server. If you need more than this system, then scale out: put in a second, a third, etc. 3.5TB to 4TB of backup capacity per hour in total should be achievable.
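To get a feel for what such a building block gives you, here is a back of the envelope calculation in Python. It’s a rough sketch with some assumptions of mine: the capacity is raw (before any RAID, parity or hot spare overhead), the throughput numbers are the 3.5TB to 4TB per hour mentioned above, and the function names are just made up for the example.

```python
# Sketch only: raw capacity and backup window for the build described above.
BAYS = 8             # "about 8 disk bays if you don't cluster"
DISKS_PER_BAY = 12   # 12 disks per 3.5" bay

def raw_capacity_tb(disk_tb: int, bays: int = BAYS,
                    disks_per_bay: int = DISKS_PER_BAY) -> int:
    """Raw capacity in TB, before RAID/parity/hot spare overhead."""
    return bays * disks_per_bay * disk_tb

def backup_hours(data_tb: float, tb_per_hour: float) -> float:
    """Hours needed to move data_tb at a sustained tb_per_hour."""
    return data_tb / tb_per_hour

print(raw_capacity_tb(2))                 # 192 (TB raw with 2TB disks)
print(raw_capacity_tb(3))                 # 288 (TB raw with 3TB disks)
print(round(backup_hours(100, 3.5), 1))   # 28.6 (hours for 100TB at 3.5TB/h)
print(backup_hours(100, 4.0))             # 25.0 (hours for 100TB at 4TB/h)
```

So a full pass over 100TB takes a bit more than a day at those rates, which is one more reason to spread the load over multiple nodes.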

When you buy the components from Supermicro and some online retailers you can do this pretty cheaply. Spare parts, you say? Buy some cold spares. You can have a dozen disks on the shelf, a SAS controller and even a disk bay if you want. You could use hardware redundancy (RAID, hot spares) or use storage pools & spaces if you’re going the Windows Server 2012 route and save some extra money. Disk bay failure? Scale out so that even when you lose a node you still have three others up and running. Spread backups around. Don’t back up the same data only to the same node. I know it’s not perfect for deduplication with Windows 2012 that way but hey, you win some, you lose some. Checks & balances, right? If you need a bit more support get some DELL PowerVaults or the like. It depends on what you’re comfortable with and how deep your pockets are.

You can buy more storage than dedupe will ever save you & still come out with money to pay for the electricity. Okay, it’s less good for the penguins, but trust me, those companies selling those solutions would fry a penguin for breakfast every day if it would make them money. Now talking about those penguins, the Windows Server 2012 deduplication feature could be providing me with the tools to save them, but that’s for another post. I hope this works. I’d love to see it work. I bet some would hate to see it work. So much perhaps that they might even consider making their backup format non-dedupable?

Tip for users: Don’t use really cheap green SATA disks. They’re pretty environment friendly but the performance sucks. My view on “Green IT” is to right size everything, never to oversubscribe, and let that infrastructure work hard for you. This will minimize the hardware needed and the performance is way better than with all the power saving settings and green hardware, which will ruin the environment anyway, as you’ll end up buying more gear to compensate for the lack of performance unless you just suffer the bad performance. Keep the green disks for the home user’s picture, movie & music collection and use 2TB/3TB SAS/NL-SAS. Remember that when you don’t cluster (shared storage) you can make do nicely without the enterprise NL-SAS disks.

Now I’m not saying you should do what I suggest here, but you might find it useful to test this on your own scale for your own purposes. I did it for the money. For the money? Yup, for the money. No, not for me personally, I don’t have a swimming pool and I don’t even own a car, let alone an SUV. But saving your company 100.000 or more in cash isn’t going to get you into trouble now, is it? Or perhaps this is the only way you’re going to afford to back up that volume of data. People don’t throw away data and they don’t care about budgets, so you’d better be able to restore their data. Which reminds me, you will also need some backup software that doesn’t cost an arm and a leg. That’s also a challenge, as you need one that can handle large amounts of data and has some intelligence when it comes to virtualization, snapshots etc. It also has to be easy to use, as simple as possible, as this helps ensure backups are made and are valid.

Are we trying to replace appliances or other solutions? No, we’re trying to provide lots of cheap and “fast enough” storage. Why fast enough? Pure speed on the target side is not useful if the sources can’t deliver. Reading the data & providing it to the backup device can be an issue as well. We need this backup space for when the shit really hits the fan and all else has failed. That doesn’t have to be a SAN crash or a SAN firmware issue ruining all your nice snapshots. It can also be the business detecting a mistake in a large data set a mere 14 months after the facts, when all replicas, snapshots etc. have already expired. I’m sure you’ve got quality assurance that is so rock solid that this would never happen to you but hey, welcome to my world.