Get out and push! Getting the most out of DFSR pre-staging

Hi, Ned here again. Today I am going to explain the inner workings of DFSR pre-staging in Windows Server 2003 R2, debunk some myths, and hand out some best practices. Let’s get started.

To begin, this is the last time I will say ‘pre-staging’. While the term is commonly used, it’s a bit confusing once you start mixing in terminology like the Staging directories. So from here on I will refer to this as ‘pre-seeding’ and hope that it enters your vernacular.

Pre-seeding is the act of getting a recent copy of replicated data to a new DFSR downstream node before you add that server to the Replicated Folder content set. This means that we can minimize the amount of data we transfer over the wire during the initial sync process and hopefully have that downstream server be available much quicker than simply letting DFSR copy all the files in their entirety over potentially latent network links. Administrators typically do this with NTBACKUP or ROBOCOPY.

How Initial Sync works

Before we can start pre-seeding, we need to understand how this initial sync system works under the covers. The diagram below is grossly simplified, but gets across the gist of the process:

Take a long look here and tell me if you can see a performance pitfall for pre-seeding. Give up? In step 6 on the upstream server, files need to be added to the staging directory before the downstream server can decide if it needs the whole file, portions of a file, or no file (because they are identical between servers). Even if both servers have identical copies, the staging process must cycle through on the upstream server in order to decide what portions of the file to send. So while very little data will be on the wire when all is said and done, there is some inherent churn time upstream while we decide how to give the downstream server what it needs, and it ends up meaning that initial sync might take longer than expected on the first partner. So how can we improve this?

How initial sync works with pre-seeding

First let’s take a look at how things will work on our third and all subsequent DFSR members in a Replication Group:

Since the staging directory upstream is already packed full of files, a big step is skipped for much of the process and the servers can concentrate on actually moving data or file hashes around. This means things go much faster (keeping in mind that the staging directory is a cache and is finite; the longer one waits, the more likely changes are to push out previously staged data). In one repro I did for this post, I found these results in my virtual server environment:

To determine replication time, I measured the difference between DFSR Event Log events 4102 and 4104 (like so):

Event Type: Warning
Event Source: DFSR
Event Category: None
Event ID: 4102
Date: 2/8/2008
Time: 11:40:35 AM
User: N/A
Computer: 2003MEM21
Description: The DFS Replication service initialized the replicated folder at local path e:\dbperf and is waiting to perform initial replication. The replicated folder will remain in this state until it has received replicated data, directly or indirectly, from the designated primary member.
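
If you would rather pull those two events from a command prompt than eyeball the Event Viewer, something like this should work; it is a rough sketch that assumes the DFSR event log is named ‘DFS Replication’, as on Win2003 R2:

REM List the 4102 (initial sync started) and 4104 (initial sync finished)
REM events with their timestamps; subtract the two times for the duration.
wmic ntevent where "LogFile='DFS Replication' and (EventCode=4102 or EventCode=4104)" get EventCode,TimeGenerated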

55% faster is nothing to blow your nose at – and this is just a small amount of low-latency data. Take a very large set of data on a very slow, high-latency link: base initial sync could take, for example, 2 weeks, out of which only 2 hours are spent staging files and computing hashes, and the rest is spent sending data across the wire. In that case pre-seeding could be (2 weeks – 2 hours) / 2 weeks ≈ 99% faster. As you can see, the fact that data was already staged upstream meant that we spent considerably less time rolling through the staging directory and more of our time actually confirming that the servers were in sync.

To get the most bang for our buck, we can do some of the following to spend the least amount of time populating the staging directory and the most time syncing files:

Set the staging directory quota on your hub servers as close to the size of your data as possible. Since hub servers tend to be beefier boxes and certainly closer to home than your remote branches, this isn’t a problem for most administrators. If you have the disk space, a staging quota that is the same size as the data volume will give the absolute best results.

When pre-seeding, always use the most recent backup possible and pre-seed off hours. The less data that is in flux in the staging directory while we run through initial replication the better. This may seem like a no-brainer, but customers frequently contact us about slow initial sync that they started at 9AM on a Monday with a terabyte of highly dynamic data!

The latest firmware, chipset, network and disk drivers from your hardware vendor will usually give an incremental performance increase (and not just with DFSR performance). You wouldn’t dream of running your servers without service packs and security hotfixes – why wouldn’t you treat your hardware the same way?
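
To apply the staging quota recommendation above from the command line, Dfsradmin.exe can set it per membership. A hedged sketch with placeholder names (double-check dfsradmin membership set /? for the exact parameters on your build):

REM Placeholder names: replication group "BranchRG", replicated folder "Data",
REM hub server "HUB01". The staging size is specified in megabytes.
dfsradmin membership set /rgname:BranchRG /rfname:Data /memname:HUB01 /stagingsize:512000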

Important Technical Notes (updated 2/28/09)

1. ROBOCOPY – If you use robocopy.exe to pre-seed your data, ensure that the permissions on the replicated folder root (i.e. c:\my_replicated_folder) are identical on the source and target servers before beginning your robocopy commands. Otherwise when you have robocopy mirror the files and copy the permissions, you will get unnecessary 4412 conflict events and perform redundant replication (your data will be fine). The issue here is in how robocopy.exe handles security inheritance from a root folder, and how that can change the overall hash of a file. So using the command-line switches /COPYALL /MIR /Z /R:0 is perfectly fine as long as the permissions on the source and destination folder are *identical*. After pre-seeding your data with robocopy, you can always use ICACLS.EXE to verify and synchronize the security if necessary.
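
To make that concrete, a hypothetical pre-seed of E:\my_replicated_folder from a hub server might look like this (the server, share, and paths are placeholders; ICACLS shipped with Vista/2008 and is a download for 2003):

REM Check the root ACLs on both boxes first - they must match exactly:
icacls E:\my_replicated_folder
REM Then mirror the data down from the hub:
robocopy \\HUB01\E$\my_replicated_folder E:\my_replicated_folder /COPYALL /MIR /Z /R:0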

2A. NTBACKUP (on Win2003 R2) – If you use NTBACKUP to pre-seed your data on a server that already hosts DFSR data on the same volume (i.e. you are going to use a new Replicated Folder on the E: drive, and some other data was already being replicated to that E: drive), and you plan on restoring from a full disk backup, you need to understand an important behavior. NTBACKUP is aware of DFSR; NTBACKUP will set a restore key under the DFSR service’s key in the registry (HKLM\System\CurrentControlSet\Services\DFSR\Restore\<date time>) and mark the DFSR service with a non-authoritative restore flag for that volume. The DFSR service will be restarted and the Replicated Folders on that volume will do a non-authoritative sync. This should not be destructive to data, but it does mean that your downstream server could become unresponsive for minutes or hours while it syncs. When DFSR was written, the thought was that NTBACKUP would be used for disaster recovery, where you would certainly be suspicious of the data and the DFSR jet database and want a consistency sync performed at restore time.
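
If you are unsure whether a restore left the service flagged, you can look for the key mentioned above (the command simply errors out if the key does not exist):

REM Lists any non-authoritative restore markers NTBACKUP left for the DFSR service
reg query HKLM\System\CurrentControlSet\Services\DFSR\Restore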

2B. Windows Server Backup (Windows Server 2008 and Windows Server 2008 R2) – same as above but with newer tools. Do not use NTBACKUP to remotely back up or restore Windows Server 2008 or later. This is unsupported and will mark files HIDDEN and SYSTEM, which you certainly don’t want…

3. XCOPY – The XCOPY /O command works correctly even without having the root folder permissions set identically, unlike robocopy. However, it is certainly not as robust and sophisticated as robocopy in other regards. So XCOPY is a valid option, but maybe not powerful enough for many users.
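
A minimal sketch (the server and paths are placeholders again):

REM /O = copy ownership and ACL info, /E = subdirectories including empty ones,
REM /H = hidden and system files, /K = keep file attributes
xcopy \\HUB01\E$\my_replicated_folder E:\my_replicated_folder /O /E /H /K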

4. Third party solutions – be wary of third party tools and test them carefully before committing to using them for wide-scale pre-seeding. The key thing to remember is that the file hash is everything – if DFSR cannot match the upstream and downstream hashes, it will replicate the file on initial sync. This includes file metadata, such as security ACLs (which checksum-only tools do not calculate). In the Windows Server 2008 R2 beta, check out the DFSRDIAG tool to see how we have made this a bit easier for people. If you really need a file hash checking tool, open a support case with us; we have some internal ones.
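
On that 2008 R2 beta it looks roughly like this (the file path is a placeholder; run DFSRDIAG /? on your build to confirm the exact syntax):

REM Prints the DFSR hash of a file so you can compare upstream vs. downstream copies
dfsrdiag filehash /filepath:E:\my_replicated_folder\somefile.doc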

Wrap Up

Finally – I don’t have numbers here for Windows Server 2008 yet, sorry. I can tell you that DFSR behaves the same way in regards to the staging process. Based on the performance improvements made elsewhere though (specifically the 16 concurrent file downloads combined with asynchronous RPC and IO), it should be much faster, pre-seeded or not; that’s the Win2008 DFSR mandate.


10 years ago

mkielman

Ned – I am still very confused about pre-seeding.

Here are my questions:

1. Based on the graphic, in step #5, what is the downstream server sending to the upstream server? I am assuming a comparison is being made but I am not clear how that is happening.

2. We have tried to pre-seed by robocopying files to the destination servers, but they always end up in Conflict and Deleted. I am assuming it is because the timestamps aren’t identical. Could this be possible? We are not using the /copyall or /copy:S.

3. Overall it sounds like there are two things that can be done to "pre-seed". First, copy data (using robocopy or ntbackup) to destination server. Second, make sure that data exists in the stage directory on the upstream server. Is this true? How can you ensure the data exists in the staging directory on the upstream server?

1. It’s sending along hash and version vector info requests – i.e. "what specific changes do you have for me?"

2. Timestamps won’t matter; they are only used as a tie-breaker when two people edit the file on two servers *in between* replication. I’m not sure why you’re seeing this if you are using the right robocopy switches – do you see the same issue (as a test) using just XCOPY?

3. The first time, there’s nothing you can do – staging will just have to happen by walking the files upstream, in a linear fashion. The *next* (i.e. 3rd or later) server added will be able to make use of what is already staged upstream to make that process go quicker. The bigger the upstream staging, the faster you go. This is why for the ‘data hub’ servers that feed lots of branches, we recommend you beef up the disk space on that machine to allow as much staging as possible – ideally, an equal amount as the size of the data itself.

– Ned

10 years ago

mkielman

Hey Ned!

Thanks for your help! I was under the impression that data in the staging directory went away after replication. Is that not true? Does it stay as long as the quota isn’t surpassed?

Stays in staging until doomsday or quota being exceeded, whichever comes first. 🙂

10 years ago

mkielman

One more question (I hope)! I have two directories that are replication partners and one of them (the non-primary of the two) has more in the Staging directory than the other. They are both using the default stage folder size. Does it make sense that they wouldn’t be the same size?

2. There have been conflicts (event 4112). Those can generate duplicate staged entries on a server that will not exist on the other.

10 years ago

mkielman

Ned – I have a situation where changes to two directories are not occurring and I found the following in the debug log:

Conflicting file was created by the same author

Do you know how I can work around this? Essentially what is happening is a data administrator is updating files in one directory, then renaming another directory so that she can rename the updated directory to the same name as the other directory. Does that make sense? Example:

Dir1 – Original Data

Dir 2 – Updated Data

Rename Dir1 to Dir1.old

Rename Dir2 to Dir1

Anyway, there are no file handles open to either of those directories but this is not working due to that error in the logs.

Is this happening on Win2008 or Win2003 R2? I just tried reproducing this on 2008 with a little batch file and had no issues – does this look right for my repro:

@echo off

md E:\robowakkas\dir1

md E:\robowakkas\dir2

ren E:\robowakkas\dir1 dir1.old

ren E:\robowakkas\dir2 dir1

10 years ago

mkielman

Ned –

Thanks for the reply. This is with Windows 2003 R2. Yes that is exactly the process!

I am trying to look through the old logs to find that log entry but I can’t find it :/

So you don’t see any reason why that process shouldn’t work? What if there had been recent updates to Dir2 that were in the middle of replicating?

10 years ago

mkielman

Ok I found the full log entry:

20080824 11:00:17.635 6268 MEET 3294 Meet::GetNameRelated -> WAIT Name conflicting file was created by the same author updateName:Fv

10 years ago

rghabel

Ned, I have a question that I’m afraid to know the answer to.

I have a small setup, just two servers. They are connected by VPN and the branch site is simply using DSL for internet connectivity.

The branch site recently dumped 500GB of data onto their server. The propagation is happening and I see the VPN is always pegged at 90%, so I have no doubt that it will eventually finish (some time around Christmas!).

Question: is it possible to ‘re-seed’ to the primary with a copy of just this new data? For convenience, I thought I would simply run a NT Backup and put the backup file on an external hard drive. Walk up to the main server and extract during off hours.

Also, on a side question; above you mention not to use NT Backup because it waits to sync, would this wait time be omitted with a bounce of the server?

Sorry for delay in response, I’ve been away due to a death in the family for a week and change.

You could ‘re-seed’ by tearing out the replication group, pre-seeding data with a backup, then setting replication back up again. NTBACKUP should be fine as long as the data is restored right to where you want it (i.e. not to some folder, then copied into the real folder, as you could be changing permissions if you’re not careful with your xcopy commands).

NTBACKUP should not be a problem here as you are not backing up the entire drive, just a folder.


10 years ago

xxdcmast

I have read through your article and just want to say this is great stuff, really helpful in understanding the inner workings of DFS.

I do have one question about pre-seeding though; we are going to attempt a pre-seed in the near future. I was originally leaning toward doing it with NTBackup because I read about people having issues with robocopy. In your article you mention robocopy should work as long as you use the correct switches.

Could you please let us know what your preferred switches are when using robocopy.

The absolute *KEY* is that the permissions on the source and destination folder be *identical*. If they are not, file hashes will get all screwed up. My preferred switches are the ones from the post above: /COPYALL /MIR /Z /R:0.

If your permissions are entirely inherited from the root (which is extremely unlikely), you can instead use just

/MIR /Z /R:0

You can use ICACLS.EXE to make sure they match first. This is included in Vista/2008 and later, and as a download from microsoft.com. You can also do the robocopy /MIR, then run ICACLS across the entire data set to make sure the permissions get synced with the source data.
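
For example (the paths here are just placeholders):

REM On the source server: dump the ACLs for the whole tree into a file
icacls E:\my_replicated_folder /save dfsracls.txt /t /c
REM On the destination server: re-apply them. Note that /restore runs against
REM the PARENT of the saved folder, since the file stores relative paths.
icacls E:\ /restore dfsracls.txt /c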

– Ned

10 years ago

jasonh

Ned,

I have DFS-R replicating a share between two servers and need to move the location of the replicated data on one of the servers to another drive on the same system.

I can’t seem to find anything on moving from one drive to another, so I’m thinking the procedure might be to copy the data with robocopy to the new location, remove the existing replication for that server and then re-create the replication at the new location with the same share name.

I’m of course concerned about it replicating over the slow network connection, so I’d like to pre-seed the new location. Does this sound right to you? If I use robocopy to copy the data, do I leave the DfsrPrivate folder behind or copy it as well?

You suspect correctly. If you use robocopy to move it over on the same computer, it should work just like a pre-seed.

What I recommend is you do this as a dry run test with a tiny RG/RF and a couple files and folders. If you do not see any 4412 conflict events in your test, your method worked and you can go whole-hog with the real data.

Oh, forgot – yes, leave the dfsrprivate behind. It cannot be reused and DFSR will just waste time draining it out.

10 years ago

jasonh

Ned,

Thank you very much for verifying this for me! I will try it out on my test share that has 80 MB of data in it and will use the robocopy options you specified above.

Thanks again!

Jason

10 years ago

hockeman

Hey Ned, I just tested your Robocopy /COPYALL /MIR /Z /R:0 and received the 4412 event messages for the data… do you have any suggestions on what I could be doing wrong? I started with a blank directory on Server B and robocopied the data with the switches from Server A. I have a lot of data pre-existing and I’m trying to prevent having to bring the remote site servers back to my corporate office. I’ve been using another replication software and am unhappy with it; however, the data is mostly consistent in all sites. Suggestions? I’m using Server 2008 on all servers. Thanks!!

And the folder being copied – its permissions were perfectly identical between servers before you ran the robocopy command? That was the important point; if they differ in the slightest, you will run into that issue.

so:

c:\somefolder <-- on source server

f:\somefolder <-- on dest server

Permissions on that folder before running robocopy were exactly identical for both servers?

10 years ago

hockeman

Hey Ned! That was exactly the problem! Thank you very much! Just to give you a bit of history we have had numerous problems with DFSR and are attempting one last go prior to completely scrapping the product. The biggest problem we had was with the database crashing. I’ve read in at least one of your posts that you don’t recommend NIC teaming and I believe we’ll give that a go as all of our servers are teamed. Also are there any articles out there on recommendations for virus scanner exclusions? Lastly we are replicating roughly 3TB of data. Do you have any recommendations on how we should better manage that amount of data? In previous attempts we have had it split up into replication sets with the largest being 2TB. Yes, we are aware of the official Microsoft testing “up to” but not limited to one TB. Thanks again for your timely response!!

As for the database – are you using Win2008? The code there for database reliability and recovery was improved extensively over the original version 1 release in 2003 R2. Regardless, you should not be seeing database corruption unless there is something severely wrong with your server’s hardware, power, or add-on filter driver software.

"System Volume InformationDFSR folders and their contents (includes DFSR.DB). This system-protected directory contains working files for the DFS Replication service. It should not be scanned because these files are always in use by the service.

<Replicated folder path>dfsrprivate folders and their contents"

That amount of data should be ok as long as it’s not highly dynamic. Any idea how many changes it gets an hour?

10 years ago

hockeman

Thanks for the AV info Ned. We are using Server 2008. I’m being told that there are a few hundred changes in an hour. I’ve started testing the pre-seed on existing data on server A and server B. My original test was to change perms on both folders to match exactly, run the robocopy with suggested switches and create the replication group. I’m receiving the 4412 messages for that folder. These files were originally copied by a third party software. If at all possible, I’d like to prevent having to robocopy the data all over again. Suggestions? In the previous post I asked about NIC teaming… was your response to that "add-on filter driver software"? Thanks!

You could explore using ICACLS.EXE /save and /restore to get a copy of ‘good’ ACLs and sync them all up on the server, perhaps.

10 years ago

hockeman

Hey Ned, thanks for the suggestion; however, it didn’t seem to make a difference (very cool tool by the way). I did go through and compare a couple of the files on Server A and Server B and noticed that their modified dates are identical but their created dates are different. Would that make a difference? If so, are there any tools that I could use to modify the created date on Server B?

Dates won’t matter, DFSR doesn’t care about that. Odd that the ICACLS trick didn’t work, it usually does. You might want to open a support case with us and have an engineer dedicated to your issue to really dig into your data here (in ways I can’t do through a comments section 😛 ).

10 years ago

hockeman

Hey Ned! Thanks for your response on my DFSR questions. I have one final question for you about pre-staging. I have created my replication group with pre-staged data, and the replication has completed and appears to be stable (woo hoo). Now, I have one other server with pre-staged data that I’d like to add to the set. This pre-staged data is about a week old, and I want to be sure that until its "prescan/reconcile" has completed it doesn’t override my current production data in the current initial working set. Should I set one of the existing members to "primary" with the dfsr command tools until it’s finished with the initial prescan? Thanks again!

When a new server is added to an existing replication group, it will automatically be non-authoritative. If you want to minimize any risk though, I’d suggest resyncing that week old data with just the differences using robocopy – robocopy does this by default.

And if you want to be really, REALLY paranoid here, I’d suggest getting a backup of your known good data off-hours and adding the new server off-hours. That way even if solar flares and elves caused any issues, you’d have a guaranteed out.

Setting primary would not be a good idea here, as any data on that server would overwrite data on your existing, working downstream server that might actually have newer data depending on the timing/latency of replication.

– Ned

10 years ago

hockeman

Thanks Ned! I appreciate it! If I need to ask this in another forum, please let me know, but it’s concerning folder exclusions. I have added a wildcard folder exclusion *9600* and successfully excluded the folder from DFSR. The problem arises when I remove the exclusion and DFSR doesn’t resume replicating the folder. Any ideas?

FYI- I have exhausted the internet to better understand this… as with the other issues I’ve posted.

Thank you!!

10 years ago

Olliman

Hey Ned,

I read this blog with fascination, and in this article I found something about the 4412 event ID, which is currently my biggest problem. I don’t understand the reason why this problem occurs.

Unfortunately roaming profiles react very sensitively to a 4412 event. Lost desktop links and the like are typical consequences.

I read this article before. If multiple people had opened a file concurrently, I would have understood the reason for the 4412 event ID. But that wasn’t the case. The 4412 always refers to someone’s profile path, and this profile path is exclusively accessible to one person.

Example (German client operating system):

The DFS Replication service detected that a file was changed on multiple servers. A conflict resolution algorithm was used to determine the winning file. The losing file was moved to the Conflict and Deleted folder.

New Name in Conflict Folder: Adressbuch-{0C51EB52-971E-46E8-84DC-55CD95F9B36F}-v172746.lnk

Replicated Folder Root: g:\Profile

File ID: {3F477226-BA27-4101-B275-52697EDF7AE2}-v159131

Replicated Folder Name: Teacher

Replicated Folder ID: 20CDE110-FE6C-4EC0-9429-6290C8425D9F

Replication Group Name: htl-vienna5.school\profile\teacher

Replication Group ID: 573DCEFB-8886-450E-A9CD-A99428348DE7

Member ID: 160F163E-F5C8-464C-B615-50CBFC1E9B57

It’s impossible to access someone else’s profile path (excluding administrators), isn’t it? The only logical explanation is that DFS sets another DFS server to active during the (work) session. The question is why?

Not if it’s being changed on the other server by an application running as SYSTEM, such as anti-virus software.

Turn on object access auditing on both nodes, SACL the user files/folders, and see if you can get some details on what’s changing it.
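
On Win2008 the policy side might look something like this (a sketch; on Win2003 enable "Audit object access" via Group Policy instead, and either way you still have to add the auditing entries themselves on the user folders via the Security > Advanced > Auditing tab):

REM Enable file system object access auditing (Vista/Win2008 and later)
auditpol /set /subcategory:"File System" /success:enable /failure:enable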

9 years ago

cortezj1_1

Quick question. I’m pre-staging from the master, and the root has localmachine\administrators. When it copies to the new machine (a domain controller) it switches to domain\administrators. Is this going to give me the dreaded 4412 error?