Top 10 Common Causes of Slow Replication with DFSR

Hi, Ned again. Today I’d like to talk about troubleshooting DFS Replication (i.e. the DFSR service included with Windows Server 2003 R2, not to be confused with the File Replication Service). Specifically, I’ll cover the most common causes of slow replication and what you can do about them.

Let’s start with ‘slow’. This loaded word is largely a matter of perception. Was DFSR once much faster, and you’ve watched it degrade over time? Has it always been too slow for your needs and you’ve finally gotten fed up? What will you consider acceptable performance, so that you know when you’ve fixed it? There are some methods we can use to quantify what ‘slow’ really means:

· DFSMGMT.MSC Health Reports

We can use the DFSR Diagnostic Reports to see how big the backlog is between servers and if that indicates a slowdown problem:

The generated report will tell you sending and receiving backlogs in an easy to read HTML format.

· DFSRDIAG.EXE BACKLOG command

If you’re into the command line, you can use the DFSRDIAG BACKLOG command (with options) to see how far behind servers are in replication and whether that indicates a slowdown. Dfsrdiag is installed when you install DFSR on the server. So for example:
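The names below are hypothetical placeholders (replication group "TestRG", replicated folder "Data", and a sending/receiving server pair), not values from the original post:

```shell
REM Show the backlog of files waiting to travel from 2003SRV13 to
REM 2003SRV16 for replicated folder "Data" in replication group "TestRG".
dfsrdiag backlog /rgname:TestRG /rfname:Data /sendingmember:2003SRV13 /receivingmember:2003SRV16
```

Swap in your own replication group, folder, and member names; the backlog is always measured one direction at a time, from the sending member to the receiving member.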

This command shows up to the first 100 file names and also gives an accurate snapshot count. Running it a few times over an hour will give you some basic trends. Note that hotfix 925377 resolves an error you may receive when continuously querying backlog, although you may want to consider installing the more current DFSR.EXE hotfix, 931685. Review the recommended hotfix list for more information.

· Performance Monitor with DFSR Counters enabled

DFSR updates the Perfmon counters on your R2 servers to include three new objects:

DFS Replicated Folders

DFS Replication Connections

DFS Replication Service Volumes

Using these allows you to see historical and real-time statistics on your replication performance, including things like total files received, staging bytes cleaned up, and file installs retried – all useful in determining what true performance is as opposed to end user perception. Check out the Windows Server 2003 Technical Reference for plenty of detail on Perfmon and visit our sister AskPerf blog.

· DFSRDIAG.EXE PropagationTest and PropagationReport

By running DFSRDIAG.EXE you can create test files then measure their replication times in a very granular way. So for example, here I have three DFSR servers – 2003SRV13, 2003SRV16, and 2003SRV17. I can execute from a CMD line:
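A sketch using those three servers – the replication group and folder names here are hypothetical, and the test-file name is arbitrary:

```shell
REM Create a propagation test file in replicated folder "Data" of
REM replication group "TestRG" on this member, then (after giving it
REM time to replicate) generate an XML report of arrival times.
dfsrdiag propagationtest /rgname:TestRG /rfname:Data /testfile:proptest1

REM ...a few minutes later:
dfsrdiag propagationreport /rgname:TestRG /rfname:Data /testfile:proptest1 /reportfile:c:\proptest.xml
```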

So around two minutes later our file showed up. Incidentally, this is something you can do in the GUI on Windows Server 2008 and it even gives you the replication time in a format designed for human beings!

Based on the above steps, let’s say we’re seeing a significant backlog and slower than expected replication of files. Let’s break down the most common causes as seen by MS Support:

1. Missing Windows Server 2003 Network QFE Hotfixes or Service Pack 2

Over the course of its lifetime there have been a few hotfixes for Windows Server 2003 that resolved intermittent issues with network connectivity. Those issues generally affected RPC, and DFSR (which relies heavily on RPC) became a casualty. To close these loops you can install KB938751 and KB922972 if you are on Service Pack 1 or 2. I highly recommend (in fact, I pretty much demand!) that you also install KB950224 to prevent a variety of DFSR issues – this hotfix should be on every Win2003 computer in your company.

2. Missing DFSR Service’s latest binary

The most recent version of DFSR.EXE always contains updates that not only fix bugs but also generally improve replication performance. We now have a KB article that we keep up to date with the latest files we recommend running for DFSR:

3. Outdated NIC and storage drivers

You would never run Windows Server 2003 with no Service Packs and no security updates, right? So why run it without updated NIC and storage drivers? A large number of performance issues can be resolved by keeping your drivers current. Trust me when I say that vendors don’t release new binaries, at heavy cost to themselves, unless there’s a reason for them. Check your vendor web pages at least once a quarter and test, test, test.

Important note: If you are in the middle of an initial sync, you should not reboot your server! All of the above fixes require reboots. Wait it out, or accept the risk that you may need to run through initial sync again.

4. DFSR Staging directory is too small for the amount of data being modified

DFSR lives and dies by its inbound/outbound Staging directory (stored under <your replicated folder>\dfsrprivate\staging in R2). By default, it has a 4GB elastic quota set that controls the size of files stored there for further replication. Why elastic? Because experience with FRS showed us having a hard-limit quota that prevented replication was A Bad Idea™.

Why is this quota so important? Because as long as Staging stays below its high watermark – 90% of quota by default – DFSR will replicate at its maximum concurrency of 9 files (5 outbound, 4 inbound) for the entire server. If the staging quota of a replicated folder is exceeded, then depending on the number of files currently being replicated for that replicated folder, DFSR may slow replication for the entire server until staging drops below the low water mark, which is computed by multiplying the staging quota by the low water mark percentage (60% by default). For example, with the default 4 GB (4096 MB) quota, cleanup runs until staging falls below 4096 × 0.60 ≈ 2458 MB.

If the staging quota of a replicated folder is exceeded and the current number of inbound replicated files in progress for that replicated folder exceeds 3 (15 in Win2008), then one task is used by staging cleanup and the three (15 in Win2008) remaining tasks wait for staging cleanup to complete. Since there is a maximum of four (16 in Win2008) concurrent tasks, no further inbound replication can take place for the entire system.

If the staging quota of a replicated folder is exceeded and the current number of outbound replicated files in progress for that replicated folder exceeds 5 (16 in Win2008), then the RPC server cannot serve any more RPC requests: the maximum number of RPC requests processed at the same time is five (16 in Win2008), and all of them are waiting for staging cleanup to complete.

You will see DFS Replication events 4202, 4204, 4206 and 4208 about this activity, and if it happens often (multiple times per day) your quota is too small. See the section "Optimize the staging folder quota and replication throughput" in the Designing Distributed File Systems guidelines for tuning this correctly. You can change the quota using the DFS Management MMC (dfsmgmt.msc): select Replication in the left pane, then the Memberships tab in the right pane, double-click a replicated folder, and select the Advanced tab to view or change the "Quota (in megabytes)" setting. Your event will look like:

Event Type: Warning
Event Source: DFSR
Event Category: None
Event ID: 4202
Date: 10/1/2007
Time: 10:51:59 PM
User: N/A
Computer: 2003SRV17
Description: The DFS Replication service has detected that the staging space in use for the replicated folder at local path D:\Data\General is above the high watermark. The service will attempt to delete the oldest staging files. Performance may be affected.

5. The replication schedule or bandwidth throttling is too restrictive

If your replication schedule on the Replication Group or the Connections is set not to replicate from 9 to 5, you can bet replication will appear slow! If you’ve artificially throttled the bandwidth to 16Kbps on a T3 line, things will get pokey. You would be surprised at the number of cases we’ve gotten here where one administrator called about slow replication and it turned out that one of his colleagues had made this change and not told him. You can view and adjust these settings in DFSMGMT.MSC.

You can also use the Dfsradmin.exe tool to export the schedule to a text file from the command-line. Like Dfsrdiag.exe, Dfsradmin is installed when you install DFSR on a server.
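For example – the group name is hypothetical, and the sub-command syntax here is from memory, so confirm it with `dfsradmin rg export sched /?` on your server:

```shell
REM Export the replication schedule of replication group "TestRG"
REM to a text file for review.
dfsradmin rg export sched /rgname:TestRG /file:c:\schedule.txt
```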

The output is concise but can be unintuitive. Each row represents a day of the week, and each column represents an hour of the day. A hex value (0-F) represents the bandwidth setting for each 15-minute interval in the hour: F = Full, E = 256M, D = 128M, C = 64M, B = 32M, A = 16M, 9 = 8M, 8 = 4M, 7 = 2M, 6 = 1M, 5 = 512K, 4 = 256K, 3 = 128K, 2 = 64K, 1 = 16K, 0 = no replication. The values are in megabits per second (M) or kilobits per second (K).
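To illustrate how a row reads – this is a constructed example, not actual tool output, and it is shown in four-character (hourly) groups for readability – a day throttled to 64 Mbps (hex C) from 09:00 to 17:00 and unthrottled the rest of the time would look like:

```
FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF FFFF CCCC CCCC CCCC CCCC CCCC CCCC CCCC CCCC FFFF FFFF FFFF FFFF FFFF FFFF FFFF
```

Nine hours of F (midnight to 09:00), eight hours of C (09:00 to 17:00), then seven hours of F (17:00 to midnight), at four 15-minute intervals per hour.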

And a bit more about throttling: DFS Replication does not perform bandwidth sensing. You can configure DFS Replication to use a limited amount of bandwidth on a per-connection basis, and DFS Replication can saturate the link for short periods of time. Also, the bandwidth throttling is not perfectly accurate, though it may be “close enough.” This is because we throttle bandwidth by throttling our RPC calls. Since DFSR is about as high as you can get in the network stack, we are at the mercy of various buffers in lower levels of the stack, including RPC. The net result is that if you analyze the raw network traffic, it will tend to be extremely ‘bursty’.

6. Large amounts of sharing violations

Sharing violations are a fact of life in a distributed network – users open files and take exclusive write locks in order to modify their data. Periodically those changes are written within NTFS by the application and the USN change journal is updated. DFSR monitors that journal and will attempt to replicate the file, only to find that it cannot because the file is still open. This is a good thing – we wouldn’t want to replicate a file that’s still being modified, naturally.

With enough sharing violations, though, DFSR can start spending more time retrying locked files than replicating unlocked ones, to the detriment of performance. If you see a considerable number of DFS Replication event log entries for 4302 and 4304 like the one below, you may want to start examining how files are being used.

Event ID: 4302
Source: DFSR
Type: Warning
Description: The DFS Replication service has been repeatedly prevented from replicating a file due to consistent sharing violations encountered on the file. A local sharing violation occurs when the service fails to receive an updated file because the local file is currently in use.

Many applications can create a large number of spurious sharing violations because they create temporary files that shouldn’t be replicated. If those files have a predictable extension, you can prevent DFSR from trying to replicate them by setting an exclusion in DFSMGMT.MSC. The default file filter excludes ~*, *.bak, and *.tmp, so, for example, Microsoft Office temporary files (~*) are excluded by default.

Some applications will allow you to specify an alternate location for temporary and working files, or will simply follow the working path as specified in their shortcuts. But sometimes, this type of behavior may be unavoidable, and you will be forced to live with it or stop storing that type of data in a DFSR-replicated location. This is why our recommendation is that DFSR be used to store primarily static data, and not highly dynamic files like Roaming Profiles, Redirected Folders, Home Directories, and the like. This also helps with conflict resolution scenarios where the same or multiple users update files on two servers in between replication, and one set of changes is lost.

7. RDC has been disabled over a WAN link.

Remote Differential Compression is DFSR’s coolest feature – instead of replicating an entire file like FRS did, it replicates only the changed portions. This means your 20MB spreadsheet that had one row modified might only replicate a few KB over the wire. If you disable RDC, though, changing any portion of a file’s data will cause the entire file to replicate, and if the connection is bandwidth-constrained this can lead to much slower performance. You can set this in DFSMGMT.MSC.

As a side note, in an extremely high bandwidth (Gigabit+) scenario where files are changed significantly, it may actually be faster to turn RDC off. Computing RDC signatures and staging that data is computationally expensive, and the CPU time needed to calculate everything may actually be slower than just moving the whole file in that scenario. You really need to test in your environment to see what works for you, using the PerfMon objects and counters included for DFSR.

8. Incompatible Anti-Virus software or other file system filter drivers

It’s a problem that goes back to FRS and Windows 2000 in 1999 – some anti-virus applications were simply not written with the concept of file replication in mind. If an AV product uses its own alternate data streams to store ‘this file is scanned and safe’ information, for example, it can cause that file to replicate out even though to an end-user it is completely unchanged. AV software may also quarantine or reanimate files so that older versions reappear and replicate out. Older open-file Backup solutions that don’t use VSS-compliant methods also have filter drivers that can cause this. When you have a few hundred thousand files doing this, replication can definitely slow down!

You can use Auditing to see if the originating change is coming from the SYSTEM account and not an end user. Be careful here – auditing can be expensive for performance. Also make sure that you are looking at the original change, not the downstream replication change result (which will always come from SYSTEM, since that’s the account running the DFSR service).

There are only a couple of things you can do about this if you find that your AV/backup software filter drivers are at fault:

9. File Server Resource Manager (FSRM) screens are blocking replicated files

So insidious! FSRM is another component that shipped with R2; it can be used to block file types from being copied to a server, or to limit the quantity of files. It has no real tie-in to DFSR, though, so it’s possible to configure DFSR to replicate all files while FSRM prevents certain files from being replicated in. Since DFSR keeps retrying, this can lead to backlogs and situations where so much time is spent retrying files that can never move that, as a consequence, files that could move are slowed down.

When this is happening, debug logs (%systemroot%\debug\dfsr*.*) will show entries like:

… someone has configured FSRM using the default Audio/Video template, which blocks MP3 files, and it happens to apply to the c:\sharedrf folder we are replicating. To fix this we can do one or more of the following:

Make the DFSR filters match the FSRM filters

Delete any files that cannot be replicated due to the FSRM rules.

Prevent FSRM from actually blocking by switching it from “Active Screening” to “Passive Screening” by using its snap-in. This will generate events and email warnings to the administrator, but not prevent the files from being moved in.

10. Data was not pre-staged effectively before initial replication

When pre-staging data on a new member before initial replication, use a tool that copies security correctly without otherwise modifying the files:

XCOPY.EXE – Xcopy with the /X switch will copy the ACL correctly and not modify the files in any way.

Windows Backup (NTBACKUP) – The Windows Backup tool by default will restore the ACLs correctly (unless you uncheck the Advanced Restore Option for Restore security setting, which is checked by default) and not modify the files in any way. [Ned – if using NTBACKUP, please examine guidance here]

I prefer NTBACKUP because it also compresses the data and is less synchronous than XCOPY or ROBOCOPY [Ned – see above]. Some people ask, ‘Why should I pre-stage? Shouldn’t DFSR just take care of all this for me?’ The answer is yes and no: DFSR can handle this, but when you add in the overhead of effectively every file being ‘modified’ in the database (they are new files as far as DFSR is concerned), a huge volume of data may lead to slow initial replication times. If you take out all the heavy lifting and let DFSR just maintain, things may go far faster for you.
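As a sketch, pre-staging with either tool might look like this (the server and path names are examples, not values from the original post):

```shell
REM Copy D:\Data to the second member over the network. /E copies
REM subdirectories including empty ones, /H copies hidden and system
REM files, /K keeps attributes, and /X copies audit settings and ACLs.
xcopy D:\Data \\2003SRV16\D$\Data /E /H /K /X

REM Or: capture the data into a BKF with NTBACKUP, copy the BKF to the
REM remote server, and restore it there with security intact.
ntbackup backup D:\Data /j "DFSR pre-stage" /f C:\prestage.bkf
```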

That’s a bug that was fixed in DFSR.EXE about a year ago. If you are still seeing this issue with Service Pack 2 installed or with latest DFSR.EXE (see http://support.microsoft.com/kb/931685) please let me know!

-Ned

11 years ago

Bobbi

Hi Ned,

Great information. I have downloaded the hotfixes you mentioned because I have the same problem as Patrik with Excel files.

I have two file servers, each running Windows 2003 Standard R2 with Service Pack 2. They are replicated, with one of them being the primary.

In your information you listed the \\servername\directory\DfsrPrivate\ConflictAndDeleted files. I have one per directory mount point, and the one in this directory is taking up 3.14 GB of disk space. The files go back to when we initially installed DFSR and continue to today’s date, so I don’t think it is going to automatically clean itself up. How do I clean these directories up so I can have my disk space back?

Reading Patrik’s description of the Excel file problem, I would first want to rule out that we aren’t just dealing with file conflicts. If a file is updated on two servers before the file can get in sync again, DFSR handles that as a conflict, and the file that loses the conflict is moved to DfsrPrivate\ConflictAndDeleted in the root of the replicated folder on one of the servers, and it is renamed to filename-GUID-version.

You can test this with a command like:

echo foo > \\std1\d$\data\test.xls & echo foo > \\std2\d$\data\test.xls

In that command, std1 and std2 are DFSR members replicating the folder D:\Data. The command creates the files simultaneously on both servers, which results in a conflict that is logged as Event ID 4412 on one of the servers.

Event Type: Information

Event Source: DFSR

Event Category: None

Event ID: 4412

Date: 10/18/2007

Time: 10:40:25 AM

User: N/A

Computer: STD1

Description:

The DFS Replication service detected that a file was changed on multiple servers. A conflict resolution algorithm was used to determine the winning file. The losing file was moved to the Conflict and Deleted folder.

Additional Information:

Original File Path: D:\Data\test.xls

New Name in Conflict Folder: test.xls-{E3716117-034F-4998-A151-40DB382A4E4F}-v16188

Replicated Folder Root: D:\Data

File ID: {E3716117-034F-4998-A151-40DB382A4E4F}-v16188

Replicated Folder Name: Data

Replicated Folder ID: 6939148D-3D46-4EDF-93FB-525061A91F2F

Replication Group Name: TESTRG2

Replication Group ID: F42975DB-33C5-4BC3-86E6-CAC21EF374E5

So first try to determine if these are just conflicts, and if not, we’d like to hear a detailed description of how the problem is reproduced in your environment.

For Bobbi’s second question, there is a WMI method, CleanupConflictDirectory, that can be used to purge the ConflictAndDeleted directory.

First you want to determine the GUID of the replicated folder whose ConflictAndDeleted folder you want to purge. This can be done with WMIC or Dfsradmin, but Dfsradmin is simpler.

dfsradmin rf list /rgname:testrg /attr:rfname,rfguid

In that command "testrg" is the name of the replication group that contains the replicated folder you are looking for.

Then you use the rfguid in a WMIC command to call CleanupConflictDirectory:

wmic /namespace:\\root\microsoftdfs path dfsrreplicatedfolderinfo where "replicatedfolderguid='5B2BAE34-102B-4057-B8E5-EFE346D1FF19'" call cleanupconflictdirectory

In the DFSR debug log (%windir%\debug\dfsr####.log) that will look like this –

Also, regarding the ConflictAndDeleted folder – I was assuming you had tried this, but I’ll mention it anyway. If you double-click the replicated folder on the Memberships tab in dfsmgmt.msc and go to the Advanced tab, you can reduce the Conflict and Deleted quota to as low as 10 megabytes. So another way to purge is to set that to 10 MB and restart the service; it will purge down to the low water mark of 60% of 10 MB.

So that is a GUI method, but it appeared that a service restart was needed for it to take effect immediately – although I imagine that if I waited long enough, the cleanup thread would run and take the new 10 MB quota into account.

But the CleanupConflictDirectory WMI method works instantly.

11 years ago

Alasdair

Excellent article! Answered a lot of questions I had on DFSR.

I had been pre-staging using Robocopy but thought I’d try the Windows Backup instead having read the blog.

This issue is typically caused by an invalid registry value in the Restore subkey for the DFSR service. Look at:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Dfsr\Restore

There will be a subkey named for the year-date-time the restore was done, with two values. One of those values will be the network name that was used to perform the remote restore.

– Back up and delete the restore subkey

– Restart DFSR (if it won’t stop, restart the machine).

– After the registry value is removed, the service will start and stop normally.

More Information

================================

When a restore is done to a DFSR server, a registry subkey and a few values are added to the registry on the target system so that DFSR can process the restore. A good entry must use a local drive letter. It should look like this:

You will notice the difference: the drive letter on the local restore as opposed to the e$ on the network restore. DFSR does not know how to process e$ and as a result cannot continue. It will sit and wait for the registry key to be corrected.

All replication will stop until this key is deleted.

To prevent this from happening in the future either perform the restore of the data on the target machine or delete the registry value after performing the network restore and restart the service.

This has been fixed in the next OS release, Windows Server 2008.

Let me know if this doesn’t take care of it!

-Ned

11 years ago

Alasdair

Hi Ned, appreciate your prompt feedback.

I actually ended up phoning MS support and the chap there told me exactly the same. So having deleted that key and restarting the services all is fine again!! 🙂 (that is a big load of my mind).

What I like even more was when I asked about why it happened he agreed that it was a bug, called me back a few moments ago and told me I wasn’t going to be charged for my support call.

So that has made my day! However it might be nice if this "known" bug was documented somewhere to save others having the same headache… Of course now a search should bring them to this thread, so all’s well that ends well.

Thanks again.

Alasdair.

11 years ago

turg77

Hi Ned,

I keep getting this when I do a dfsrdiag backlog check:

[WARNING] Found 2 <DfsrReplicatedFolderConfig> objects with same ReplicationGroupGuid=1CF848D4-0F43-4334-A5F7-0EF85F0754F5 and ReplicatedFolderName=departments;

It’s likely that the local XML cache for DFSR has some duplicate entries. Try this:

1) On DFSR server that has the errors from the output run DFSRDiag POLLAD.

2) Stop the DFS Replication service

3) Go to the drive that holds the replica_ files for the RG, such as F:\System Volume Information\DFSRConfig, and rename the replica_*.xml files to replica_*.old

4) Go to C:\System Volume Information\DFSRConfig and rename the Volume_*.xml files to Volume_*.old

5) Start the DFS Replication service

Check in the replica_ drive (i.e. F:\System Volume Information\DFSRConfig) and C:\System Volume Information\DFSRConfig for the new XML files, and check the registry at HKLM\System\CurrentControlSet\Services\DFSR\Access Checks\Replication Groups for the values pertaining to the RG, as well as HKLM\System\CurrentControlSet\Services\DFSR\Parameters\Replication Groups.
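Steps 2 through 5 above can be sketched as follows (the drive letters are examples from this thread; the System Volume Information folder is protected, so run this in an elevated context with appropriate permissions):

```shell
REM Rename the cached DFSR XML config files so the service rebuilds them.
net stop dfsr
ren "F:\System Volume Information\DFSRConfig\replica_*.xml" "replica_*.old"
ren "C:\System Volume Information\DFSRConfig\Volume_*.xml" "Volume_*.old"
net start dfsr
```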

Re-run the DFSRDiag commands to verify the fix.

Let me know how this works out.

-Ned

11 years ago

turg77

Hi Ned,

Very good! Thank you much… I thought that was it, but I wasn’t sure it was safe to play with those files.

Sorry, one more thing… I ran into some replication issues when a drive failed. When I restarted the dfs service I had some weird algorithm issues. The files that were kept shouldn’t have been. The modified date was newer on the ones moved to "Conflict and Deleted." Anyway, I’ve run the latest dfsr.exe hotfix and decided to pre-stage to get everything back in order. Is there a way to clear the backlog so dfs starts fresh after a pre-stage?

Hmmm – are you using Trend Micro Officescan 7.X? We’ve seen issues where older files would get reanimated with that application running.

If you want to start fresh and remove your backlog, you can remove the replica set, get your ‘master’ data onto one box, then use NTBACKUP to create a BKF of it, move or delete the ‘bad’ data off the other server(s), then copy the BKF out to them and restore the data to the correct spot. Then you create the replica and choose the ‘master’ server as primary – the data should all sync up, and since it’s identical there shouldn’t be a long period before initial replication is done and you’re back in business.

If this all sounds nutso and dangerous, don’t hesitate to open a case with us here for backend support.

That was an interesting question. After a bit of source code review I can say definitively that this is not possible and RPC encryption cannot be disabled.

– Ned

11 years ago

Xav

Hello Ned,

Thanks for this really usefull and powerful article.

sorry for my english, but I’m French 😉

anyway, here is my question: I’m running Windows 2003 R2 SP2 with the latest patches. I decided to upgrade the NOD32 antivirus from version 2.7 to version 3, and I immediately started getting a couple of bad messages in the system event viewer (ID 14530) saying, more or less: DFS could not access its private data from the Active Directory. Please manually check network connectivity, security access, and/or consistency of DFS information in the Active Directory. This error occurred on root Company.

After the reboot, this message disappears, but I noticed that CPU usage went to 50%, used by the dfsr.exe process. After 3 or 4 hours the server was not available: it was not possible to print, to access files, or even to connect to the server physically. A hard reboot was necessary.

After uninstalling the AV it was still the same. Finally, I applied hotfix 931685, and it seems the server is now accessible 24/7, but still with dfsr.exe occupying 50% of the CPU (on an HS21 blade).

After investigating, I noticed that dfsr00100.log, which seems to be the active log file, is full of strange messages like this:

If there are a ton of different files listed in the debug log with that error (which I have not seen before – always just one file), you will need to hunt them down as well.

Bonne chance!

-Ned

11 years ago

Xav

Hi Ned,

You know what? Thanks a million 😉

I followed your suggestion and now it’s perfect: the processor went back to 0-5% and dfsr.exe is running normally. In addition, the log files now contain "normal" data.

It was one file – the one you talked about. In fact, I had tried to delete it before, but without stopping the DFSR service first; that’s why it didn’t work.

Now, I certainly have to reinstall the anti-virus, but I’m not so confident 😉

Thanks again for your help: it saved a lot of time and stress.

Xav

11 years ago

Agim

Hi,

I just found this link, and I wanted to ask something about DFSR if possible.

I’m replicating files between 2 sites. In one direction it goes just fine, but when I start replicating in the other direction, it starts and then at some moment gets completely stuck. The staging area is big enough. When I run BACKLOG on one of the replication groups, there is only 1 text file there, and it doesn’t go. We have enough bandwidth – 4 Mbps, and it is almost empty. Servers on both sides are completely updated, even with the DFSR.exe fix.

On one side is Windows 2003 R2 Ent 64-bit, and on the other side Windows 2003 R2 Std 32-bit (this is the server that doesn’t replicate).

When I run a Diagnostics report, it says everything is OK, and there is not a single error in the DFS log.

There’s no impact – DFSR uses RPC for all replication work, SMB is not used in any way (not even named pipes).

11 years ago

Tom Bell

Ned

I have 10 replicated folders in one replication group and I would like to move the staging directories for all 10 to a central location, for example, E:\Staging. What’s the best way to do this, and are there any issues that I should be aware of? Thanks

You should not share the same staging directory, if that’s what you’re meaning. So:

e:\staging <– bad

e:\staging\rf1 <– good

e:\staging\rf2 <– good

e:\staging\rf3 <– good

<etc>

Configuring the staging path to be the same for all replicated folders may lead to some problems during staging cleanup. We do not support this configuration even though it may seem to work. We’ve had some cases where this was done and there were bizarre parent-child relationship failures and blocked replication. Not fun to fix.

As far as changing it – you can just do it through DFSMGMT.MSC and it will all get created and used automagically. Once it has taken effect (after AD replication converges and DFSR polls), you can delete the old staging folders. Changing the staging path does not automatically move the contents to the new folder, though, so you may see slightly slower replication and reduced RDC efficiency for a while until staging starts filling again.

– Ned

11 years ago

Tom Bell

Thanks a lot Ned!

So, I take it that the existing content of the staging directories does not get moved to the new staging locations.

So, from a Best Practice Perspective, if you had to choose between keeping the staging directories in their default location or moving them to a new location (since each will need its own staging directory after all), which one would you recommend? Thanks

I’m currently replacing branch office file servers and at the same time starting to use DFS-R to get data back to a central site. Historically we’ve used Robocopy to move data from the old server to the new server (security and all) because of the /mir capability. That works nicely because you can re-sync prior to the swap-out very quickly. BTW, we’re going to Server 2008.

I came across this post that says Robocopy has a bug that prevents you from copying security correctly. I’m going to open a ticket with MS, but thought I would post here with my 2 cents. You recommend using xcopy… but Robocopy is now built in (finally) to the OS in Vista and 2008. I just typed xcopy /? at a cmd prompt and what appeared: "NOTE: Xcopy is now deprecated, please use Robocopy."

It’s not that Robocopy completely fails to copy security; it’s that it sets the inheritance bit in such a way that the MD5 checksum of the file changes. So while security works fine, apps that compare checksums will think the files are different.

Feel free to press for the fix in Robocopy if you have a Premier contract, though (do not bother if you are calling in a credit-card case; those cannot be escalated to bugs). The more contracted customers that call in on this issue, the more likely we are to cross the bar for a fix. I have also restarted this discussion internally to see if we can get more traction for a fix in 2008 and Win7.

We do indeed have a premier contract so I figure it is worth a quick low priority web ticket to let Microsoft know that it affects customers.

I tried using Robocopy and it works fine, the only bad thing is it spams the log with conflict file messages (for every file).

For migrating file servers, it’s hard to beat robocopy with a /mir command so that you sync the bulk of the data prior to a switch out and then run it one more time once you take access away. Xcopy just doesn’t fit the bill for that type of operation.

Thanks for the great article and response. DFS-R is a quite impressive technology.

11 years ago

Reead

Hi

1) Is it possible to view files in the replication queue or currently being replicated? Any free tools on the market?

2) For a deleted folder in the DfsrPrivate\ConflictAndDeleted folder, is it possible to know who originally deleted it in the share?

3) Is there software or a built-in tool to see the history of use of a shared folder/file?

If I set up an initial backup-type replication in which I select my branch office as authoritative, and then run a health report, should I expect to see a huge amount of backlogged sending transactions from the backup server (not the branch server)? That scares me, because the data on the backup server is older, which is the reason I want my branch server to be authoritative. These servers are both running 2008.

I’m gun shy here because we had some sort of event on the central backup server last week that seemed to cause a HUGE amount of sending transactions from our central server back to the branch servers. It seemed to affect some servers that were still in initial replication. It is almost as if they forgot that the branch server was authoritative. We have since verified that indeed some old files made their way back to the branch office servers. No one has access to the central backup server, no mass changes were made, no ACL changes, etc. The only thing on the box is FCS agent and Veritas NetBackup.

Don’t know if it is related, but we can’t even stop the dfsr service without it timing out and terminating the process. This has the unfortunate side effect of causing a DB recheck that takes about an hour to run. I’ve double and triple checked limits and such and feel we are well below them. We are replicating 33 servers (each with an inbound and outbound connection, so 66 connections) to this one 64 bit Windows 2008 server. The branch servers are 32 bit. There are approximately 5.5 million files, with very little change rate. The jet database is 2.1 GB, which from reading some posts on here doesn’t seem all that large.

I thought I would pass along something that occurred to me a little too late to help my situation very much.

Branch office to central server collection group. I robocopied the data with /copyall and thus inadvertently changed all the files. You can still use the files to stage, but it will spam your logs with conflict messages and fill up your dfsprivate with conflict files.

Instead of pointing your replication group to those files as prestaged files, simply copy your data to the same volume but do not point to them in your replication group (assuming you have enough space). Doing it this way, dfsr will still use those files as seeds to populate the replication group (and thus still not copy all the data across) but will not spam your log or dfsprivate area.

I believe this approach assumes you have Enterprise edition on one end or the other, so you get that nice cross-file RDC whatchamacallit thing goin’ on.

10 years ago

mkielman

The post above listed the most common causes for replication problems. Regarding #6, I have a situation where one of the servers is no longer receiving updates and the debug.log has a large number of the following entries:

None intrinsic to the SP itself. But you should have the latest versions of DFSR.EXE and NTFS.SYS on both servers to avoid issues that were bugs in both components.

KB944804 and KB948833

– Ned

10 years ago

mkielman

Thank you for your help! It turns out that the sharing violations were causing replication to become backlogged. I excluded the directory that contained all the files that were constantly locked, and replication caught up shortly thereafter. One thing to note: these files were not considered "locked" by the OS because they weren’t open for writing; however, the application was locking these files, which most likely prevented event 4302 from being logged.

10 years ago

mkielman

Ned –

Is there a way to use WMI to obtain the current DFS backlog of a system? I have found the ‘GetOutboundBacklogFileCount’ method but it requires that I pass the VectorName or something. I want this script to be automated and scalable, so it would be ideal if it could run on each individual system and output that system’s backlog, much like "DFSRDiag Backlog" but without the other information.

So this gives a good example of how it works. It also shows what we mean by passing in the VersionVector (as it automatically figures it out). No matter what, you are always going to have to figure out a few details about the servers and topology in question, so if you wanted that to be more automated you would need to modify the script to actually figure all that out (not trivial, but not super hard either).
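Since the goal above is automated backlog monitoring, here is a minimal Python sketch (not from the original thread) that shells out to DFSRDIAG BACKLOG and parses the count out of its text output. The exact "Backlog File Count:" line format is an assumption about the tool’s output and may vary between versions, so treat the parser as a template rather than a definitive implementation:

```python
import re
import subprocess

def parse_backlog_count(output):
    """Extract the backlog file count from DFSRDIAG BACKLOG text output.

    Returns the count as an int, or 0 when the tool reports no backlog.
    The 'Backlog File Count:' line format is an assumption and may
    differ between DFSR versions.
    """
    match = re.search(r"Backlog File Count:\s*(\d+)", output)
    if match:
        return int(match.group(1))
    if "No Backlog" in output:
        return 0
    raise ValueError("unrecognized DFSRDIAG output")

def get_backlog(rgname, rfname, sending, receiving):
    """Run DFSRDIAG BACKLOG for one connection and return the count."""
    cmd = [
        "dfsrdiag", "backlog",
        "/rgname:" + rgname, "/rfname:" + rfname,
        "/smem:" + sending, "/rmem:" + receiving,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return parse_backlog_count(result.stdout)
```

Run on a schedule from each server, this gives a per-connection backlog number you can log or graph over time.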

I want to delegate the right to create namespaces & replication groups in Active Directory to a group of users. I want these users to be able to fully manage the namespaces & replication groups that they create but not the ones that other people have created. How can this delegation be done from within Active Directory system partition? I know how to delegate rights from DFS management console. Thanks

10 years ago

mkielman

Ned –

I am trying to understand if compression is used during initial replication, but I am unclear if that is the case. I understand that RDC is used to only replicate the deltas but that doesn’t affect initial replication unless pre-seeding has been performed. So, my simple question is: Is compression involved with initial replication?

Are you having issues doing this in the DFSMGMT.MSC console, under the Delegation tab? If you create the RG/RF and then add the user/group that contains the specific person(s) who will manage that RG/RF (and don’t add the other users, and those users are not already domain admins), it should just work.

Or are you looking to somehow script this to do this outside DFSMGMT? That can be done with DFSRADMIN.EXE RG DELEGATE.

1. XPRESS compression – this compresses files over 64KB that are not excluded from compression by type. It’s similar to zip, but faster, more linear, and not as efficient.

2. RDC ‘compression’ – I hate that we call this compression, as it’s not compressing files, it’s compressing time and bandwidth. :/ This is (as you point out) how we do block replication of ‘chunks’ of files.

If you pre-seed data, we *will* try to use RDC on those files. We will always use XPRESS when meeting the rules above.
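The XPRESS rule above (compress anything over 64 KB whose type is not excluded) can be sketched as a simple predicate. Note the exclusion list below is illustrative only, not the authoritative DFSR default list:

```python
# Illustrative exclusion list -- file types DFSR would skip for XPRESS
# compression; NOT the authoritative default list.
DEFAULT_EXCLUDED = {".zip", ".jpg", ".mp3", ".cab", ".wma", ".wmv"}

XPRESS_MIN_SIZE = 64 * 1024  # 64 KB threshold from the rule above

def uses_xpress(file_name, size_bytes, excluded=DEFAULT_EXCLUDED):
    """Return True when a file would be XPRESS-compressed on the wire,
    per the rule above: larger than 64 KB and not an excluded type."""
    parts = file_name.lower().rsplit(".", 1)
    suffix = "." + parts[1] if len(parts) == 2 else ""
    return size_bytes > XPRESS_MIN_SIZE and suffix not in excluded
```

So a 500 KB .docx would go over the wire XPRESS-compressed, while a 500 KB .jpg (already compressed by its own format) would not.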

As to your question, we do not yet have published supported limits for 2008; that scalability testing is still ongoing (as you can imagine, it is very time consuming to test replicating massive amounts of data to hundreds of servers). I can say that we have customer-verified field experience with up to 26TB being replicated in Win2008.

– Ned

10 years ago

Tom Bell

Hi Ned

I would like to know more about cross-file RDC. Documentation states that cross-file RDC takes place when a file is on the source but not on the target and a similar file exists on the target. What does similar mean in this case? For instance, does it have to be Excel to Excel, Word to Word, etc. to be considered similar? Thanks

This is a sparse file (read more on MSDN if you like; it will appear to be *huge* on 2003, but that is an Explorer quirk, and its size on disk will actually be quite small usually) which is used to store signature information for all the files that are in the replica set. By traversing this file with heuristics, DFSR can quickly find signatures that match blocks of data for RDC. By matching these signatures up to what the upstream server has sent, it can ‘recycle’ blocks of data from existing files that have matching data bits.

So for example: I create a Word doc. And this Word doc gets passed around for years, getting modified and monkeyed with and whatnot, to the point where various copies have a lot of similarity, but some individual differences. Cross-file RDC can use those parts that didn’t change to save some bandwidth when a later version is replicated. It doesn’t really know about Word; it just knows this file has some binary goo that is similar to binary goo in some other files.
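The block-recycling idea can be illustrated with a toy Python sketch. Real RDC uses rolling hashes and variable-size chunks, so the fixed 4-byte chunks and MD5 digests here are purely for illustration:

```python
import hashlib

CHUNK = 4  # toy chunk size; real RDC uses variable-size chunks via a rolling hash

def signatures(data, chunk=CHUNK):
    """Build a set of per-chunk signatures -- a toy stand-in for RDC signatures."""
    return {hashlib.md5(data[i:i + chunk]).hexdigest()
            for i in range(0, len(data), chunk)}

def reusable_bytes(new_data, seed_signatures, chunk=CHUNK):
    """Count bytes of new_data whose chunks already exist locally, i.e. blocks
    a downstream server could 'recycle' instead of pulling over the wire."""
    reused = 0
    for i in range(0, len(new_data), chunk):
        block = new_data[i:i + chunk]
        if hashlib.md5(block).hexdigest() in seed_signatures:
            reused += len(block)
    return reused
```

With an old copy of b"AAAABBBBCCCC" on hand, receiving b"AAAABBBBDDDD" only needs the final 4 bytes over the wire; the first 8 match existing signatures and can be recycled.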

Hopefully this explained it well,

– Ned

10 years ago

Tom Bell

Thanks Ned. So, will cross-file RDC ever be used in the initial DFS replication? Or does it come into the picture once replication is complete and the authoritative flag is removed from the source?

(Sorry for delay, I had to head out for a family emergency last week).

If the data was pre-seeded, it could be used in initial replication.

– Ned

10 years ago

TSchaid

Ned,

Our implementation includes managing NTFS permissions for all of our remote file servers via PowerShell scripts updating GPOs (File System). DFS-R was complete before this was implemented. Following a refresh of the GPO, the backlog increases to nearly all (or all) of the files on the remote server. Do I have any options?

So if I understand you: you reset all the security on all the replicated files and those files backlogged? And this happens via GPO, so the security is reset every 90-120 minutes?

I’d expect to see a huge backlog if that were the case. Even though the files themselves won’t be replicated, the security metadata will be, and that could take some time on a large number of files – metadata counts as ‘file replication’ in DFSR backlog terms (i.e. there is some difference between two servers that must be reconciled). My advice would be to… not do that. 🙂 Set security less frequently. Or set once and don’t worry about it after that.

– Ned

10 years ago

TSchaid

Wait a minute. Are you suggesting the implementation of the GPO is changing the security metadata each time the GPO is applied to the server even if no actual changes have occurred? Did Microsoft develop Group Policy and DFS-R each in a bubble? How could it be that Group Policy provides a very nice ability to manage file system security and yet this same ability will cause DFS-R to thrash for days? The reason we have gone to an automated approach for file system security is to bring control to this very difficult to control environment. For large orgs with lots of file servers, this is a very daunting task. Regardless of where NTFS permissions are initiated, this seems like it will always be a big deal for DFS-R.

If we applied the same change on both sides, would the results be different?

I am suggesting no such thing. If you have configured GPO and powershell (you give no details here) to reapply security by re-writing the security arbitrarily (i.e. removing the security – that’s a change, setting security – that’s a change), then DFSR will react to whatever the USN journal tells it to.

I suggest you carefully reproduce your scenario in a test environment both with and without your powershell scripts or GPO, whatever those are. We don’t have 1000 customers a day calling us about this issue, so you are likely in a corner case because of how you are implementing things.

10 years ago

Tom Bell

Hi Ned

I am having problem replicating PST files. In a previous posting by Jill Zoeller, she mentioned the following:

"…Outlook 2003. Contact Microsoft Product Support Services to obtain the…"

That post from Jill is regrettable – we do not support replicating PST files that are actively opened from network shares. Even though you can make the PST registry hacks like you mention above, you are in an unsupported position.

Although DFS Replication does not explicitly omit Outlook Personal Folders Files (.PST) from replication, .PST files that are accessed across a network can cause DFS Replication to become unstable or fail. DFS Replication can safely replicate .PST files only if they are stored for archival purposes and are not accessed across the network using a client such as Microsoft Outlook (copy the files to a local storage device before opening them).

(ps: not sure about the registry entry you mentioned, I doubt it’s changed between versions if it’s supported though – you’d have to ask the Office folks)

10 years ago

TSchaid

Ned,

Why would you think PowerShell is in the equation? If I use GPMC and edit a GPO, a change is recorded. This change is then replicated to all DCs. The real question is how the change is applied to the destination server. Is it a rip-and-replace, or is it applying only the changes? By looking at winlogon.log, we see all of the file system security entries from the GPO. My suggestion would then be that it is replacing security on all listed folders.

A quick Google shows at least one other customer who faced this same problem. He, however, was not using GPO; he was simply changing file permissions using the GUI. As for testing in the lab: really, there is no difference.

So my question remains. If I apply the same security settings to the DFS-R destination, will this reduce replication traffic?

Because you said: "Our implementation includes managing ntfs permissions for all of our remote file servers via powershell scripts updating GPOs (File System)". I didn’t know what you meant by that – could be startup scripts deployed by GPO that update security, for example. Remember that I only know what you tell me here, I don’t have any familiarity with your environment.

So I just attempted a repro of this – I created a GPO that ACL’ed my replicated folder. I made sure the GPO was set to ‘replace all existing permissions’ mode. I forced policy to apply – at this point the permissions were a match, and there was no additional replication. Then I manually changed permissions on the replicated folder, and forced policy to apply – security was replicated from that server, as would be expected since it did not match between servers. Then I forced policy to apply again – no replication occurred because the security already matched. Does this match your repro steps? I only see replication when the security does not match, regardless of GPO, as one would expect. If the security matches, nothing happens.

– Ned

10 years ago

Huw

Hi,

Is there a way to delete unwanted files from the ‘Pre-Existing’ folder? I don’t seem to have (or be able to give the administrator) sufficient permission to delete them?

Yes, you would ordinarily just need to be a member of the Administrators group – by default it is ACL’ed with full control on that folder. If not, an administrator would need to give you rights to delete that folder. And if Administrators is not actually set for full control… well, someone has been changing things in there!

– ned

10 years ago

Huw

Thanks Ned, what are the default security settings for the SVI folder? On the 2008 servers I have here SYSTEM full control only with no access to administrators, could be something coming from a group policy?

By default, it’s SYSTEM only for the System Volume Information folder(s) on Win2008. That was done intentionally, as we saw a lot of customers accidentally deleting/damaging security in that folder on Win2003 R2. You will just need to make Administrators own that folder, then add Administrators full control to its contents.

There is also a special folder/file protection for SVI in Win2008. So if you go through Explorer and try to delete files, they will… not delete. You will need to go through a CMD prompt.

10 years ago

Tom Bell

Hi Ned

I would like to know Microsoft’s official position regarding storing DFS replicated data (via DFSR) on an MS cluster? Thanks

No. DFS Replication is not supported as a Cluster service resource. Replicated folders are not supported on shared storage.


– Ned

10 years ago

gysarosi

Hi Ned,

I have questions about the reporting features.

I have a customer who would like to see some statistical information about the replication effectiveness, like:

– The files (in a specified directory) which were replicated,

– The original size of a file,

– The start time of the replication of a file

– The end time (eg. when the file arrived to dest server) of the replication of a file

– The size of the data sent over the wire in bytes

I am wondering if dfsrdiag is capable of creating such a report (or an XML like you produced with the canary file, which I can interpret or XSLT later),

OR

I have to write a solution which processes the logs from different servers and creates the reports.

This is a quick question before I build a virtual environment (lots of time) for testing dfsrdiag. The first answer should be Yes or No.

If the answer is Yes, the dfsrdiag can do this, the second part of the question is this:

Could you specify what parameters I should look for, please?

If the answer is No, I have several logs from my customer, so I will analyze them further (I dug into the logfiles and wrote some regexes for the processing, but it is more complex work with multiple files).

Finally, a bonus question: where is the info in the log files that shows the replicated file’s original size?

Sorry for being gassy,

Gyorgy

10 years ago

acorn

Ned,

Thanks for a very informative post!! I stumbled upon this, however, while looking for a solution for replication that never even gets started. I keep getting this error every few hours, on both replication partners:

Event Type: Error

Event Source: DFSR

Event Category: None

Event ID: 4004

Date: 10/16/2008

Time: 6:05:42 AM

User: N/A

Computer: PGDC

Description:

The DFS Replication service stopped replication on the replicated folder at local path R:\DFS\DFSTest\DFSTest2.

Hi guys. Sorry for the delay in response, I have been out of the office for a couple weeks.

@ Gyorgy –

We don’t have a perfect answer on this. All of that information is in the debug logs when they are set at level 5 verbosity, but parsing them will certainly require you to write some fairly complex string parsing code.
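As a starting point for that string parsing, here is a hedged Python sketch. The log line layout used below is simplified and hypothetical (not the real DFSR debug log format), so the regex must be adapted to actual level-5 log lines:

```python
import re

# Hypothetical, simplified stand-in for a level-5 DFSR debug log line.
# Adapt the pattern to the real log format before using it for analysis.
LINE_RE = re.compile(
    r"^(?P<date>\d{8}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{3}).*?"
    r"name:(?P<name>\S+).*?size:(?P<size>\d+)"
)

def parse_line(line):
    """Return (timestamp, filename, size) from one debug-log line, or None
    when the line does not match the expected layout."""
    m = LINE_RE.search(line)
    if not m:
        return None
    return (m.group("date") + " " + m.group("time"),
            m.group("name"), int(m.group("size")))
```

Feeding every matched tuple into a dictionary keyed by filename, with the earliest timestamp as replication start and the latest as finish, would approximate the per-file report described above.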

It is possible to see the statistics in a ‘meta’ fashion with the DFSR PerfMon counters, but they won’t be specific to a given file.

It is also possible to determine some of the data file-by-file by enabling auditing:

1. Create the following registry *key* (not value):

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Dfsr\Parameters\EnableAudit

2. Enable Object Access Auditing for these servers (via local or domain-based group policy) for SUCCESS.

3. Refresh policy with GPUPDATE /FORCE (there should be no need to restart DFSR or the servers).

4. Replicate a new file from upstream to downstream partner.

5. In Event Viewer | Security Events on the upstream partner, you will see event 7006, 7002. On the downstream partner you will see 7004.

The problem here is not one single system has all this data – the DFSR database has some data, the perfmon objects have some data, the debug logs have some data. That’s why there’s no real easy way to do this.

This sounds very much like you have two drives in this computer with the same volume serial number. This could have happened by breaking a RAID1 mirror, using disk imaging software, etc. Do you *also* see a 6602 event when the DFSR service is restarted, stating ‘The DFS Replication service detected and discarded an inconsistent volume configuration’?

This can be fixed, but it’s a bit scary. I am including the steps, but if you are not 100% confident in following them, I highly recommend opening a support case with us to assist you.

You can change the volume serial number of the disk using a utility called dskprobe.exe (a Support Tools utility).

This should be done on the Server with the DFSR 4004 and 6602 errors.

Before doing this, ensure you have taken a backup of the data on that volume (drive).

This can be done as shown below:

1. Run dskprobe.exe.

2. On the menu options click "Drives", select "Physical Drive". Then choose the physical disk, click Set Active, and click OK.

3. On the menu options click "Drives", select "Logical Volume". Then choose the logical volume, click Set Active, and click OK. (The logical volume is the drive letter on which the staging folders are missing.)

4. On the menu options click "View", select "NTFS BootSector". Notice the box "Serial Number (hex)". This should be the second-to-last box in the leftmost column.

a. The last 8 digits of this number are the existing volume serial number.

b. Change any of the last 8 digits to make this volume serial number unique (i.e. not the same as the drive it was conflicting with).

5. On the menu options click "Sectors", select "Write". You might be prompted to turn off "Read Only" mode. Agree to it.

6. Close the window. Close all open programs and reboot.

7. After reboot, check the volume serial number. It will have changed.

The staging folders should get created automatically and DFSR should resume replication on folders on all drives within a few minutes. The aforementioned events will also no longer appear.
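For the curious, the value dskprobe edits lives in the NTFS boot sector; per commonly published NTFS layouts, the volume serial number is an 8-byte field at offset 0x48. This Python sketch (not part of the steps above, and operating only on an in-memory copy of the sector) illustrates where the value sits; actually writing a disk should be left to dskprobe, and only after the backup described above:

```python
import struct

# NTFS boot sector: 8-byte volume serial number. Offset 0x48 is per
# commonly published NTFS layouts -- verify against your own boot
# sector dump before trusting it.
SERIAL_OFFSET = 0x48

def read_serial(boot_sector):
    """Return the 64-bit volume serial number from a 512-byte NTFS boot sector."""
    return struct.unpack_from("<Q", boot_sector, SERIAL_OFFSET)[0]

def change_serial(boot_sector, new_serial):
    """Return a copy of the boot sector bytes with a new volume serial number.
    This only patches the in-memory copy; it never touches a disk."""
    sector = bytearray(boot_sector)
    struct.pack_into("<Q", sector, SERIAL_OFFSET, new_serial)
    return bytes(sector)
```

The "last 8 hex digits" that dskprobe displays correspond to the low 4 bytes of this 64-bit value.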

Forgot – dskprobe.exe is included with the 2003 Support Tools – download latest ones from microsoft.com

10 years ago

diego.marin

First of all, thank you for the great info above.

Can you possibly help me with the problems I am having below

I have a couple of questions regarding DFS-R

I have two 2008 Core 64-bit file servers using DFS-R.

How do I know if initial synchronization has finished?

How do I know if bandwidth is the cause of slow replication? I have it set up to replicate constantly. I have about 10 replication groups; some of them are up to date all the time, others have around 150 backlogged transactions, but I can’t determine the reason for it. It takes a couple of hours to sync one single file (less than a meg).

My staging folder’s actual size is 3 GB out of 4. Should I double it to 8 GB?

I read that VSS is supported with DFS-R. But is this only the case if VSS snapshots sit on the same drive you are replicating? What if the snapshots are located on different storage? Can you have VSS running at both targets independently?

1. Initial sync is done when the downstream server gets a 4104 event in its event log. You can see this on 2008 Core by using the WEVTUTIL event viewer or by connecting remotely to the event log from a Vista/2008 Full machine.

2. If you run DFSRDIAG BACKLOG <options> once an hour for a couple hours, are the same 100 files always listed? Are those same files showing constant sharing violation events? Do you anti-virus scan your replicated folders? Does HANDLE.EXE (microsoft.com/sysinternals) show some particular application holding those files open all the time? The fact that you already suspect limited bandwidth is telling 🙂 – what are the connections like between servers? Very slow and thin?

3. Increase your staging if you are seeing 4202 DFSR events more than a few times a day. Doubling it is often a good start.

4. VSS snapshots are not replicated. You can definitely snapshot all servers.

10 years ago

diego.marin

Thanks for the quick response.

1. Great, I verified all 16 of my groups finished initial sync.

2. No antivirus installed on these servers yet; planning to do it soon. Should I exclude the DfsrPrivate\Staging folder or any other folders once I do?

On Group A, I have about 3 files with sharing violations for around 6 days now, and about 170 files total (receiving/sending) that are backlogged. Could these 3 files be causing this large backlog and making a file take 1 1/2 hours to replicate? Also, the files with sharing violations are not the same files that are on the backlog.

Also, one of the servers is not being accessed by anybody right now, since I have not enabled the DFS link on it and I am waiting to fix this slow replication issue first; however, this server has a backlog of 145 files pending to be sent. Shouldn’t it be ZERO, since there are no changes to any files on this server? And if I run the backlog report on the other server (as the sending server, where files are actually being changed), the backlog is actually lower: 31 files. Same goes for Group B: the sending backlog is 667 where data is not being changed, and where the data is being changed the backlog is 95. The backlog files seem to be the same for the past hour; I’ll keep an eye on them for the next few hours.

On Group B, I have about 47 files with sharing violations, of which 7 have been there for one day, and about 760 files total (receiving/sending) that are backlogged. The backlog files seem to be the same for the past hour; I’ll keep an eye on them for the next few hours.

The bandwidth between the locations is 6Mbps; I have the replication set to use 4Mbps in each of the 16 groups I have set up.

3. Event 4202 was constantly popping up during initial sync, which is expected, but after that finished it comes up once every 2 or 3 days in the 2 groups that are taking a long time to sync. I don’t believe this is the cause of the slow sync, but I will increase it; it won’t hurt, right?

4. Thanks. So to confirm: I should be able to have VSS enabled for Drive 1 on Server A, have VSS snapshots located on Drive 2 on Server A, replicate data from Drive 1 on Server A to Drive 3 on Server B, and have VSS enabled on Drive 3 on Server B with VSS snapshots located on Drive 4 on Server B.

2. Yes, we recommend you disable AV scanning of DfsrPrivate (and will have an official document on this releasing at some point). Let me know what you see on that backlog after a few hours; it sounds like it’s not to do with the handful of sharing violations.

3. Correct, it will not hurt.

4. Correct.

10 years ago

diego.marin

OK, the same files are on the backlog and they are increasing. It’s almost like DFSR is taking a long break on certain groups but not reporting any real errors. The only errors I see are the ones below; they show up twice a day, but they are followed almost immediately by the message right after, which says everything is back to normal.

This is only affecting a couple of groups, because I’ve verified other groups that have no backlogs whatsoever, and if I copy a couple of multi-meg files, they get synced within seconds. And this is on the same server.

I am tempted to restart the service and see if replication starts back up and the number of backlogged files starts decreasing again.

Also, is my concern above valid, where I have a backlog of sending files on the server where nothing is being changed?

Again, I appreciate your help on this.

The DFS Replication service is stopping communication with partner XXXX for replication group XXXXXXXX due to an error. The service will retry the connection periodically.

Additional Information:

Error: 1726 (The remote procedure call failed.)

Connection ID: E0C605C7-622D-4889-8046-9B87EB52F157

Replication Group ID: 5F87902A-405D-47E5-BC3A-C0ADC76322AA

***************************************************

The DFS Replication service successfully established an inbound connection with partner

As far as a backlog where nothing is being changed – that’s not possible. *Something* is changing files, even if it’s not an expected change. 🙂 You could use Process Monitor or Object Access Auditing to see who and what it is.

10 years ago

diego.marin

I am running Windows 2008 Core x64bit.

I’ll dig deeper and see why there is a sending backlog on the server where nothing is being changed that I know of.

I staged these files using Windows Backup; could that be part of the issue, since I see you crossed that part out in your article?

At this point nothing seems to be replicating on some groups, or it’s taking an extremely long time to do so. Almost as if there were limitations on the scheduling. It looks like it starts by taking a long time, then stops altogether. I will give it a couple more hours tonight, and if I see no change, I’ll restart the service and see if that makes any difference.

10 years ago

diego.marin

It appears that all of the files that are backlogged on the sending member (where nothing has changed) are located in the DfsrPrivate\PreExisting folder and not anywhere else. My understanding is that this data is set aside and not replicated. I am confused :@

Dang, I forgot you said 2008. Do you use HP Gigabit network cards in these machines? If so, you will need to go into the properties of those NICs and turn off *HP’s* built-in scalable network pack pieces, as they are on by default (and our SNP is off by default in 2008).

Ehhh… in the PreExisting folder? That’s bad. That folder is not replicated and DFSR will not replicate it. Are you sure that these are the files? That backlog report does not provide paths, so there may be multiple copies.

You can definitely always move the contents of the pre-existing folder out of the RF. Unless you want to actually restore those files…

10 years ago

diego.marin

It is actually a Dell 1855 Blade Server.

I am pretty sure those are the files, since I did a search on the file name on both servers and it only showed up in the PreExisting folder on the server with the sending backlog (where nothing is being changed, and which was where I staged my data). My next step is to delete the contents of the PreExisting folder.

So, since there was still a backlog and it was growing this morning, I decided to restart the DFSR service on the sending server, where data is being changed, and guess what? The groups I was having problems with seem to be working now; the backlog is now decreasing.

However, there is still a backlog on the server where nothing is being changed. It will probably go away once I delete the contents of the PreExisting folder.

So, it looks like we have 2 problems here

1. We have a server with pre-staged data trying to sync data in the PreExisting folder but failing.

2. We have 16 groups on a server, most of them working, but something happens all of a sudden and stops replication for certain groups; no errors are reported, and replication picks up again once the service is restarted.

Hmmm… if you are still having problems after restarting the DFSR service, at this point we’ve probably reached the end of effectively troubleshooting we can do in a blog comment post. 🙂 I’d recommend you open a case with us at that point so we can do deeper data analysis.

10 years ago

diego.marin

Restarting the service fixed the issue, but I have a feeling it will come back. At least I have an understanding of what the problem is and how to fix it temporarily.

As far as the PreExisting folder goes: after I deleted the contents and restarted the DFSR service, guess what? No more backlog. Another verification that there was something messed up with DFSR and it was trying to do something with those files.

If I decide to open a case with Microsoft, and find anything interesting, I’ll post it here.

Thanks again for your assistance.

10 years ago

rilavery

Fantastic Blog…….

Help….

I am trying to run DFSRDIAG BACKLOG to view files not replicating. However, I keep receiving this error message…

Can you first use DFSRADMIN RF LIST <options> to dump the list of replicated folders for that RG and verify it exists, the spelling, etc?

10 years ago

rilavery

thanks for the post….

I ran the command above and results are below.

D:>dfsradmin rf List /RgName:DFS_FBV

RfName RfDfsPath

DFS-FBV

Command completed successfully.

On this domain we have 4 sites with 2 servers at each site. I do have DFSRDIAG BACKLOG running fine on one site and was trying to implement the same script on each of the others; however, I get the error message posted earlier. I am using the command below, edited for the DFS on each site. Again, thanks for any ideas you come up with.

Users are working in the shared folder on the server. The problem is this: I open the Excel file, make some changes, then save; after a while, when I open the file, the changes are lost.

Can someone help me please?

sorry for my poor english

Event Viewer reports:

Source DFSR

Event ID 4304

The DFS Replication service has been repeatedly prevented from replicating a file due to consistent sharing violations encountered on the file. The service failed to stage a file for replication due to a sharing violation.

The DFS Replication service detected that a file was changed on multiple servers. A conflict resolution algorithm was used to determine the winning file. The losing file was moved to the Conflict and Deleted folder.

We actually had a bug on that years ago (fixed in kb917622). I would recommend that you install the latest DFSR service hotfix and verify that you still have the behavior – if you do, please reply back here.

Both servers are 2003 R2 and all Microsoft updates are installed; I also installed WSUS 3.0.

Thank you once again.

Date: 03/11/2008

Time: 8:46:30 AM

Source: DFSR

Event ID: 4304

The DFS Replication service has been repeatedly prevented from replicating a file due to consistent sharing violations encountered on the file. The service failed to stage a file for replication due to a sharing violation.

That’s a good question – there are definitely a lot of hoops to jump through for a QFE to make it into Windows Update. I’ll ask around and see if we have plans to do this in the future or not.

– Ned

10 years ago

ashujaku

I have configured DFSR between servers. Since there were too many files and problems with replication, I used robocopy to copy all the files to the other side. I got errors and warnings like "file has been changed on multiple servers" and I know that is normal because of robocopy. However, another thing happened on the old server (the server that initially had the updated files). After all was finished (since files were replicating in both directions), the DfsrPrivate folder became very big. At first I didn’t worry about it, since I thought I had to wait for everything to finish, but now I have only 40GB left on that partition. After analyzing, I found out that the Staging folder is 40GB and the PreExisting folder is 250GB. I read somewhere that after using the updated dfsr.exe (which I have had installed since the time I started experiencing problems) some people experience this. An interesting fact is that those files do exist on the server (I haven’t checked them all).

What do I do with them? Do I delete them, or what? I am sure that my organisation was not productive enough to create 250GB of data in a month. I suppose they are all somehow a copy of something.

Another important fact is that they have a modified stamp of 10/11/2008 (which I guess corresponds to the time I worked on the replication a month ago). I would delete them, but I read this article.

The staging folder is as big as you configured the quota – so by default that’s 4GB per replicated folder.

The pre-existing folder only contains files that were not present on the upstream when you set up replication – since the data is not accessible to end users, I suggest you back it up, verify the backup is good, and then delete the pre-existing data.

10 years ago

Anonymous

We’ve been at this for over a year (since August 2007), with more than 100 posts (127 to be exact), so

10 years ago

ashujaku

Ned, you misunderstood me on the issue.

All the folders inside PreExisting (which started to become big after initial replication was finished)

exist on both DFS servers. After initial replication was finished, I put the values back to default (4GB), hoping that with time it would clean up. And it did clean up on the NEW server (in all folders inside DfsrPrivate), but not on the old server. Two folders just became bigger… actually, here is a very funny story.

On the old server I checked the Staging and PreExisting folders a week ago. Staging was 80GB and PreExisting 190GB.

After a week (yesterday), Staging was down to 40GB; PreExisting grew to 250GB.

From what I have read, the Staging and PreExisting folders shouldn’t become bigger after initial replication, especially since they already had the material.

I am sure that I can delete Staging. I also know that when something is in the PreExisting folder you can’t find it somewhere else, but that’s not my case. I see those folders and files outside of PreExisting, which means they are some sort of copy. I am sure that my company didn’t produce 250GB of Word, PDF and Excel files in a month.

I am curious why it continues to happen.

I also have to confess that when I had some problems in the beginning, I stopped DFSR and tried NTFRS,

and stopped that immediately. Then I did a robocopy to copy all content of the OLD SERVER to the NEW SERVER. Since hashing is a known issue, I got that message (file has been changed on multiple servers…). After initial replication I turned all values back to default.

Old server is x32 version and new is x64

Thanks, Ned, for everything. Your blog helped me in the first place; it’s very interesting – you don’t find much about DFSR anywhere else on the net.

Ah, now I understand you better – sorry about that. Yes, that is very interesting. It’s possible that something went very wrong during initial sync (or initial sync never exited, but still with something gone wrong causing a loop) and data was being continuously recreated in PreExisting. I can say that I have never seen this before!

I recommend that you open a support case with us and have some deeper investigation. This is going to require a lot of data analysis that won’t be easy to do through the blog.

10 years ago

Anonymous

Hi, Ned here again. We have put up a new(ish) KB article that will allow you to always see the latest

10 years ago

SvenJ

Hi Ned,

Excellent post! I can see why it’s so popular. I followed all the steps above but things still get slow. The problem is I can’t really put my finger on the issue. Sure, there are a few locked files (about 10 a day). The confusing thing is that one of the namespaces is working fine – and the other one spontaneously decides to wait up to three hours to replicate a 1k file.

The only thing we can imagine is the number of folders in the namespace. When designing the thing, I read that you can use 5,000 folders in a domain-hosted namespace. So I figured that number would be the number of folders I put into the namespace manually and let replicate. Since there are only about 10 of those, I wasn’t worried. Now that we have problems, I think I might have misunderstood that number… and that all subfolders in the folders’ targets are counted, too. In that case we’re in trouble: there are a total of 15k subfolders in there… (don’t ask… some of our users seem to make a folder for each file…).

So: could the number of folders be the problem? They barely ever change, but the sheer volume…?

The 5000-folder limit is for actual DFS namespace folder targets, so that’s not really in play here. If this is Win2003 R2, I would also recommend installing the SNP hotfix, and verifying in your network drivers that the vendor has not turned on their own home-made SNP settings (receive-side scaling, chimney offloading, etc.). http://support.microsoft.com/kb/950224

If you are still having issues after following the whole blog post and that extra piece above, you might want to get a support case with us, as we’ll need to see a lot of data to figure out the issue.

– Ned

10 years ago

SvenJ

Hi Ned,

Now THAT was a fast response. Absolutely fabulous.

So my initial assumption was the right one. Kind of good, kind of bad – that means I need to search further :-/

I already installed all the hotfixes mentioned above, and I also implemented the other post-SP2 hotfixes. I like the idea about the home-made SNP implementations; I’ll look into that.

As usual, Murphy’s Law holds true – right now there’s no backlog and things seem to be working fine – so I can’t tell whether things are OK… or whether it’ll start acting up again next week (it has been like that for a while now, so I suspect the latter).

Hey Ned, slightly off topic (but still DFSR related) do you know if KB961655 applies if you are deleting an entire replication group, and recreating it from scratch but using the same replicated folder name and path?

At the moment we have one replication group per replicated folder, and I’m planning on consolidating these replicated folders in to a single replication group.

The staging folder contains all of your staged files – so everything under ContentSet{GUIDGOO} is the actual files that are currently staged for replication. These hold the RDC signatures.

10 years ago

Tom Bell

Hi Ned

I have 10 branch servers replicating to 1 HUB server. I plan to replace the HUB server with another server in a different location. The existing HUB server will be decommissioned. What’s the best way to point the branch servers to the new HUB server? Thanks

The best way is to add the new server, get it replicating and in sync, then change your replication topology to make it the hub, then remove the old server – using DFSMGMT.MSC.

10 years ago

Tom Bell

Hi Ned

I will need to rename one of the replicated folders in a replication group. Since DFSR does not natively detect a folder renaming & there is no way to point to the new name in DFS management console, what is the best way to go about doing this? Thanks

10 years ago

hockeman

Hi Ned! I have set up a replication group with about 500GB of data and I’m getting the message that the initial replication completed; however, I have a backlog of 602 files that will just not move. Can you provide some insight as to why that may be? I’m running Server 2008. Thanks in advance!!

Are they temporary files? Named .TMP? .BAK? Etc.? There are lots of reasons – the fact that initial sync finished means that either:

1. they did not exist when initial sync was being done.

2. They are not considered valid for replication.

10 years ago

hockeman

Thanks for the response, Ned. They are valid files… .doc and .xls files, and they existed prior to the initial sync. If I add something to one of the folders on one server, it replicates, but if I delete files on the other member, nothing happens. The only thing different about these files compared to other replicated files was the archive bit… does that have an effect?

Also, when I say temporary files, I mean do they have the temporary file attribute set.

The archive bit doesn’t matter.

10 years ago

hockeman

Thanks for your response, Ned. I applied the hotfix to my servers, identified the files with the temporary file attribute, ran the PowerShell command to recursively repair them, and still no luck. I received the following errors in the DFSR health reports early on, but now I’m not getting them anymore. Any help would be GREATLY appreciated!

One or more replicated folders have content skipped by DFS Replication.

DFS Replication does not replicate certain files in the replicated folders listed above because they have the temporary attribute set, or they are symbolic links. This problem is affecting at least 100 files in 1 replicated folder (up to 100 occurrences per replicated folder are reported). Event ID: 11004

Just removing the temporary bit will not cause them to replicate – they need to be ‘touched’ in some meaningful way afterwards to trigger a USN update. A content modification, a security change, a rename, moved out and back in to the replicated folder, etc.
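As a rough, untested sketch of both steps – clearing the temporary attribute and then touching the file so a fresh USN record is written. The path below is a placeholder; whether a timestamp-only touch is “meaningful” enough is my assumption, and a small content or ACL change is the safer bet per the reply above, so try this on a test folder first:

```powershell
# D:\Data is a hypothetical path -- point this at your replicated folder.
Get-ChildItem -Path 'D:\Data' -Recurse |
    Where-Object { -not $_.PSIsContainer } |
    ForEach-Object {
        if ($_.Attributes -band [IO.FileAttributes]::Temporary) {
            # Remove the Temporary attribute...
            $_.Attributes = $_.Attributes -band (-bnot [IO.FileAttributes]::Temporary)
            # ...then update the timestamp so NTFS logs a new USN change.
            $_.LastWriteTime = Get-Date
        }
    }
```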

10 years ago

hockeman

I tried to "touch" each file by changing permissions and the backlog count didn’t go down. Any ideas? Thanks, Ned!

You will need to examine the DFSR debug logs then. Make a change to a file, verify that it did not replicate, then open the %systemroot%\debug\Dfsr*.log files on that server. Find the reference to that file, and see what details the log provides about why the file is not being replicated.
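A sketch of what that search might look like – the file name here is a hypothetical placeholder:

```shell
rem Search all current DFSR debug logs for references to the test file.
rem "canary.txt" is a placeholder name for the file you changed.
findstr /i /s /c:"canary.txt" %systemroot%\debug\Dfsr*.log
```

Note that older debug logs are compressed (Dfsr*.log.gz) and would need to be expanded before they can be searched this way.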

If you’re not comfortable doing this, I’d advise opening a support case with us.

10 years ago

hockeman

Hey Ned! I got it resolved and am fairly certain that it was the temporary file attribute that was causing the backlogged files. I ended up just deleting the replication group and recreating it, and all is well. Now I have another question that I can’t find a definitive answer on. Can I rename a server that is a member of a DFSR replication group? If so, does it trigger any kind of rescan? Any supporting docs would be great if you have them. Thanks again, Ned!!

This will break DFSR, as a number of topology attributes are not updated by renaming the computer object itself. We are toying with the idea of updating KB316826 to show how to do this for 2008 DCs running DFSR for SYSVOL, but Win7 work has us seriously tied up and this is not a common operation (in fact, you are the first person to ever ask me this in years of DFSR).

In the meantime, the safe and approved way is to gracefully remove the server from the replication group, rename it normally, then add it back in (making sure AD replication has converged between all three steps). This will cause replication to do an initial non-authoritative sync on this server, but since you are doing this off hours and very little is likely to have changed in this short time frame, it should be over very fast. Just like using pre-seeded data.

– Ned

10 years ago

Chau Chiem

Hi Ned,

Hope you can offer some advice.

I currently have Windows 2003 R2 DFS set up on several servers. There’s a mix of SP1 and SP2 servers.

DFS replication is happily working right now, but I was wondering if you are able to advise on any DFS hotfixes/updates I should be applying to avoid potential problems in the future.

Secondly, how would I go about handling this situation?

A department would like to dump approx. 50GB of data onto the DFS share. Is there any way I can pre-stage this 50GB of data on the DFS servers and avoid having DFS replicate the full 50GB of data out to all DFS servers?

Or do I need to delete the existing replication group, copy the 50GB of data onto all of the DFS servers via external USB hard disk, then create a new replication group?

It turns out we do have steps. Neither I nor the developers I spoke to were aware of this doc, but one of our tech writers chimed in and that shook the cobwebs free. Even though these steps are for DCs running DFSR for SYSVOL, the same steps would apply for custom replicated folders (with different paths, naturally). So there you go.

Second question: yes, using robocopy with very particular steps. This is documented in another blog post here under ‘pre-seeding’.

10 years ago

hockeman

Hey Ned! Very cool about the rename. I’ll test it in my lab. Regarding the backlog issue I was having, I simply deleted the replication group and re-added it and now it’s good… zero backlogged files. However, now I believe that I have screwed up my replication set by doing the "Big No No" of restarting the DFSR service because of WMI errors. I found this event message on one of the servers and wonder if you could provide some insight as to what may have happened and what I could have done to prevent an entire rescan like the one now happening for all my sets.

Event ID 5014-

The DFS Replication service is stopping communication with partner SERVERNAME for replication group dannenbaum.local\dfsroot\data\myreplicaset due to an error. The service will retry the connection periodically.

Additional Information:

Error: 9036 (Paused for backup or restore)

Connection ID: 4C4497AF-A035-4AA0-BB73-1C58DD479F35

Replication Group ID: A1D6E57C-EED1-4B4B-B5DD-53120BCC466A

10 years ago

Chau Chiem

Hi Ned,

How do i control client DFS referrals for clients with 2 DFS servers?

We have two offices at separate locations, with different IP subnets, i.e. 192.168.1.x and 192.168.2.x.

However, in AD Sites and Services the subnets are under the same site.

Office 1 has a domain controller and is a namespace server. Office 2 has no domain controller and is a namespace server.

Clients at office 2 accessing the DFS share, e.g. \\domain.com.au\share, are going to the office 1 DFS server; I know this by checking the DFS tab when you right-click -> Properties on the DFS folder.

People at Office 1 are happily using the Office 1 DFS server.

I want to direct people at office 2 to use their local DFS server, not the one at office 1.

Hi Chau – this is more of a DFS Namespace question than DFSR. Since the subnets are both defined in the same AD logical site, there is nothing you can do to control the DFS target priority for those branch users. Your IP subnetting needs to match your logical sites, as DFS doesn’t know anything about the physical network.

Can you check to see if the following has happened? BE sets this key, which can get us into some trouble:

Look at

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\DFSR\Restore.

There will be a subkey named with the year-date-time the restore was done, with two values. One of those values will be the network name that was used to perform the remote restore.

Back up and delete the restore subkey, then restart DFSR. The service will most likely hang on shutdown, but it will stop and restart. After the reg value is removed, service start and stop will be normal.

The key was deprecated in 2008.
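A sketch of those steps, assuming the key path described above – treat this as illustrative rather than authoritative, and keep the exported .reg file until you are sure all is well:

```shell
rem Back up the restore key before removing it (key path assumed from above).
reg export "HKLM\SYSTEM\CurrentControlSet\Services\DFSR\Restore" C:\dfsr-restore-backup.reg
reg delete "HKLM\SYSTEM\CurrentControlSet\Services\DFSR\Restore" /f

rem Restart the DFSR service; the stop may hang for a while before completing.
net stop dfsr
net start dfsr
```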

10 years ago

hockeman

You say the key was deprecated in 2008. Does that mean that if we are running Server 2008 we are in the clear for this key? Also, we do not have the key on the DFSR server.

I have two servers both running Windows 2003 R2; the backup server has just recently been promoted to a DC, with the primary one being a member server, and they are located at the same site.

I am using DFS as a backup only. I currently have set up 3 replication groups, 2 of the groups are working fine and replication shows no errors apart from the usual sharing violations.

The 3rd had been working fine for a number of months until recently. The folder in question is quite large, with a stupid amount of files and folders – over 7 million files. Please don’t ask what my users get up to!

For some reason, certain folders directly under the main directory were not being replicated, with no errors shown in the event logs. So I decided to delete the group and re-create it using smaller replication groups. However, I am now receiving the error below, and am unable to find any info on how to solve this problem. The other replication groups are working fine.

The DFS Replication service stopped replication on the replicated folder at local path F:\Company\AllIP\ReadOnly.

Unfortunately, this issue is pretty serious. Your database is damaged and has a stale reference to the replicated folder, and replication is now blocked. The only way to fix this is to rebuild the database itself.

I would recommend you open a support case with us to walk through this carefully in order to minimise your recovery time and not cause any data overwrite/loss issues. If this is not possible let me know and I’ll give you the steps offline through an email (just ping me through the email form on the top of the blog menu); I don’t like having these steps floating around as they tend to get used too much when there are other solutions, usually.

– Ned

10 years ago

eldad

Hi Ned,

First of all, I would like to thank you for all the information in this blog. It is very useful.

I have had a problem lately… with one of the servers.

It is connected to the head office via ADSL… both servers run Win 2003 R2 SP2…

The reports show there is nothing in the queue… all clear… but for some reason the server in the remote office has lately started to consume 200Kbps on a permanent basis… even though there is nothing replicating (I think)… when I stop the DFS Replication service, bandwidth goes down to nearly 0.

On 6/16/08, you stated that we "should not share the same staging directory". I’m a bit confused because the "Additional information about DFS Replication staging folders" section of http://technet.microsoft.com/en-us/library/cc772778.aspx makes it sound as if this is recommended in some cases (see the 2nd bullet point).

Could you please advise what may be wrong. I have a folder that is replicated between 4 servers connected by WAN links. A few days ago I spotted that replication had become very slow. I checked the backlog file count on every server and found that one of them has a very large file queue – around 500k files. On the other servers the backlog file count is around 100-500 files. The replicated folder contains around 600k files in total; how can 500k be in the queue?

If it’s nothing on the above list of 10 items, you should open a support case for further troubleshooting. This will require significant time and data analysis, as well as collecting a few GB of data from you to analyze – basically, out of scope for this blog.

– Ned

10 years ago

turg77

Hi Ned,

We recently moved our 2003 R2 DFS to a native 2008 DFS. I was having no problems with the R2 DFS since upgrading to the newest .exe that you previously recommended. However, now it seems that if there are sharing violations on files for long enough, DFSR moves/deletes the files to ConflictAndDeleted and the users have to call me to recover them. Any ideas?

Hey Ned… For some reason, when using the script that has been provided to check backlog counts, I get duplicate listings when running the following command on my servers for one of my replication groups. When running the command for other groups, the output is as expected (two lines showing backlog counts for two servers). See the example below –

It’s quite the handy script… I just can’t quite figure out why I’m getting duplicates. It’s getting this information about DFSR via WMI, so it has to be duplicated somewhere in the system. I don’t think it’s a problem with the script, because I can run it against any other rep group without a problem. It works PERFECTLY! I’m simply looking to understand where in the system I could look for these duplicate entries.

Do you have the same problem using the WMIC.EXE tool against those three classes – DfsrReplicatedFolderConfig, DfsrReplicatedFolderInfo, and DfsrReplicationGroupConfig? Just to completely rule the script out once and for all, I mean.
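A sketch of those queries – the classes live in the DFSR WMI provider under root\microsoftdfs, though the property lists shown here are just a convenient subset and can be adjusted:

```shell
rem Compare the output of each class against the script's results to
rem spot where the duplicate entry is coming from.
wmic /namespace:\\root\microsoftdfs path DfsrReplicationGroupConfig get ReplicationGroupName,ReplicationGroupGuid
wmic /namespace:\\root\microsoftdfs path DfsrReplicatedFolderConfig get ReplicatedFolderName,ReplicationGroupGuid
wmic /namespace:\\root\microsoftdfs path DfsrReplicatedFolderInfo get ReplicatedFolderName,State
```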

10 years ago

hockeman

Hey Ned, I’m having a hard time reproducing this with the WMIC tool. I’ve sent an e-mail to the MSDN team that developed the script. In the meantime, can you tell me where I may look for something like this? As I said, it doesn’t behave like this for my other replication groups, so although I’m not sure yet, I’m pretty sure it’s not the script. Thanks!

It could be in AD (under the computer’s DFSR-LocalSettings objects or the System DFSR-GlobalSettings objects), could be in the XML files cached locally on the server in the System Volume Information\DFSR\Config folders on each drive, could be in the registry under the DFSR\Parameters key, or could be in the WMI repository itself (which is totally opaque – hence the need to use the WMIC.EXE tool to confirm).

For your first question: that is normal. Your backup software is stopping DFSR in order to run backups. If you don’t want that error, you will need to speak to Legato about changing their software, or invest in a different backup product.

For your second issue, there was a database problem that DFSR fixed automatically. There is no reason to do anything further there.

Hi NedPyle, I would like to thank you for posting all this info. I currently manage 30 DFSR servers with about 3TB of data, all across WAN links. I’ve applied those 2 patches you recommend on every single server. I must say DFS replication is running much better between the servers now. This blog is my DFS Bible 😛

My question: I upgraded a RAID array of about 1TB of data by doing the following.

1. ROBOCOPY /COPYALL… to another location

2. Expand RAID array

3. ROBOCOPY /COPYALL… to original location

I wish I had found this blog before I did that. /COPYALL = evil!!

DFSR is now trying to replicate every single file (1.7 million files). I get the following error…

"The DFS Replication service detected that a file was changed on multiple servers. A conflict resolution algorithm was used to determine the winning file. The losing file was moved to the Conflict and Deleted folder."

Could you please tell us what exact parameters we should be using with ROBOCOPY to copy DFS data back and forth?

/COPYALL is OK as long as the root folder permissions don’t change, causing inherited permissions to change down the tree. If they do, that changes the hash on all the files, and you end up with this situation. It certainly is evil. :-/
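For what it’s worth, a hedged sketch of a pre-seeding copy shape – the paths are placeholders, and the pre-seeding blog post referenced elsewhere in this thread is the authoritative source for the exact parameters:

```shell
rem D:\Source and E:\Target are hypothetical paths.
rem /E copies subfolders, /B uses backup mode, /COPY:DATSO carries
rem data/attributes/timestamps/security/owner for the files themselves,
rem and /XD skips the DfsrPrivate folder.
robocopy D:\Source E:\Target /E /B /COPY:DATSO /R:1 /W:1 /XD DfsrPrivate
```

The key point from the reply above still applies: make sure the destination root’s own permissions (and therefore inheritance) match the source, or the file hashes will differ and conflicts will follow.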

It certainly is good to have the ear of someone who knows what they are talking about!

I’m having a little problem with one of my DFS-R replicated folders. I have 3 servers, all Win2K3 SP2 R2, with 10 replicated folders. Two servers are in our main datacenter and the third is in a DR site. All but one of the folders replicates fine (except for some Excel files, but that’s a different subject). I pulled a DFS health report this morning and saw that one of the folders has a large number of backlogged receiving transactions. I went and looked at the three servers and only one of them has anything in this particular folder – meaning that this folder has never started its initial replication.

Is there a way to force dfs-r to perform this initial replication on just this one folder?

It’s a very open-ended issue. I’d start by setting DFSR debug logging severity to 5 on all servers, then dropping a simple test file named after each server into their respective folders. Then examine the debug logs to see what happens with that canary file on each box – any errors, does it replicate, etc.

Hey Ned, What is the latest version of dfsrs.exe that I should be running on my Server 2008 x86 boxes? Everything has been pretty solid for the past couple of months but I’m about to roll to production and want to be sure that we are all up to date before moving forward. The current version that we are running is 6.0.6001.18000. I’ve looked at the patches for DFSR in the previous article but all of them say "Not applicable" to Server 2008 x86. Thanks!

Whoops! We have a KB article for 2008 and 2008 R2 now that tracks those, but I completely failed to update the above #2 with its link. It’s there now, and here:

KB 968429 – List of currently available hotfixes for Distributed File System (DFS) technologies in Windows Server 2008 and in Windows Server 2008 R2

10 years ago

hockeman

Hey Ned! Do you know of, or can you find out, whether running ‘DFSRDIAG Backlog etc.’ too much can slow down replication? Specifically speaking of the initial replication? I just want to confirm that this tool doesn’t put anything on hold while checking backlog counts. Thank you!

It’s not free – there’s a certain amount of expense when you run that tool, as it has to query the DFSR Jet database for all outstanding backlogged records. It’s not particularly efficient in 2003 or 2008 (it is much more efficient in 2008 R2), hence why it is limited to showing only 100 file names.

Bottom line – don’t run it often if you care about performance, as you will definitely be slowing things down. How much is hard to say – it depends on too many factors.

Hi Ned, I have created several replicated folders and they work great, except I have 3 folders that generate this error: "Pre-existing content is not replicated and is consuming disk space."

I have attempted to delete this information on the target server; however, when I try to access the DfsrPrivate directory in Windows Explorer I receive the message "Access is Denied". I can display the files in the Command Prompt, but can’t delete them on the target computer. I even attempted to take ownership of that directory and its subdirectories, as well as reset permissions, which appeared successful, but when I went back to delete it I still received "Access is Denied". How can I clean out the pre-existing files?

That is how Excel 2003 and older works, as I recall – you will see that even without DFSR (run Process Monitor on your local computer and make some changes in an Excel doc locally). It does this sort of swappy rename behavior. I don’t have Excel 2003 running to confirm this, though. If it’s ending up with the wrong file, we’d need to investigate the debug logs to see where things are going south.

Please open a support case for troubleshooting on that, it will be worth your time.

10 years ago

FLuhm_1

Wow! Just read this whole thread, and I’m feeling enlightened. Thanks for the great info.

I wanted to ask a little more about the Staging folder size configuration for larger files. I did read the perf guide on Technet, etc., but I’m still curious on a simple scenario. Here’s what I’m trying to tune:

Two Win2003 R2 SP2 servers using DFSR for only the purpose of backup data replication to a remote site for DR.

I have a single replication group with both servers included, and have configured the schedule, etc. It seems to work fine, but I have a lot of staging folder cleanup event entries, which raised my concern about performance, etc.

On Friday evening, we add approx. 200GB of backup images to one of the members (ServerA). This 200GB is spread across approx. 14 files (Backup Exec System Recovery, BESR, backup files). We then replicate all weekend to the remote site (ServerB). On Mon-Thurs, we have nightly approx. 3GB of image files spread across approx. 14 files per day. Each day, the files have different names. We have the replication group schedule open FULL overnight to allow for replication during non-business hours. Data always originates on ServerA (after backups are done) and is replicated out to ServerB.

How do I best configure the size of the staging folder for ServerA and ServerB, given that there are only two members and each day the files to replicate have different names? I am confused about the 9 files at a time, down to 1 file at a time when staging is over 100%, etc. I initially thought to follow the info above about ensuring the staging size is greater than the largest files, but I was not sure if I should set a 200+GB staging value. I was concerned about how this might affect free space on the server volume where the replicated folders live.

Dazed and confused on how to best proceed. Thank you for any advice you can offer.

For Win2003 R2, we ordinarily recommend that your staging directory be set at least as large as your 9 largest files. This is because 2003 R2 can replicate 4 files *in* and 5 files *out* concurrently. In 2008 (and soon 2008 R2), we say as large as your 32 largest files (as it will do 16 files in and 16 out concurrently).

For your case, where replication is quasi-one way – i.e. the DR site is never going to originate any changes – you would want:

1. Your ‘main server’ (where files originate) should have its staging be at least the size of your 5 largest files, in order to minimize staging cleanup.

2. Your ‘DR server’ (where files will be received) should have its staging be at least the size of your 4 largest files.

If you have the disk space and an ideal world, the ultimate in staging perfection would be to have the staging space be the same size as all your data. Disk space has gotten pretty cheap (I saw a 1TB drive at Best Buy last week for $120 – ridiculous!), so it may be worth adding more storage in order to ensure your DR site performance is optimal.
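To size that concretely, a small PowerShell sketch that sums the N largest files in a replicated folder – the path and N are placeholders (9 would match the Win2003 R2 guidance above, 32 the Win2008 guidance):

```powershell
# D:\Data is a hypothetical replicated folder path.
$n = 9
$sum = (Get-ChildItem -Path 'D:\Data' -Recurse |
    Where-Object { -not $_.PSIsContainer } |
    Sort-Object -Property Length -Descending |
    Select-Object -First $n |
    Measure-Object -Property Length -Sum).Sum
'{0:N1} GB minimum staging quota' -f ($sum / 1GB)
```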

– Ned

10 years ago

Jeff.Miles

We’re having an issue between two DFSR members across a WAN link, where beginning about the middle of the day the backlog in one direction (from the hub to the spoke) begins climbing. It only clears out after the workday has finished.

I think this may be related to cause #6 of this blog post, with sharing violations. We do receive a large amount of sharing violation warnings on both members, mostly from AutoCAD DWG files as our users work directly off the server.

The functional mode is Windows 2000 Server, so only 4 files can be transferred at a time, correct? If so, will DFSR keep retrying the same 4 locked files until it is able to push them through, or will it move on to other backlogged files? If so, can this be tweaked at all – for example, how long to skip the locked files?

Functional mode won’t matter, but being Win2003 R2 versus Win2008 will matter quite a bit. Win2008 can replicate at least 4 times faster inbound (16 inbound files at once), and actually typically replicates around 10 times faster (asynchronous RPC improvements). Lots of sharing violations are part of the issue. Being Win2003 R2 is the other part.

DFSR will move on to other files, but periodically retry the previously locked ones. If a lot of files are locked (hundreds, thousands) all the time, it could really bog down, as it will spend a great deal of time retrying files that are not going to replicate until the user unlocks them. But no, it will never totally halt as long as there’s work to do and files that can be worked.

– Ned

10 years ago

Jeff.Miles

Thanks for that info. We are moving to upgrade the namespaces to Win2008 (I had emailed you previously about that), so maybe this will accelerate it.

I’ll have to turn on the EnableAudit function of DFSR to be sure, but I don’t think too many files are getting through. Today the backlog has risen from 4 files at 9AM, to 600 files at 2PM, to 900 files now at 4:30PM.

10 years ago

turg77

Well, I guess we’ll have to break down and open a support ticket with Microsoft… However, for informational purposes, we’ve had nothing but problems since "upgrading" our Win2003R2 DFSR to Win2008 DFSR. Clean installs across the board. It seems multiple times daily we get calls about missing files which turn out to be in the ConflictAndDeleted folder. It seems the conflict resolution algorithm thinks nothing is a winning file… Ugh! Perhaps Win2008R2 will bring happiness, but SP2 didn’t.

We’ve found at least two 3rd party applications that cause that – their odd create/rename/delete behavior makes DFSR delete folders incorrectly. All the more reason to open a case. To date, no DFSR ‘all by itself’ deletions of folders in 2003 or 2008 though.

– Ned

10 years ago

turg77

Ned,

Are you talking about something like an anti-virus app? Because we’re pretty much an XP/Vista and Office 2003/2007 shop. Most files that disappear are .doc or .xls, or folders with those types of documents in them.

Thanks, Ned. Is there an article on the preferred method to back up a volume replicated by DFSR? My head is barely above water right now, so I haven’t been able to call support. However, I’m thinking our backup software is screwing with the journal, which is causing problems for DFSR. Feasible? We got this Event ID 2206, DFSR, this morning:

The DFS Replication service successfully recovered from an NTFS change journal wrap or loss on volume E:.

Ouch. There is only one method – using a VSS writer. If your backup software doesn’t use that, it’s unsupported.

Journal loss == bad bad bad. 99% of the time, that is due to failing hardware.

10 years ago

turg77

Head in hands… I was thinking about changing all my kids’ names to Ned before that "good" news… I might be in the one percent area, though, because we also got the same error on a different hardware volume (same server) that has the sysvol_dfsr on it. And, the events were posted on both servers in the replication groups…

For the 1% – was the DFSR service off for several hours/days and a ton of files modified in the meantime?

10 years ago

turg77

It seems to have stopped replicating last night, which it seems to do from time to time (backup software?). It usually comes back online. But it hasn’t yet this morning. I’ve uninstalled the backup software client.

10 years ago

turg77

I’ve been put in the callback queue for your team… Thanks.

10 years ago

turg77

Hi Ned: I’m guessing you review support cases, but just in case… We’ll know for sure over the next couple days, but after about eight hours on the phone it looks like the combination of the DFSR setting "Move deleted files to Conflict and Deleted folder" and some other settings (and perhaps Excel 2003?) were causing my problems. I’m hoping for the best… Thanks, Jason

10 years ago

turg77

Is the ConflictAndDeleted folder dynamic? While monitoring it on one of my servers in the DFSR group, I see files go in and out of it. Support wasn’t able to answer the question.

Step 4. I run through the DFS setup and tell it to replicate the remote office to the datacenter server.

The problem is that every time I do this the DFS service performs the following on every single file…

"The DFS Replication service detected that a file was changed on multiple servers. A conflict resolution algorithm was used to determine the winning file. The losing file was moved to the Conflict and Deleted folder"

Sometimes we have 350,000 files at an office and it takes DFS about 1 month to perform its "rehash" (for lack of a better word).

What am I doing wrong? Is there a better way to get the data into the datacenter?

Your help is MUCH appreciated!

10 years ago

Jeff.Miles

Zuldan, I think this may be expected behavior. DFSR still has to go through and check each file. Right now I’m performing the same operation as you, except the two servers are on the same LAN and we’re deploying a new server. I robocopied the data from the existing master to the new server, and during initial replication, every file gets a conflict error.

However, the initial replication takes much less time than if we had not pre-seeded the data, and afterwards there are no files in the PreExisting folder, so I think it’s a normal occurrence.

I would imagine it’s taking so long for you because your remote server is still remote, so it has to check every file across the WAN. Over a high-latency link, this will take a while.
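A rough way to sanity-check a pre-seeded copy before enabling replication is to compare content hashes between source and target. Note the caveat: DFSR’s real file signature also covers metadata such as security descriptors and alternate data streams, so matching content hashes here is necessary but not sufficient; this is an illustration, not the DFSR algorithm:

```python
import hashlib
import os

def content_sha1(path, chunk=1 << 20):
    """SHA-1 over the file's data stream, read in 1 MB chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk)
            if not block:
                break
            h.update(block)
    return h.hexdigest()

def preseed_mismatches(source_root, target_root):
    """List relative paths whose pre-seeded copy is missing or whose
    content differs from the source. A mismatch here guarantees a
    conflict; a match does not guarantee DFSR agreement (metadata)."""
    bad = []
    for dirpath, _dirs, files in os.walk(source_root):
        rel = os.path.relpath(dirpath, source_root)
        for name in files:
            src = os.path.join(dirpath, name)
            dst = os.path.join(target_root, rel, name)
            if not os.path.exists(dst) or content_sha1(src) != content_sha1(dst):
                bad.append(os.path.join(rel, name))
    return bad
```

Running this after a robocopy pre-seed and fixing any reported paths can cut down on surprise conflict events during initial replication.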

10 years ago

Jeff.Miles

What does one do when there are a few files stuck in the backlog, when you’re sure that they’re not currently open? We have 3 files in backlog for a replication group. The server pushing out the updates has been restarted multiple times since we’ve seen the backlog. These files won’t replicate to multiple partners, both in the LAN and WAN.

You should not be getting conflict events if the file hashes really do match. If you believe you have gotten security to match perfectly, it could be some other change to the file. This won’t be diagnosable through a few blog comments, please open a support case so we can examine the data more closely.

– Ned

9 years ago

tedkar

Hi Ned! Thanks for all the valuable info.

Have you ever seen and resolved this error: "[ERROR] Failed to execute GetOutboundBacklogFileCount method. Err: -2147217406 (0x80041002)" when running dfsrdiag /backlog?

This is part of an automated script on about 15 servers with only one having the error. What little I was able to find on this error indicated a WMI problem, but when I use WMI Diag the server appears fine. Any suggestions?

> Important note: If you are in the middle of an initial sync, you should not be rebooting your server! All of the above fixes will require reboots. Wait it out, or assume the risk that you may need to run through initial sync again.

Hi Ned!

Does this mean that initial sync is a non-restartable process that can only be completely restarted from scratch every time, instead of being paused and later resumed from the same point?

Not precisely – back in 2003 R2 there were a number of issues that could cause initial sync to not restart at all and be only partially completed. In 2008 this issue is removed. This blog post is extremely old…

9 years ago

Tibido

Hi Ned,

I inherited a network that is using DFSR. The servers are all Windows 2003 R2 SP2. DFSR was working well as far as I know, but now it gives me errors in the logs.

Such as 1. Event ID:5014 with Error: 9033 (The request was cancelled by a shutdown)

2. Event ID:5014 with Error: 1726 (The remote procedure call failed.)

3. Events 5008, 5012, 6802.

I just installed KB950224-v3 and am hoping that will resolve my issues. 🙂

If you can offer any help or suggestions, it would be greatly appreciated.

Regards,

Tibido

9 years ago

Basheer


Hello Ned,

How are you? Thanks for supporting people on DFS issues.

I am redesigning DFS at a customer site; they have 3 DFS servers running Windows 2003 R2.

Server1 (hub)

Server2 (spoke)

Server3 (Spoke)

Now I have restructured the replication folders;

like before:

folder D:\root\Research – was configured in replication group Research

subfolder D:\root\Research\tools – was configured in a separate replication group called tools

Now, to remove this inconsistency, I have deleted the tools replication group, since the parent folder is already replicating the same data through a separate RG, "D:\root\Research".

Now I have discovered that the D:\root\Research\tools folder doesn’t exist on the spoke servers (Server2 and Server3).

I have checked the event logs and DFSR logs, but could not find the reason for this.

As a workaround I found that if I move this folder to another location and move it back, replication gets started and works fine. I did the same for a small folder, "D:\root\Research\rest", and it worked fine and now exists on all the spoke servers (Server2 and Server3).

I don’t want to use this workaround on the big 400 GB D:\root\Research\tools folder; can you tell me how to troubleshoot this issue?

In this excellent article you reference KB968429 — List of currently available hotfixes for Distributed File System (DFS) technologies in Windows Server 2008 and in Windows Server 2008 R2. That is a really valuable article… Or it was, before it stopped being updated last spring. More and more DFS-R hotfixes come out these days and none of them get referenced in KB968429.

It sounds like there is a database issue. If you don’t want to use that workaround, you will need to open a support case so that the environment can be examined in detail.

@ Artem:

Yep. Don’t worry, several updates to that KB are on the way. The whip got cracked on this a week ago, your timing is excellent. 🙂

9 years ago

Basheer

Hi Ned,

Thanks for the information.

I have used the same workaround. Now this huge data has to pass through wan link.

Further, I see a hard disk quota is enabled with only 5GB free, so now I am changing the staging folder’s default path to speed up replication and increasing the size of the staging folder.

regards

Basheer

9 years ago

Basheer

Hi Ned,

That workaround of moving files did not work. After some time it again cleared all these folders from the two partners, Server1 and Server2. Now I am not sure how to proceed; can you please suggest what to do?

One of my 17 replication groups stopped replicating after one of the servers involved in replication was restarted a few times.

In DFS Replication – Health Report, I receive the msg below:

The DFS Replication service is restarting frequently.

Affected replicated folders: All replicated folders on this server.

Description: The DFS Replication service has restarted 5 times in the past 7 days. This problem can affect the replication of all replicated folders to and from this server. Event ID: 1004

Last occurred: Monday, 22 February 2010 at 07:40:12 (GMT-3:00)

Suggested action: If you restarted the service manually, you can safely ignore this message. For information about troubleshooting frequent service restart issues, see The Microsoft Web Site.

After several of these restarts on one of the servers, files are no longer being replicated to the receiving member.

How can I solve this problem?

Regards,

Bruno.bbc

9 years ago

sweech

Hi Ned,

Do you think adding some supplementary hubs can improve data replication speed? I have one primary world server, 3 regional servers replicating from this primary, and about 40, 30 and 30 servers replicating from these 3 regional servers. What about adding one supplementary hub to each of the 3 regional servers? Would this help?

What direction is the data primarily flowing – from the 100 spokes towards the 1 primary? The 3 regional hubs could be overloaded by 30+ spokes if the regional was inbound replicating. With 2003 R2 it could only handle 4 files at a time. With 2008/R2, 16 files by default, and the option to tune up more. If this was all 2008/R2, doubling the layer of regional servers could potentially double replication performance.

BTW, I will be creating a new DFSR tuning blog post in the next few weeks; it covers more about this.
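As a back-of-envelope illustration of why the inbound-slot count and hub count matter (the slot numbers come from the reply above: 4 concurrent downloads on Win2003 R2, 16 by default on Win2008/R2; the per-file transfer time is a made-up parameter, and real throughput also depends on bandwidth, latency, and file sizes):

```python
def hours_to_drain(backlog_files, secs_per_file, download_slots, hubs=1):
    """Crude estimate of how long a regional tier takes to drain a
    backlog: files are spread evenly across `hubs` servers, each of
    which installs `download_slots` files concurrently."""
    per_hub = backlog_files / hubs
    return per_hub * secs_per_file / download_slots / 3600.0

# 1000 backlogged files at 36 s each on one Win2003 R2 hub (4 slots):
print(hours_to_drain(1000, 36, 4))           # → 2.5 hours
# Doubling the hubs halves it; 16 slots (Win2008 default) quarters it:
print(hours_to_drain(1000, 36, 4, hubs=2))   # → 1.25 hours
print(hours_to_drain(1000, 36, 16))          # → 0.625 hours
```

The model is linear on purpose: it shows why either adding a hub layer or moving to Win2008’s higher concurrency gives a proportional win, and why doing both compounds.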

9 years ago

sweech

Hi Ned,

Thanks for your feedback. So the flow is from the world primary towards the spokes (through the regionals). Files/folders are only updated on the world primary server and replicated to the spokes; the purpose is to speed up replication from this server to the others. All servers are 2003 R2. Well, correct me if I am wrong, but according to you the approach of adding supplementary hubs is not a good one? What would you recommend?

We’re just getting into the business of using DFS in our environment. We’re all upgraded to the latest and greatest, 2008 domain and using 2008 R2 servers for our DFS hosts.

Here’s my question. We are hosting quite a bit of data (700+ GB, 6,000,000+ Files, in 3,000,000 Folders). Now I’ve looked at the documentation, which is telling me that 2008 DFS doesn’t have any limits and to only watch performance. All of this data is in 1 replication group. The reason for doing this is that we are using group policy redirection, which points to the sub folders within this location. So if we broke the namespace up into sub letter target folders instead of 1 massive target folder, we wouldn’t easily be able to continue to use these policies. And to let you know, both of these servers are in the same location and we aren’t ever going to use another across a WAN or in a remote location.

With all of that being said, obviously initial replication takes a while along with some massive data copies as we are mirroring our production environment to keep the data fresh. I was wondering if there is any way to make replication go any faster during these massive file copies? Especially because we aren’t concerned with bandwidth usage, we’d prefer that they go near 100% if they could.

And do you have any recommendations for our DFS design? The good part is when initial and mirrored replication is done however, it does replicate smaller changes very quickly.

With the huge amount in the replication group, I’ve also seen the servers take a long time to rebuild their databases if one becomes corrupt in some way. Do you have any recommendations besides keeping A/V away from them, to keep them safe?

2) Make sure that storage is functioning as expected and that all firmware and drivers are up-to-date.

Thanks for following our blog!

–Jonathan

9 years ago

mclegg

Hi.

We’re trying to use the dfsrdiag backlog command to gather some trends about our replication topology, and find that the backlog filenames are not listed if we run the dfsrdiag backlog command under Windows 7 or 2008 R2. It works fine under straight 2008.

All we get is something along the lines of…

Member <server> Backlog File Count: 4

Backlog File Names (first 4 files)

but no filenames.

I know it’s only a trivial thing, but it is irritating having to RDP to 2008 server to get the file list when we should be able to do this from a local client.
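If you are scripting trend collection, it can help to parse the dfsrdiag output rather than eyeball it. A sketch, assuming the output shape shown above (the numbered file-name lines are an assumption about the format; adjust the regex to whatever your version actually prints):

```python
import re

def parse_backlog(output):
    """Pull the backlog count and any listed file names out of
    `dfsrdiag backlog` text output. The 'Backlog File Count' line
    appears even when the file-name list is missing (as in the
    Win7/2008 R2 case above), so the count is always usable for
    trending."""
    m = re.search(r"Backlog File Count:\s*(\d+)", output)
    count = int(m.group(1)) if m else None
    names = []
    in_list = False
    for line in output.splitlines():
        if "Backlog File Names" in line:
            in_list = True
            continue
        if in_list:
            line = line.strip()
            if not line:
                continue
            m2 = re.match(r"\d+\.\s+(.*)", line)
            if m2:
                names.append(m2.group(1))
            else:
                in_list = False  # past the file-name list
    return count, names
```

Feeding the counts into a CSV every 15 minutes or so gives the kind of trend data (4 files at 9AM, 600 at 2PM…) discussed earlier in the thread without manual sampling.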

This works fine when I run it on 2008 R2 – please be more specific in your repro steps. Are you saying it only doesn’t work when the dfsrdiag backlog is run on a 2008 R2 server and is pointing rmem/smem to a 2008 NON-R2 server?

9 years ago

mclegg

Hi Ned, thanks for the response.

It only seems to fail when I run dfsrdiag on a W7 or 2008 R2 machine. The rmem/smem are a combination of 2008, 2008R2 and 2003.

As a specific example.

smem 2008 (x64 non-R2)

rmem 2003 (x64 R2)

No file list is produced when running dfsrdiag on W7, or 2008R2, but running the same command on 2008 (non-R2) or 2003 is fine.

Ah. I am able to reproduce this, when running the new DFSRDIAG against *non-2008 R2* servers. If the smem/rmem are 2008 R2, it works fine.

There were a bunch of changes in 2008 R2 to make the backlog command work faster/better, it looks like this new version of dfsrdiag is not fully backwards compatible. I’ll look into this a bit more to see what’s up.

Nice catch, thanks. 🙂

9 years ago

scorchtoggs

Hi Ned,

I am having some issues with my DFSR environment. We are running Windows 2003 Server R2 SP2 on all of the servers. We have NOT applied all of the relevant DFS hotfixes and patches per KB article 958802. There are about 11 patches on that list, of which I have only applied KB933061. We are running into a problem where several hundred files/folders are NOT replicating for some reason and they are being dumped into the ConflictAndDeleted folder with Event ID 4302. Here is the information from the DFSReport.

scorchtoggs, I deleted your post. Don’t be alarmed, it’s only because you posted your case # in there. You should treat that like a social security number and never post it publicly – other people could use it to get support and you will get the bill.

Please continue working with your support folks. You can also ask them for escalation if you are not making progress.

5 years ago

Anonymous

Warren here again. This is a quick reference guide on how to calculate the minimum staging area needed

5 years ago

Anonymous

Pingback from DFS – Logs de eventos 4202, 4204, 4206, 4208 e 4212.

4 years ago

Anonymous

My name is Bryan Zink and I am a Microsoft Premier Field Engineer focused on supporting Windows Server

4 years ago


Anonymous

Top 10 Common Causes of Slow Replication with DFSR – Ask the Directory Services Team – Site Home – TechNet Blogs

4 years ago

Anonymous

This is a collection of the top Microsoft Support solutions for the most common issues experienced when

4 years ago
