Implementing Content Freshness protection in DFSR

Hi all, Ned here again. Starting in Windows Server 2008, DFSR supports a protective mechanism called “Content Freshness”. Today I’ll discuss this protection, how to implement it, and what to do when it swings into operation.

Background

Content Freshness is an admin-defined setting that you configure on a per-computer basis when using DFSR on Windows Server 2008 or later – it does not exist on Windows Server 2003 R2. The DFSR database has a record for each Replicated Folder (RF) called CONTENT_SET_RECORD. This record contains a timestamp called “LastConnected”. We store this record on a per-Replicated-Folder basis because one replicated folder can be current, since it is connected to the other members in its replication group, while another replicated folder on the same server is stale because it is not connected to the other members in its replication group. Every day, DFSR updates this timestamp to show that the opportunity for replication occurred. When attempting replication for an RF between computers, the DFSR service checks whether the last time replication was allowed is older than the freshness date. If the last-allowed-replication date is newer, it replicates. If it’s not, we block replication.
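To make that check concrete, here is a minimal sketch in Python. It is a model of the decision described above, not the DFSR service's actual code. The function name and parameters are hypothetical. Treating a value of zero as "protection off" follows this article, while allowing replication at exactly the limit is an assumption.

```python
from datetime import datetime, timedelta


def replication_allowed(last_connected, now, max_offline_days):
    """Return True if replication should proceed for one replicated folder.

    last_connected   -- the RF's LastConnected timestamp (hypothetical input)
    now              -- current time on the server
    max_offline_days -- MaxOfflineTimeInDays; 0 means the protection is off
    """
    if max_offline_days == 0:
        # Protection disabled: never block on staleness.
        return True
    # Block only when the folder has been disconnected longer than the limit.
    # Allowing replication at exactly the limit is an assumption of this sketch.
    return now - last_connected <= timedelta(days=max_offline_days)
```

For example, a folder last connected 31 days ago with a 60-day limit is allowed to replicate, while one last connected 76 days ago is blocked.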

By now, you’re asking yourself “why would I want to block replication?” Good question. DFSR has a JET database just like Active Directory, and it uses multi-master replication just like AD. This means that it must use tombstones so that deletions can replicate. When a file is deleted in DFSR, the local database records the deletion as a tombstone in the database – a logical deletion. After 60 days DFSR garbage collects the record from the database and it is truly gone – a physical deletion. Online defragmentation of the database can now reclaim that whitespace. The 60 days allows all the replication partners to learn about the deletion and act on it.
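The difference between a logical and a physical deletion can be sketched in a few lines of Python. This is an illustrative model only: the dictionary "database", the record fields, and the function names are all hypothetical and do not reflect the real JET schema.

```python
from datetime import datetime, timedelta

# 60 days, as described in this article.
TOMBSTONE_LIFETIME = timedelta(days=60)


def delete_file(db, name, now):
    """Logical deletion: keep the record but mark it as a tombstone."""
    db[name] = {"tombstone": True, "deleted_at": now}


def garbage_collect(db, now):
    """Physical deletion: drop tombstones older than the lifetime.

    Returns the names of the records that were removed.
    """
    expired = [
        name
        for name, record in db.items()
        if record["tombstone"] and now - record["deleted_at"] > TOMBSTONE_LIFETIME
    ]
    for name in expired:
        del db[name]
    return expired
```

Once `garbage_collect` has removed a tombstone, nothing in this server's database records that the file was ever deleted, which is why a partner that missed the deletion can no longer learn about it.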

And herein lies the problem. If a DFSR server cannot replicate an RF for more than 60 days, but then replication is allowed later, it can replicate out old deletions for files that are actually live, or replicate out stale data and overwrite existing files. If you’ve ever worked on an Active Directory “lingering object” issue, you have seen what can happen when a DC that was offline for months is brought back up. This is why Strict Replication Consistency was invented for AD – Content Freshness protection serves the same purpose for DFSR.

Being “unable to replicate” can mean any one of these scenarios:

Disabling the replication connections.

Deleting the replication connections (either one-way or in both directions).

Stopping the DFSR service.

Closing the schedule (i.e. setting “no replication”).

Keeping the server shut off.

This whole content freshness idea is novel enough that we went to the trouble of applying for a patent on it.

Implementing Content Freshness Protection

Content Freshness protection is not enabled by default in Windows Server 2008 or Windows Server 2008 R2 (it is enabled by default in Windows Server 2012 and later though!). To turn it on, you simply modify the DfsrMachineConfig setting for MaxOfflineTimeInDays on each DFSR server.
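One way to change that setting is through WMI with wmic.exe. The Python sketch below only builds the command line; it does not run anything. The helper name is hypothetical, and the WMI namespace (root\microsoftdfs) and wmic syntax are assumptions you should verify on your own servers before relying on them.

```python
def wmic_set_max_offline_days(days):
    """Build the wmic.exe argument list that sets MaxOfflineTimeInDays.

    Assumes the DfsrMachineConfig class lives in the root\\microsoftdfs
    WMI namespace; confirm this on the target server.
    """
    if days < 0:
        raise ValueError("days must be zero or positive")
    return [
        "wmic.exe",
        r"/namespace:\\root\microsoftdfs",
        "path", "DfsrMachineConfig",
        "set", f"MaxOfflineTimeInDays={days}",
    ]


# To apply on the local server from an elevated prompt (Windows only):
#   import subprocess
#   subprocess.run(wmic_set_max_offline_days(60), check=True)
```

Joined with spaces, `wmic_set_max_offline_days(60)` gives the command you would type at an elevated prompt.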

Remember, this has to be done on all DFSR servers, as this change only affects the computer itself. This value is not stored in a central AD location, but instead in the DfsrMachineConfig.XML file that resides in the hidden operating system folder “%systemdrive%\system volume information\dfsr\config”.

Remember, by default this protection is OFF, and the value is assumed to be zero if there is no entry for it in DfsrMachineConfig.xml.

Note: Sharp-eyed admins may notice that we actually have an AD attribute stamped on every Replication Group called ms-DFSR-TombstoneExpiryInMin that appears to control tombstone lifetime. It even has the value – in minutes – for 60 days (86,400). Sorry to disappoint you, but this attribute is never read by DFSR and changing it has no effect – tombstone garbage collection is always hard-coded to 60 days in the service and cannot be changed.

Protection in Action

Let’s see how all this works. My repro environment:

A pair of Windows Server 2008 R2 computers named 2008r2-fresh-01 and 2008r2-fresh-02

Replicating in a Replication Group named “RG1”

Using a Replicated Folder named “RF1”

Keeping a few user files in sync.

MaxOfflineTimeInDays set to 60 on 2008r2-fresh-02

Important note: I am going to simulate the offline time by rolling clocks forward. Never ever do this in production – this is for testing and demonstration purposes only. Also, I only set MaxOfflineTimeInDays on one server – you would do this on all servers.

So here’s my data:

Now I stop DFSR on 2008r2-fresh-02 and roll time forward to January 1st, 2010 on both servers – about 75 days from this writing. I then make a few changes on 2008r2-fresh-02.

And then I start the DFSR service back up on 2008r2-fresh-02.

My changed files do not replicate out

New files do not replicate in

I now have this event:

Log Name: DFS Replication
Source: DFSR
Date: 1/1/2010 3:37:14 PM
Event ID: 4012
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: 2008r2-fresh-02.blueyonderairlines.com
Description: The DFS Replication service stopped replication on the replicated folder at local path c:\rf1. It has been disconnected from other partners for 76 days, which is longer than the MaxOfflineTimeInDays parameter. Because of this, DFS Replication considers this data to be stale, and will replace it with data from other members of the replication group during the next replication. DFS Replication will move the stale files to the local Conflict folder. No user action is required.
Additional Information:
Error: 9061 (The replicated folder has been offline for too long.)
Replicated Folder Name: rf1
Replicated Folder ID: 5856C18F-CA72-4D2D-9D89-4CC1D8042D86
Replication Group Name: rg1
Replication Group ID: BC5976EF-997E-4149-819D-57193F21EC76
Member ID: FAEC4B17-E81F-4036-AAD9-78AA46814606

Note: this event has incorrect wording. The first two sentences in the description are good, but the following sentences are wrong. DFSR does not self-correct this situation, it does not move files into the ConflictAndDeleted folder, and you, the user, have actions you need to take. More on this later.

The above is Content Freshness protection in action. It is protecting your DFSR environment from sending divergent data out to the rest of your working servers.

Recovering DFSR from Content Protection

Important note: Before repairing the blocked replication, get a backup of the data on the affected server and its partners. Failure to do so will tempt Murphy’s Law to disastrous new heights. Understand that by following the steps below, any DFSR data that was on this server and never replicated will be moved to PreExisting and/or ConflictAndDeleted – this server goes through non-authoritative sync again and loses all conflicts with other DFSR servers. You have been warned!!!

Also, whatever is being done to stop replication from working needs to be ironed out – whether it is leaving the service off for months on end or not having any connections. Otherwise this is just going to happen again.

To get things back in order, do the following:

1. Start DFSMGMT.MSC on the affected server.

2. On any affected replication groups this server is a member of, select the computer on the Membership tab and “Disable” it.

3. Accept the warning prompt.

4. If the reason for replication never occurring was the schedule being set to “no replication” on the RG or RF, or no bi-directional connections being in place between servers, fix that situation now.

5. Force AD Replication and verify it has converged.

6. On the affected server, run:

DFSRDIAG.EXE POLLAD

7. Wait for events 4008 and 4114 to be written to the DFSR event log, confirming that the replicated folder(s) are no longer being replicated.

8. In DFSMGMT.MSC, “Enable” the replication again on the affected replicated folders for that server.

9. Force AD replication and POLLAD again.

The server goes through non-authoritative initial sync, as if it were being set up for the first time. All matching data is unchanged and does not replicate. Any files on the server that do not exist on its authoritative partner are moved to the PreExisting folder. Any files on the server that have been changed locally are moved to the ConflictAndDeleted folder and the authoritative server’s copy is replicated inbound.
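The three outcomes described above can be sketched as a small Python function. This is only a model for reasoning about what will happen to your files: the function name is hypothetical, and comparing files by a content hash is a simplifying assumption of the sketch, not a statement of how the service compares them.

```python
def classify_for_initial_sync(local, authoritative):
    """Map each local file to the outcome described in this article.

    local / authoritative -- dict of relative path -> content hash
    (hypothetical inputs; build them however suits your environment).
    """
    outcome = {}
    for path, digest in local.items():
        if path not in authoritative:
            # Exists only on the recovering server.
            outcome[path] = "moved to PreExisting"
        elif authoritative[path] == digest:
            # Identical on both sides: nothing replicates.
            outcome[path] = "unchanged"
        else:
            # Changed locally: the partner's copy wins.
            outcome[path] = "moved to ConflictAndDeleted"
    return outcome
```

Running something like this against a file listing from both servers before step 1 gives you a rough idea of which local files you may want to copy aside first.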

The Sum Up

Content Freshness protection is a good thing and putting it in place may someday save you some real pain. Trust me – we work cases here where Content Freshness being enabled would have stopped huge problems. All it takes is Windows Server 2008 or later, and a few moments of your time.


We have a main office & 3 branches. Very recently we had power issues at 2 of the branches. Power company problem, power was up & down for a couple of days. Down long enough to drain battery backups.

All offices have their own WinSvr 08. Replication had been set up and running fine for several months until the power problems. Now those 2 branches are not replicating. The only error that is showing up in the event viewers is on the main server and it is event 5002, ‘The DFS Replication Service Encountered An Error Communicating With Partner Svr2 For Replication Group Main’.

There are no problems with the network connections across the WAN links, as all other programs are running fine; for example, AD replicated just fine. I have restarted the services several times. I have also disabled the replication links for each of these offices, updated the AD, then re-enabled the links etc., just like the recovery described in the blog here. It will then reconnect, but about 30 minutes later it goes into the same 5002 error.

The only thing I can think of to try that I haven’t yet is to totally remove these 2 offices from the replication sets, and then set them up all over

Recreating the replication group is unlikely to help you. You are likely having a network problem, but not in totality – specific kinds of RPC-aware systems (such as firewalls, intrusion protection, and other products) can make distinctions based on the RPC UUID being used. They can also block specific ports, so that only one application will be affected but others will not. The 5002 error isn’t enough; I need the extended error. It may say ‘access denied’ or ‘security specific package error has occurred’ or ‘no more endpoints available’ or other things.

If that extended error you respond with doesn’t give an obvious answer, please open a support case to have this investigated; it will require significant analysis, network captures, port examination, installed 3rd party examination, etc etc. Prepare to spend some time on this.

Ned

9 years ago

KennyOats

Thanks for the quick reply Ned.

The following are the 2 specific errors that show up in the event viewer on the main server (Ntserver). The remote server is named Lebanon.

This is what comes up first after restarting:

Log Name: DFS Replication
Source: DFSR
Date: 12/8/2009 10:36:08 AM
Event ID: 5014
Task Category: None
Level: Warning
Keywords: Classic
User: N/A
Computer: NTSERVER.csoac.local
Description: The DFS Replication service is stopping communication with partner LEBANON for replication group CSOAC Main due to an error. The service will retry the connection periodically.