Elsewhere

If the topology is set up for hub and spoke, and the spoke were to accidentally delete an item, this should not reflect back to the hub, correct? This should be a one way transfer. What we are seeing is our hub replicates out to the spokes perfectly, but if the spoke deletes an item, the item is then deleted from our hub share. It seems to be acting like a full mesh topology, but it was originally set up as hub and spoke.

The behavior the customer describes is by design. Because DFS Replication is a multimaster replication engine, any change made on any spoke is replicated back to the hub and to the other spokes. To prevent changes from occurring on spokes, we recommend using shared folder permissions.

I too had always thought a hub-spoke design means the hub is the master server. But now I realize how wrong I was. Basically a hub-spoke or full mesh topology only determines the sync path – it does not denote which server is the master and which servers are slaves. DFSR, like AD, has no master or slave.

In a hub-spoke replication topology, two spoke servers will sync with each other via the hub server – that’s all! Neither spoke is “inferior” to the hub in any way.
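To make the "sync path only" point concrete, here is a minimal Python sketch. It is a toy model and not how DFSR is implemented: the member names and the `members_reached` helper are made up for illustration, and it assumes every connection replicates in both directions. It only shows that a change made on a spoke still reaches the hub and, through it, the other spokes.

```python
from collections import deque

def members_reached(connections, origin):
    """Return the members a change made on `origin` reaches, in breadth-first order.

    `connections` is a list of (member_a, member_b) pairs; each pair is treated
    as a two-way replication connection (an assumption of this toy model).
    """
    neighbours = {}
    for a, b in connections:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    seen = {origin}
    order = [origin]
    queue = deque([origin])
    while queue:
        member = queue.popleft()
        for peer in sorted(neighbours.get(member, ())):
            if peer not in seen:
                seen.add(peer)
                order.append(peer)
                queue.append(peer)
    return order

# Hypothetical hub-and-spoke topology: spokes only connect to the hub.
hub_and_spoke = [("HUB", "SPOKE1"), ("HUB", "SPOKE2"), ("HUB", "SPOKE3")]

# A delete made on SPOKE2 travels to the hub first, then to the other spokes.
reached = members_reached(hub_and_spoke, "SPOKE2")
print(reached)
```

The topology changes the route a change takes (spoke, then hub, then the remaining spokes), not whether it is accepted.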

One of our DCs at work had the following DFSR warnings in the DFS Replication logs:

The DFS Replication service stopped replication on volume C:. This occurs when a DFSR JET database is not shut down cleanly and Auto Recovery is disabled. To resolve this issue, back up the files in the affected replicated folders, and then use the ResumeReplication WMI method to resume replication.

Recovery Steps
1. Back up the files in all replicated folders on the volume. Failure to do so may result in data loss due to unexpected conflict resolution during the recovery of the replicated folders.
2. To resume the replication for this volume, use the WMI method ResumeReplication of the DfsrVolumeConfig class. For example, from an elevated command prompt, type the following command:
wmic /namespace:\\root\microsoftdfs path dfsrVolumeConfig where volumeGuid="56234A2C-C156-11E2-93E8-806E6F6E6111" call ResumeReplication

For more information, see http://support.microsoft.com/kb/2663685.

Sounded like an easy fix, so I went ahead and tried resuming replication as directed. That didn’t work though. Got the following:

The DFS Replication service stopped replication on the folder with the following local path: D:\SYSVOL_DFSR\domain. This server has been disconnected from other partners for 154 days, which is longer than the time allowed by the MaxOfflineTimeInDays parameter (60). DFS Replication considers the data in this folder to be stale, and this server will not replicate the folder until this error is corrected.

To resume replication of this folder, use the DFS Management snap-in to remove this server from the replication group, and then add it back to the group. This causes the server to perform an initial synchronization task, which replaces the stale data with fresh data from other members of the replication group.

Check out this Microsoft blog post for content freshness and the MaxOfflineTimeInDays parameter. You can’t simply remove SYSVOL from DFSR replication groups via the GUI as it is a special folder, so you have to work around it. I found some forum posts and blog posts that suggested simply raising this parameter for the broken server to a number larger than the number of days it’s currently been offline (154 in the above case) and then resuming replication. I wasn’t too comfortable with that. What if any older changes from this server now replicate to the other servers? That could cause more damage than it’s worth. I don’t think this will happen, but why take a risk? What I really want is to force a replication onto this server from some other server. Do a non-authoritative replication, basically. So I followed the steps in this article and that worked.
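The freshness check behind the event above can be sketched in a couple of lines of Python. This is only an illustration of the comparison the message describes ("longer than the time allowed"); the function name is hypothetical and the real service may differ in detail.

```python
def is_content_stale(days_offline, max_offline_days=60):
    """True when the member has been disconnected longer than MaxOfflineTimeInDays.

    60 is the value quoted in the event text above; the strict '>' comparison
    is an assumption based on the wording "longer than the time allowed".
    """
    return days_offline > max_offline_days

# The server in the event had been disconnected for 154 days.
print(is_content_stale(154))        # stale with the limit of 60
print(is_content_stale(154, 155))   # raising the limit past 154 hides the condition
```

The second call is the "just raise the parameter" workaround: the check passes, but nothing about the data on the server has become any fresher.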

A non-authoritative sync is like a regular sync, just that it is rigged to let the source win. :p So all the existing files on the destination server are preserved. The event log gets filled with entries like these:

The DFS Replication service detected that a file was changed on multiple servers. A conflict resolution algorithm was used to determine the winning file. The losing file was moved to the Conflict and Deleted folder.

Had an interesting problem at work yesterday about which I wish I could write a long and interesting blog post, but truthfully it was such a simple thing once I identified the cause.

We use AppV for streaming applications. We have many branch offices so there’s a DFS share which points to targets in each office. AppV installations in each office point to this DFS share and thanks to the magic of DFS referrals correctly pick up the local Content folder. From day-before, however, one of our offices started getting errors with AppV apps (same as in this post), and when I checked the AppV server I found errors similar to this in the Event Logs:

Empty package map for package content root [\\domain.local\dfs\Content]

The DFS share seemed to be working OK. I could open it via File Explorer and its contents seemed correct. I checked the number of files and the size of the share and they matched across offices. If I pointed the DFS share to use a different target (open the share in File Explorer, right click, Properties, go to the DFS tab and select a different location target) AppV works. So the problem definitely looked like something to do with the local target, but what was wrong?

I tried forcing a replication, checked permissions, and used tools like dfsrdiag to confirm things were alright. No issues anywhere. Restarting the DFS Replication service on the server threw up some errors in the Event Logs about some AD objects, so I spent some time chasing up that tree (looks like older replication groups that were still hanging around in AD with missing info but not present in the DFS Management console any more) until I realized all the replication servers were throwing similar errors. Moreover, adding a test folder to the source DFS share correctly resulted in it appearing on the local target immediately – so obviously replication was working correctly.

Later, while sitting through a boring conference call I had a brainwave that maybe the AppV service runs in a different user context and that may not be seeing the DFS share? As in, maybe the error message above is literally what is happening. AppV is really seeing an empty content root and it’s not a case of a corrupt content root or just some missing files?

So I checked the AppV service and saw that it runs as NT AUTHORITY\NETWORK SERVICE. Ah ha! That means it authenticates with the remote server with the machine account of the server AppV is running on. I thought I’d verify what happens by launching File Explorer or a Command Prompt as NT AUTHORITY\NETWORK SERVICE but this was a Server 2003 and apparently there’s no straightforward way to do that. (You can use psexec to launch something as .\LOCALSYSTEM and starting from Server 2008 you can create a scheduled task that runs as NT AUTHORITY\NETWORK SERVICE and launch that to get what you want but I couldn’t use that here; also, I think you need to first run as the .\LOCALSYSTEM account and then run as the NT AUTHORITY\NETWORK SERVICE account). So I checked the Audit logs of the server hosting the DFS target and sure enough found errors that the machine account of the AppV server was indeed being denied login:

I fired up the Local Security Policy console on the server hosting the DFS target (it’s under the Administrative Tools folder, or just type secpol.msc). Then went down to “Local Policies” > “User Rights Assignment” > “Access this computer from the Network”:

Sure enough this was limited to a set of computers which didn’t include the AppV server. When I compared this with our DFS servers I saw that they were still on the default values (which includes “Everyone” as in the screenshot above) and that’s why those targets worked.

To dig further I used gpresult and compared the GPOs that affected the above policy between both servers. The server that was affected had this policy modified via GPO while the server that wasn’t affected showed the GPO as inaccessible. Both servers were in the same OU but upon examining the GPO I saw that it was limited to a certain group only. Nice! And when I checked that group our problem server was a member of it while the rest weren’t! :)

Turns out the server was added to the group by error two days ago. Removed the server from this group, waited a while for the change to replicate across the domain, did a gpupdate on the server, and tada! Now the AppV server is able to access the DFS share on this local target again. Yay!

Moral of the story: if one of your services is unable to access a shared folder, check what user account the service runs as.

LocatorCheck

Checks whether DCs have certain required knowledge/ ability. Specifically, whether the DC that’s tested knows of or can be a:

The Global Catalog (GC)

The Primary Domain Controller (PDC)

Kerberos Key Distribution Centre (KDC)

Time Server

Preferred Time Server

By itself the test doesn’t output much info:

C:\Windows\system32>dcdiag /test:LocatorCheck /s:win-dc03

Running enterprise tests on : rakhesh.local
   Starting test: LocatorCheck
      ......................... rakhesh.local passed test LocatorCheck

To get more details one has to use the /v switch. Then output similar to the following will be returned:

Starting test: LocatorCheck
   GC Name: \\WIN-DC01.rakhesh.local
   Locator Flags: 0xe000f3fd
   PDC Name: \\WIN-DC01.rakhesh.local
   Locator Flags: 0xe000f3fd
   Time Server Name: \\WIN-DC03.rakhesh.local
   Locator Flags: 0xe000f1f8
   Preferred Time Server Name: \\WIN-DC01.rakhesh.local
   Locator Flags: 0xe000f3fd
   KDC Name: \\WIN-DC03.rakhesh.local
   Locator Flags: 0xe000f1f8
   ......................... rakhesh.local passed test LocatorCheck

Note that the DC itself needn’t be offering one of these services. But it must know who else offers them and be able to refer to them. For instance, in the case of my domain WIN-DC03 (the server I am testing against) isn’t a GC or PDC so it returns WIN-DC01 for these. It is a time server, but is not a preferred time server (as that’s the forest root domain PDC), so the output reflects that.
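The Locator Flags values in the verbose output are bitmasks. A short Python sketch can decode them into the flag names that nltest prints. The bit values below are written from memory of the DsGetDcName flag definitions, so treat them as an assumption to check against the official documentation; the short names follow nltest's output and the `decode_locator_flags` helper is hypothetical.

```python
# Bit values assumed from the DsGetDcName DOMAIN_CONTROLLER_INFO flag definitions.
LOCATOR_FLAGS = [
    (0x00000001, "PDC"),
    (0x00000004, "GC"),
    (0x00000008, "LDAP"),
    (0x00000010, "DS"),
    (0x00000020, "KDC"),
    (0x00000040, "TIMESERV"),
    (0x00000080, "CLOSE_SITE"),
    (0x00000100, "WRITABLE"),
    (0x00000200, "GTIMESERV"),
    (0x00001000, "FULL_SECRET"),
    (0x00002000, "WS"),
    (0x00004000, "DS_8"),
    (0x00008000, "DS_9"),
    (0x20000000, "DNS_DC"),
    (0x40000000, "DNS_DOMAIN"),
    (0x80000000, "DNS_FOREST"),
]

def decode_locator_flags(value):
    """Return the names of all known flags set in `value`."""
    return [name for bit, name in LOCATOR_FLAGS if value & bit]

dc01 = decode_locator_flags(0xE000F3FD)  # value reported for WIN-DC01
dc03 = decode_locator_flags(0xE000F1F8)  # value reported for WIN-DC03

print(dc01)
print(dc03)
# Flags WIN-DC01 advertises that WIN-DC03 does not.
print(sorted(set(dc01) - set(dc03)))
```

Under these assumed bit values, the only difference between the two masks is the PDC, GC, and GTIMESERV bits, which matches the roles described for the two servers.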

(In this case the router between the two sites was shut down and so Intersite replication was failing. Hence the errors above.)

This test doesn’t seem to force an Intersite replication. It only connects to the servers and checks for errors, I think. For instance, when I turned on the router above and verified the two DCs can see each other, forced an enterprise-wide replication (repadmin /syncall win-dc01 /e /A, which tells WIN-DC01 to ask all its partners to replicate, enterprise-wide, all NCs), and double checked the replication status (repadmin.exe /showrepl WIN-DC01) – everything was working fine, but the Intersite test still complained. Not the same errors as above, but different errors. The test passes but there are warnings that each site doesn’t have a Bridgehead yet because of errors. After about 15 mins the errors clear.

Intersite replication, Bridgeheads, and InterSite Topology Generators (ISTG) are part of later posts.

KccEvent

Checks whether the Knowledge Consistency Checker (KCC) has any errors.

This test only checks the “Directory Services” event log of the specified server for any errors in the last 15 mins. (If you run the test with the /v switch it even says so).

KnowsOfRoleHolders

Checks whether the DC knows of the various Flexible Single Master Operations (FSMO) role holders in the domain. (FSMO is part of a later post so I won’t elaborate on it here).

By default the answer is just a pass or fail.

Use with the /v switch to know what the DC thinks it knows:

Starting test: KnowsOfRoleHolders
   Role Schema Owner = CN=NTDS Settings,CN=WIN-DC01,CN=Servers,CN=COCHIN,CN=Sites,CN=Configuration,DC=rakhesh,DC=local
   Role Domain Owner = CN=NTDS Settings,CN=WIN-DC01,CN=Servers,CN=COCHIN,CN=Sites,CN=Configuration,DC=rakhesh,DC=local
   Role PDC Owner = CN=NTDS Settings,CN=WIN-DC01,CN=Servers,CN=COCHIN,CN=Sites,CN=Configuration,DC=rakhesh,DC=local
   Role Rid Owner = CN=NTDS Settings,CN=WIN-DC01,CN=Servers,CN=COCHIN,CN=Sites,CN=Configuration,DC=rakhesh,DC=local
   Role Infrastructure Update Owner = CN=NTDS Settings,CN=WIN-DC01,CN=Servers,CN=COCHIN,CN=Sites,CN=Configuration,DC=rakhesh,DC=local
   ......................... WIN-DC01 passed test KnowsOfRoleHolders

Good test to run after a role change to see whether all DCs in the domain/ enterprise know of the new role holder.
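The role owner values in the verbose output are DNs of NTDS Settings objects, which is not the friendliest thing to compare across DCs. A small Python sketch, assuming the DN layout shown in the output above (server name right before CN=Servers, site name right after it), pulls out the server and site. The helper name is made up, and the naive comma split would break on names containing escaped commas.

```python
def parse_role_owner(dn):
    """Extract (server, site) from an NTDS Settings DN as printed by dcdiag /v."""
    # Naive split: assumes no escaped commas inside the RDN values.
    values = [part.split("=", 1)[1] for part in dn.split(",")]
    servers_index = values.index("Servers")
    server = values[servers_index - 1]   # CN=<server> precedes CN=Servers
    site = values[servers_index + 1]     # CN=<site> follows CN=Servers
    return server, site

pdc_owner = ("CN=NTDS Settings,CN=WIN-DC01,CN=Servers,CN=COCHIN,"
             "CN=Sites,CN=Configuration,DC=rakhesh,DC=local")

print(parse_role_owner(pdc_owner))
```

Running this against the output of every DC after a role change gives a quick list of which server each DC believes holds the role.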

MachineAccount

Checks whether the DC’s machine account exists, is in the Domain Controllers OU, and Service Principal Names (SPNs) are correctly registered.

This is yet another test that only returns a pass or fail by default. Use with the /v switch to get a list of the registered SPNs.

Notice that the CheckSecurityError test also checks SPNs. CheckSecurityError is only run on demand, however.

Add the /RecreateMachineAccount switch to recreate the machine account if missing. Note: this does not recreate missing SPNs.

Add the /FixMachineAccount switch to fix if the machine account flags are incorrect (am not sure what flags these are …).

NCSecDesc

Checks whether all the Naming Contexts on the DC have correct security permissions for replication.

NetLogons

Checks whether the Netlogon and SYSVOL shares are available and can be accessed.

I pointed out this test previously under the SysVolCheck test. The latter gives the impression it actually checks the SYSVOL shares, but it doesn’t. NetLogons is the one that checks.

ObjectsReplicated

Checks whether the DC’s machine account and DSA objects have replicated. The DC machine account object is CN=<DC name>,OU=Domain Controllers,... in the domain NC; the DSA object is CN=NTDS Settings,CN=<DC name>,CN=Servers,CN=<site name>,... in the configuration NC.

This test is better run with the /a or /e switches. Without these switches it only checks the DC you test against to see whether it has its own objects. With the switches it checks all the objects for all DCs in the site/ enterprise on all DCs in the site/ enterprise. Which is what you really want.

It is also possible to check a specific object via the /objectdn: or limit to DCs holding a specific NC via the /n: switch.

For example:

Check all DCs holding the default naming context (rakhesh.local) across all sites:

>dcdiag /test:objectsreplicated /e /n:rakhesh.local

Check all DCs holding a specified application NC across all sites:

dcdiag /test:objectsreplicated /e /n:SomeApp2.rakhesh.local /s:win-dc02

I had created the SomeApp2 NC previously. It is only replicated to the WIN-DC01 and WIN-DC03 servers so the test above will only check those servers. (To recap: you can find the DCs an NC is replicated to from the ms-DS-NC-Replica-Locations attribute of its object in the Partitions container). Note that I had to specify a server above. That’s because without specifying a DC name there’s no way to identify which DCs know of this NC (Note: “know of”; it’s not necessary that they hold the NC, they only need to know where to point to). Unlike a domain NC which has DNS entries to help identify the DCs holding it, other NCs have no such mechanism. Below is the error you get if you don’t specify a DC name as above:

The distinguished name of the domain is DC=someapp2,DC=rakhesh,DC=local.

Directory Server Diagnosis

Performing initial setup:
   Finding server for domain DC=someapp2,DC=rakhesh,DC=local...
   A Directory Server holding someapp2.rakhesh.local could not be located.
   The error is
   The specified domain either does not exist or could not be contacted.
   Try specifying a server with the /s option.
   ERROR: Could not find the server.

Lastly, it’s also possible to check for the replication status of a specific object. Very useful for testing purposes. Make a test object on one DC, force a replication, wait some time, then test whether that object has replicated to all DCs in your site/ enterprise. (Sure you could connect to each DC via ADUC or ADSIEdit, but this is way more convenient!)

Below command checks whether the specified user account has replicated to all DCs in the domain:

I specify a NC above (the /n switch) because I am running DCDiag from a client so I must specify either a server to use (the /s switch) or a NC based on which a DC can be found. If run from a DC then the NC can be omitted.

OutboundSecureChannels

Checks whether all DCs in the domain (by default only those in the current site) have a secure channel to DCs in the trusted domain specified by the /testdomain: switch.

There seems to be a misunderstanding that this test checks secure channels between DCs of the same domain. That’s not the case, it’s between DCs of two trusted domains.

Use the /nositerestriction switch to not limit the test to all DCs in the same site.

This test is not run by default. It must be explicitly specified.

RegisterInDNS

Checks whether the server being tested can register “A” DNS records. The DNS domain name must be specified via the /DnsDomain: switch.

This test is similar to the DcPromo test mentioned previously.

This test isn’t run by default.

Replications

Checks whether all of the DC’s replication partners are able to replicate to it. By default only those in the same site are tested.

It contacts each of the partners to get a status update from them. The test also checks whether there’s a replication latency of more than 12 hours.

Output from WIN-DC01 in my domain when I disconnected its partner WIN-DC03. WIN-DC02 is not checked as it’s in a different site.

>dcdiag /test:replications /s:win-dc01 /v
<snip>
   Starting test: Replications
      * Replications Check
      [WIN-DC03] DsBindWithSpnEx() failed with error 1722,
      The RPC server is unavailable..
      RPC Extended Error Info not available. Use group policy on the local
      machine at "Computer Configuration/Administrative
      Templates/System/Remote Procedure Call" to enable it.
      ......................... WIN-DC01 failed test Replications

RidManager

Checks whether the DC with the RID Master FSMO role is accessible and contains proper information. Use with the /v to get more details on the findings (allocation pool, next available, etc).

Example output:

>dcdiag /test:RidManager /s:win-dc02 /v
<snip>
   Starting test: RidManager
      * Available RID Pool for the Domain is 2602 to 1073741823
      * WIN-DC01.rakhesh.local is the RID Master
      * DsBind with RID Master was successful
      * rIDAllocationPool is 1602 to 2101
      * rIDPreviousAllocationPool is 1602 to 2101
      * rIDNextRID: 1611
      ......................... WIN-DC02 passed test RidManager

Services

Checks whether the various services AD requires are running on the DC.

Following services are tested:

Name               DisplayName
----               -----------
EventSystem        COM+ Event System
RpcSs              Remote Procedure Call (RPC)
NTDS               Active Directory Domain Services
DnsCache           DNS Client
DFSR               DFS Replication
IsmServ            Intersite Messaging
kdc                Kerberos Key Distribution Center
SamSs              Security Accounts Manager
LanmanServer       Server
LanmanWorkstation  Workstation
w32time            Windows Time
Netlogon           NETLOGON

This list is similar (not same!) to the DC critical services list. Notably it doesn’t check if the “DNS Server” and “AD WS” services are running.
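If you already have a dump of service states from somewhere (a monitoring tool, an inventory export), the same check is easy to reproduce. The Python sketch below is an illustration only: the service short names come from the list above, while the `services_not_running` helper and the sample states are invented for the example.

```python
# Short names of the services the Services test looks at (from the list above).
REQUIRED_SERVICES = [
    "EventSystem", "RpcSs", "NTDS", "DnsCache", "DFSR", "IsmServ",
    "kdc", "SamSs", "LanmanServer", "LanmanWorkstation", "w32time", "Netlogon",
]

def services_not_running(states):
    """Return required services that are absent from `states` or not 'Running'.

    `states` maps a service short name to a status string. Names are compared
    case-insensitively, as Windows does.
    """
    normalized = {name.lower(): status for name, status in states.items()}
    return [name for name in REQUIRED_SERVICES
            if normalized.get(name.lower()) != "Running"]

# Made-up sample: everything is running except DFSR (stopped) and IsmServ (not reported).
sample_states = {name: "Running" for name in REQUIRED_SERVICES}
sample_states["DFSR"] = "Stopped"
del sample_states["IsmServ"]

print(services_not_running(sample_states))
```

Adding the "DNS Server" and "AD WS" services to the list would cover the gap mentioned above, at the cost of no longer mirroring what the DcDiag test checks.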

SystemLog

Checks the System Log for any errors in the last 60 mins (or less if the server uptime is less than 60 mins).

Topology

Checks whether the server has a fully connected topology for replication of each of its NCs.

Note that the test does not actually check if the servers in the topology are online/ connected. For that use the Replications and CutOffServers tests. This test only checks if the topology is logically fully connected.

This test is not run by default. It must be explicitly specified.

VerifyEnterpriseReferences

and

VerifyReferences

Checks whether system references required for the FRS and replication infrastructure are present on each DC. The “Enterprise” variant tests whether references for replication to all DCs in the enterprise are present.

Note: I am not very clear what this test does (but feel free to look at Ned’s blog post for more info) and I have been writing this post over many days so I am too lazy to research further either. :) I’ll update this post later if I find more info on the test.

This test is not run by default. It must be explicitly specified.

VerifyReplicas

Checks whether all the application NCs have replicated to the DCs that should contain a copy.

Seems to be similar to the CheckSDRefDom test but more concerned with whether the DCs host a copy or not.

This post originally began as notes on troubleshooting Domain Controller critical services. But when I started talking about DcDiag I went into a tangent explaining each of its tests. That took much longer than I expected – both in terms of effort and post length – so I decided to split it into a post of its own. My notes aren’t complete yet, what follows below is only the first part.

DcDiag is your best friend when it comes to troubleshooting AD issues. It is a command-line tool that can identify issues with AD. By default DcDiag will run a series of “default” tests on the DC it is invoked on, but it can be asked to run more tests and also test multiple DCs in the site (the /a switch) or across all sites (the /e switch). A quick glance at the DcDiag output is usually enough to tell me where to look further.

For instance, while typing this post I ran DcDiag to check all DCs in one of my sites:

C:\Users\Administrator>dcdiag /a

Directory Server Diagnosis

Performing initial setup:
   Trying to find home server...
   Home Server = WIN-DC01
   * Identified AD Forest.
   Done gathering initial info.

Doing initial required tests

   Testing server: COCHIN\WIN-DC01
      Starting test: Connectivity
         ......................... WIN-DC01 passed test Connectivity

   Testing server: COCHIN\WIN-DC03
      Starting test: Connectivity
         ......................... WIN-DC03 passed test Connectivity

Doing primary tests

   Testing server: COCHIN\WIN-DC01
      Starting test: Advertising
         ......................... WIN-DC01 passed test Advertising
      Starting test: FrsEvent
         ......................... WIN-DC01 passed test FrsEvent
      Starting test: DFSREvent
         ......................... WIN-DC01 passed test DFSREvent
      Starting test: SysVolCheck
         ......................... WIN-DC01 passed test SysVolCheck

   Testing server: COCHIN\WIN-DC03
      Starting test: Advertising
         ......................... WIN-DC03 passed test Advertising
      Starting test: FrsEvent
         ......................... WIN-DC03 passed test FrsEvent
      Starting test: DFSREvent
         The event log DFS Replication on server WIN-DC03.rakhesh.local could
         not be queried, error 0x6ba "The RPC server is unavailable."
         ......................... WIN-DC03 failed test DFSREvent
      Starting test: SysVolCheck
         ......................... WIN-DC03 passed test SysVolCheck
      Starting test: KccEvent
         The event log Directory Service on server WIN-DC03.rakhesh.local could
         not be queried, error 0x6ba "The RPC server is unavailable."
         ......................... WIN-DC03 failed test KccEvent

   Running partition tests on : ForestDnsZones
      Starting test: CheckSDRefDom
         ......................... ForestDnsZones passed test CheckSDRefDom
      Starting test: CrossRefValidation
         ......................... ForestDnsZones passed test
         CrossRefValidation

   Running partition tests on : DomainDnsZones
      Starting test: CheckSDRefDom
         ......................... DomainDnsZones passed test CheckSDRefDom
      Starting test: CrossRefValidation
         ......................... DomainDnsZones passed test
         CrossRefValidation

I ran the above from WIN-DC01 and you can see I was straight-away alerted that WIN-DC03 could be having issues. I say “could be” because the errors only say that DcDiag cannot contact the RPC server on WIN-DC03 to check for those particular tests – this doesn’t necessarily mean WIN-DC03 is failing those tests, just that maybe there’s a firewall blocking communication or perhaps the RPC service is down. To confirm this I ran the same test on WIN-DC03 and they succeeded, indicating that WIN-DC03 itself is fine so there’s a communication problem between DcDiag on WIN-DC01 and WIN-DC03. Moreover, DcDiag from WIN-DC03 can query WIN-DC01 so the issue is with WIN-DC03. (In this particular case it was the firewall on WIN-DC03).
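When there are more than a handful of DCs, glancing through the output gets tedious. Here is a minimal Python sketch that pulls the failed tests out of saved DcDiag output. It assumes the usual English result lines of the form "......................... NAME failed test TESTNAME" with normal spacing; the regular expression and the `failed_tests` helper are my assumptions about that layout, not anything DcDiag provides.

```python
import re

# Matches result lines such as:
#   ......................... WIN-DC03 failed test DFSREvent
RESULT_LINE = re.compile(r"\.{5,}\s*(\S+) (passed|failed) test\s*(\S*)")

def failed_tests(output):
    """Return a list of (target, test_name) for every failed test in `output`."""
    lines = output.splitlines()
    failures = []
    for index, line in enumerate(lines):
        match = RESULT_LINE.search(line)
        if not match:
            continue
        target, verdict, test_name = match.groups()
        if not test_name:
            # DcDiag wraps long result lines; the test name is then on the
            # next non-blank line.
            test_name = next(
                (later.strip() for later in lines[index + 1:] if later.strip()), ""
            )
        if verdict == "failed":
            failures.append((target, test_name))
    return failures

sample_output = """\
   Testing server: COCHIN\\WIN-DC03
      Starting test: Advertising
         ......................... WIN-DC03 passed test Advertising
      Starting test: DFSREvent
         ......................... WIN-DC03 failed test DFSREvent
      Starting test: KccEvent
         ......................... WIN-DC03 failed test KccEvent
"""

print(failed_tests(sample_output))
```

As noted above, a failure reported this way only says the test could not be completed from where DcDiag ran; it still needs a second look from the DC itself.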

Here’s a list of the tests DcDiag can perform:

Advertising

Checks whether the Directory System Agent (DSA) is advertising itself. The DSA is a set of services and processes running on every DC. The DSA is what allows clients to access the Active Directory data store. Clients talk to the DSA using LDAP (used by Windows XP and above), SAM (used by Windows NT), MAPI RPC (used by Exchange server and other MAPI clients), or RPC (used by DCs/DSAs to talk to each other and replicate AD information). More info on the DSA can be found in this Microsoft document.

You can think of the DSA as the kernel of the DC – the DSA is what lets a DC behave like a DC, the DSA is what we are really talking about when referring to DCs.

Although DNS is used by domain members (and other DCs) to locate DCs in the domain, for a DC to be actually used by others the DSA must be advertising the roles it provides. The nltest command can be used to view what roles a DSA is advertising. For example:

C:\Windows\system32>nltest /dsgetdc:rakhesh.local /server:win-dc01
           DC: \\WIN-DC01.rakhesh.local
      Address: \\fdcc:7c4e:3651:1::20
     Dom Guid: 6583a216-68d3-48f4-8145-01162042ccdf
     Dom Name: rakhesh.local
  Forest Name: rakhesh.local
 Dc Site Name: COCHIN
Our Site Name: COCHIN
        Flags: PDC GC DS LDAP KDC TIMESERV GTIMESERV WRITABLE DNS_DC DNS_DOMAIN DNS_FOREST CLOSE_SITE FULL_SECRET WS DS_8 DS_9
The command completed successfully

Notice the flags section. Among other things the DSA advertises that this DC holds the PDC FSMO role (PDC), is a Global Catalog (GC), and that it is a reliable time source (GTIMESERV). Compare the above output with another DC:

C:\Windows\system32>nltest /dsgetdc:rakhesh.local /server:win-dc03
           DC: \\WIN-DC03.rakhesh.local
      Address: \\fdcc:7c4e:3651:1::22
     Dom Guid: 6583a216-68d3-48f4-8145-01162042ccdf
     Dom Name: rakhesh.local
  Forest Name: rakhesh.local
 Dc Site Name: COCHIN
Our Site Name: COCHIN
        Flags: DS LDAP KDC TIMESERV WRITABLE DNS_DC DNS_DOMAIN DNS_FOREST CLOSE_SITE FULL_SECRET WS DS_8 DS_9
The command completed successfully

The PDC, GC, and GTIMESERV flags advertised by WIN-DC01 are missing here because this DC does not provide any of those roles. Being a DC it can act as a time source for domain members, hence the TIMESERV flag is present.

When DCs replicate they refer to each other via the DSA name rather than the DC name (further enforcing my point from before that the DSA can be thought of as the kernel of the DC – it is what really matters).

C:\Windows\system32>repadmin /syncall win-dc03
CALLBACK MESSAGE: The following replication is in progress:
    From: bdb02ab9-5103-4254-9403-a7687ba91488._msdcs.rakhesh.local
    To  : 33398129-7632-4014-a3b4-eabb2b74de8b._msdcs.rakhesh.local
CALLBACK MESSAGE: The following replication completed successfully:
    From: bdb02ab9-5103-4254-9403-a7687ba91488._msdcs.rakhesh.local
    To  : 33398129-7632-4014-a3b4-eabb2b74de8b._msdcs.rakhesh.local
CALLBACK MESSAGE: SyncAll Finished.
SyncAll terminated with no errors.

That is why even though a DC in my domain may have the DNS name WIN-DC01.rakhesh.local, in the additional structure that’s used by AD (which I’ll come to later) there’s an entry such as bdb02ab9-5103-4254-9403-a7687ba91488._msdcs.rakhesh.local which is a CNAME to the regular name. These CNAME entries are created by the Netlogon service and are of the format DsaGuid._msdcs.DNSForestName – the CNAME hostname is actually the GUID of the DSA.

If you open Active Directory Sites and Services, drill down to a site, then Servers, then expand a particular server – you’ll see the “NTDS Settings” object. This is the DSA. If you right click this object, go to Properties, and select the “Attribute Editor” tab, you will find an attribute called objectGUID. That is the GUID of the DSA – the same GUID that’s there in the CNAME entry.
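Given the objectGUID of an NTDS Settings object, the CNAME to look for in DNS follows directly from the DsaGuid._msdcs.DNSForestName format. A small Python sketch (the `dsa_cname` helper is hypothetical; the GUID and forest name are the ones from the repadmin output above):

```python
import uuid

def dsa_cname(dsa_guid, forest_dns_name):
    """Build the DsaGuid._msdcs.DNSForestName record name for a DSA.

    uuid.UUID() validates the GUID and normalizes it to lowercase with
    hyphens; it also accepts the GUID wrapped in curly braces.
    """
    guid = uuid.UUID(dsa_guid)
    return f"{guid}._msdcs.{forest_dns_name}"

name = dsa_cname("{BDB02AB9-5103-4254-9403-A7687BA91488}", "rakhesh.local")
print(name)
```

The resulting name can then be fed to nslookup or any resolver to confirm that the CNAME exists and points at the DC's regular host name.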

CheckSDRefDom

Before talking about CheckSDRefDom it’s worth talking about directory partitions (also called Naming Contexts (NC)).

An AD domain is part of a forest. A forest can contain many domains. All these domains share the same schema and configuration, but different domain data. Every DC in the forest thus has some data that’s particular to the domain it belongs to and is replicated with other DCs in the domain; and other data that’s common to the forest and replicated with all DCs in the forest. These are what’s referred to as directory partitions / naming contexts.

Every DC has four kinds of directory partitions. These can be viewed using the ADSI Edit (adsiedit.msc) tool:

“Default naming context” (also known as “Domain”) which contains the domain specific data;

“Configuration” (CN=Configuration,DC=forestRootDomain) which contains the configuration objects for the entire forest;

“Schema” (CN=Schema,CN=Configuration,DC=forestRootDomain) which contains class and attribute definitions for all existing and possible objects of the forest. Even though the Schema partition hierarchically looks like it is under the Configuration partition, it is a separate partition; and

“Application” (DC=...,DC=forestRootDomain – there can be many such partitions) which were introduced in Server 2003 and are user/application-defined partitions that can contain any object except security principals. The replication of these partitions is not bound by domain boundaries – they can be replicated to selected DCs in the forest even if they are in different domains.

Common examples of Application partitions are DC=ForestDnsZones,DC=forestRootDomain and DC=DomainDnsZones,DC=forestRootDomain which hold DNS zones replicated to all DNS servers in the forest and domain respectively (note that they are not replicated to all DCs in the forest and domain respectively, only a subset of the DCs – the ones that are also DNS servers).

If you open ADSI Edit and connect to the RootDSE “context”, then right click the RootDSE container and check its namingContexts attribute you’ll find a list of all directory partitions, including the multiple Application partitions.

Here you’ll also find other attributes such as:

defaultNamingContext (DN of the Domain directory partition the DC you are connected to is authoritative for),

rootDomainNamingContext (DN of the Domain directory partition for the Forest Root domain)

The Configuration partition has a container called Partitions (CN=Partitions,CN=Configuration,DC=forestRootDomain) which contains cross-references to every directory partition in the forest – i.e. Application, Schema, and Configuration directory partitions, as well as all Domain directory partitions. The beauty of cross-references is that they are present in the Configuration partition and hence replicated to all DCs in the forest. Thus even if a DC doesn’t hold a particular NC it can check these cross-references and identify which DC/ domain might hold more information. This makes it possible to refer clients asking for more info to other domains.

What the CheckSDRefDom test does is that it checks whether the cross-references have an attribute called msDS-SDReferenceDomain set.

What does this mean?

An Application NC, by definition, isn’t tied to a particular domain. That makes it tricky from a security point of view because if the security descriptor in its ACL refers to groups/users that could belong to any domain (e.g. “Domain Admins”, “Administrator”) there’s no way to identify which domain must be used as the reference.

To avoid such situations, cross references to Application directory partitions contain an msDS-SDReferenceDomain attribute which specifies the reference domain.

So what the CheckSDRefDom test really does is that it verifies all the Application directory partitions have a reference domain set.

In case a reference domain isn’t set, you can always set it using ADSI Edit or other techniques. You can also delegate this.

CheckSecurityError

Checks for any security related errors on the DC that might be causing replication issues.

Some of the tests done are:

Verify that KDC is working (not necessarily on the target DC, the test only checks that a KDC server is reachable anywhere in the domain, preferably in the same site; even if the target DC KDC service is down but some other KDC server is reachable the test passes)

Verify that the DC’s computer object exists and is within the “Domain Controllers” OU and replicated to other DCs

CrossRefValidation

Application NCs are actually objects of a class domainDNS with an instanceType attribute value of 5 (DS_INSTANCETYPE_IS_NC_HEAD | DS_INSTANCETYPE_NC_IS_WRITEABLE).
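
To make the bit arithmetic concrete, here is a minimal Python sketch of how the value 5 breaks down. The numeric values 0x1 and 0x4 are the ones these two DS_INSTANCETYPE_* constants carry; the class and function names are just mine for illustration, and other instanceType bits are deliberately left out.

```python
from enum import IntFlag


class InstanceType(IntFlag):
    """Only the two instanceType bits mentioned in the text are modelled here."""
    IS_NC_HEAD = 0x1        # DS_INSTANCETYPE_IS_NC_HEAD: object is the head of an NC
    NC_IS_WRITEABLE = 0x4   # DS_INSTANCETYPE_NC_IS_WRITEABLE: the replica is writable


def flag_names(value):
    """Return the names of the known flags that are set in an instanceType value."""
    return [flag.name for flag in InstanceType if value & flag]


if __name__ == "__main__":
    app_nc = InstanceType.IS_NC_HEAD | InstanceType.NC_IS_WRITEABLE
    print(int(app_nc), flag_names(int(app_nc)))
```

OR-ing the two flags gives 1 + 4 = 5, which is why that is the value to type into ADSI Edit.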

You can create an application NC, for instance, by opening ADSI Edit, going to the Domain NC, right-clicking and choosing a new object of type domainDNS, entering whatever Domain Component (DC) value you want, clicking Next, then clicking “More Attributes”, selecting to view Mandatory/ Both types of properties, finding instanceType in the property drop-down list, and entering a value of 5. The above can be done anywhere in the domain NC. It is also possible to nest application NCs within other application NCs.

Here’s what happens behind the scenes when you make an application NC as above:

The application NC isn’t created straight away.

First, the DSA will check the cross-references in CN=Partitions,CN=Configuration,DC=forestRootDomain to see if one already exists to an Application NC with the same name as you specified.

If a cross-reference is found and the NC it points to actually exists then an error will be thrown.

If a cross-reference is found but the NC it points to doesn’t exist, then that cross-reference will be used for the new Application NC.

If a cross-reference cannot be found, a new one will be created.
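
The three cases above boil down to a small decision table. Here is a rough Python sketch of it; this is only my reading of the steps, not how the DSA is actually implemented, and the function name, return strings, and the two input sets are all made up for illustration.

```python
def plan_cross_reference(nc_dn, cross_refs, existing_ncs):
    """Decide what happens to the cross-reference for a requested application NC.

    cross_refs   -- DNs that already have a crossRef in CN=Partitions (hypothetical input)
    existing_ncs -- DNs of NCs that actually exist (hypothetical input)
    """
    key = nc_dn.lower()  # DNs are compared case-insensitively
    has_ref = key in {dn.lower() for dn in cross_refs}
    nc_exists = key in {dn.lower() for dn in existing_ncs}

    if has_ref and nc_exists:
        # Case 1: a cross-reference exists and points to a real NC.
        raise ValueError(f"{nc_dn} already exists")
    if has_ref:
        # Case 2: a dangling cross-reference is reused for the new NC.
        return "reuse-cross-reference"
    # Case 3: nothing found, so a new cross-reference is created.
    return "create-cross-reference"


if __name__ == "__main__":
    refs = {"DC=SomeApp,DC=rakhesh,DC=local"}
    print(plan_cross_reference("DC=SomeApp,DC=rakhesh,DC=local", refs, set()))
    print(plan_cross_reference("DC=SomeApp2,DC=rakhesh,DC=local", refs, set()))
```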

Cross references (objects of class crossRef) have some important attributes:

CN – the CN of this cross-reference (could be a name such as “CN=SomeApp” or a random GUID “CN=a97a34e3-f751-489d-b1d7-1041366c2b32”)

nCName – the DN of the application NC (e.g. DC=SomeApp,DC=rakhesh,DC=local)

dnsRoot – the DNS domain name where servers that contain this NC can be found (e.g. SomeApp.rakhesh.local).

(Note this as it’s quite brilliant!) When a new application NC is created, DSA also creates a corresponding zone in DNS. This zone contains all the servers that carry this NC. In the screenshot below, for instance, note the zones DomainDnsZones, ForestDnsZones, and SomeApp2 (which belongs to an application NC I created). Note that by querying for all SRV records of name _ldap in _tcp.SomeApp2.rakhesh.local one can easily find the DCs carrying this partition. For the example above, dnsRoot would be “SomeApp2.rakhesh.local” as that’s the DNS domain name.

msDS-NC-Replica-Locations – a list of Distinguished Names (DNs) of DSAs where this application NC is replicated to (e.g. CN=NTDS Settings,CN=WIN-DC01,CN=Servers,CN=COCHIN,CN=Sites,CN=Configuration,DC=rakhesh,DC=local, CN=NTDS Settings,CN=WIN-DC03,CN=Servers,CN=COCHIN,CN=Sites,CN=Configuration,DC=rakhesh,DC=local). Initially this attribute has only one entry – the DC where the NC was first created. Other entries can be added later.

Enabled – usually not set, but if it’s set to FALSE it indicates the cross-reference is not in use

Once a cross-reference is identified (an existing or a new one) the Configuration NC is replicated through the forest. Then the Application NC is actually created (an object of class domainDNS, as mentioned earlier, with an instanceType attribute value of 5, i.e. DS_INSTANCETYPE_IS_NC_HEAD | DS_INSTANCETYPE_NC_IS_WRITEABLE).

Lastly, all DCs that hold a copy of this Application NC have their ms-DS-Has-Master-NCs attribute in the DSA object modified to include a DN of this NC.

Back to the CrossRefValidation test, it validates the cross-references and the NCs they point to. For instance:

Ensure the DN (and CN) are not mangled (in case of conflicts AD can “mangle” the names to reflect that there’s a conflict) (see here for an example of mangled entries)

CutoffServers

If you open AD Sites and Services, expand down to each site, the servers within them, and the NTDS Settings object under each server (which is basically the DSA), you can see the replication partners of each server. For instance here are the partners for two of my servers in one site:

The reason WIN-DC01 has links to both WIN-DC03 (in the same site as it) and WIN-DC02 (in a different site), while WIN-DC03 only has links to WIN-DC01 (and not WIN-DC02, which is in a different site), is that WIN-DC01 is acting as the bridgehead server. The bridgehead server is the server that’s automatically chosen by AD to replicate changes between sites. Each site has a bridgehead server and these servers talk to each other for replication across the site link. All other DCs in the site only get inter-site changes via the bridgehead server of that site. More on it later when I talk about bridgehead servers some other day … for now this is a good post to give an intro on bridgehead servers.

WIN-DC02, which is my DC in the other site, similarly has only one replication partner, WIN-DC01. So WIN-DC01 is kind of like the link between WIN-DC02 and WIN-DC03. If WIN-DC01 were to be offline then WIN-DC02 and WIN-DC03 would be cut off from each other (for a period, until the mechanism that creates the topology between sites kicks in and makes WIN-DC03 the bridgehead server between sites; or even forever if I “pin” WIN-DC01 as my preferred bridgehead server, in which case when it goes down no one else can take over). Or if the link that connects the two sites to each other were to fail, again they’d be cut off from each other.

So what the CutoffServers test does is that it tells you if any servers are cut-off from each other in the domain.
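
Conceptually this is a graph reachability check. The sketch below is my own toy model in Python, not DcDiag's actual algorithm: it treats each replication link as two-way (a simplifying assumption, since real connection objects are inbound and one-way) and uses the three servers from my lab.

```python
from collections import deque


def cut_off_from(home, links, offline=()):
    """Return the online servers that `home` cannot reach via replication links.

    links   -- iterable of (server_a, server_b) pairs, treated as two-way (assumption)
    offline -- servers that are down or disconnected
    """
    links = list(links)
    online = {server for pair in links for server in pair} - set(offline)
    if home not in online:
        return set()

    # Keep only the links whose both ends are still online.
    neighbours = {server: set() for server in online}
    for a, b in links:
        if a in online and b in online:
            neighbours[a].add(b)
            neighbours[b].add(a)

    # Breadth-first search from the home server over the surviving links.
    reachable = {home}
    queue = deque([home])
    while queue:
        current = queue.popleft()
        for peer in neighbours[current] - reachable:
            reachable.add(peer)
            queue.append(peer)
    return online - reachable


if __name__ == "__main__":
    links = [("WIN-DC01", "WIN-DC03"), ("WIN-DC01", "WIN-DC02")]
    for home in ("WIN-DC02", "WIN-DC03"):
        print(home, "is cut off from", sorted(cut_off_from(home, links, offline={"WIN-DC01"})))
```

With WIN-DC01 offline, each of the two remaining DCs ends up with the other in its cut-off set; with everything online both sets are empty.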

This test is not run by default. It must be explicitly specified.

This test is best run with the /e switch – which tells DcDiag to test all servers in the enterprise, across sites. In my experience if it’s run against a specific server it usually passes the test even if replication is down.

Also in my experience, if a server is up and running but only LDAP is down (maybe the AD DS service is stopped for instance) – and so it can’t replicate with partners and they are cut off – the test doesn’t identify the servers as being cut off. If the server/ link is down then the other servers are highlighted as cut off.

For example I set WIN-DC01 as the preferred bridgehead in my scenario above. Then I disconnect it from the network, leaving WIN-DC02 and WIN-DC03 cut-off.

However if I run CutoffServers for the enterprise both WIN-DC02 and WIN-DC03 are correctly flagged:

C:\Windows\system32>dcdiag /test:cutoffservers /n:rakhesh.local /e

The distinguished name of the domain is DC=rakhesh,DC=local.

Directory Server Diagnosis

Performing initial setup:
   Finding server for domain DC=rakhesh,DC=local...
   Server for domain = WIN-DC03.rakhesh.local
   * Identified AD Forest.
   Ldap search capability attribute search failed on server WIN-DC01, return
   value = 81
   Got error while checking if the DC is using FRS or DFSR. Error:
   Win32 Error 81 The VerifyReferences, FrsEvent and DfsrEvent tests might fail
   because of this error.
   Done gathering initial info.

Doing initial required tests

   Testing server: COCHIN\WIN-DC01
      Starting test: Connectivity
         Server WIN-DC01 resolved to these IP addresses: 10.50.0.20, but none
         of the addresses could be reached (pinged). Please check the network.
         Error: 0x2b02 "Error due to lack of resources."
         This error more often means that the targeted server is shutdown or
         disconnected from the network.
         Got error while checking LDAP and RPC connectivity. Please check your
         firewall settings.
         ......................... WIN-DC01 failed test Connectivity

   Testing server: KOTTAYAM\WIN-DC02
      Starting test: Connectivity
         ......................... WIN-DC02 passed test Connectivity

   Testing server: COCHIN\WIN-DC03
      Starting test: Connectivity
         ......................... WIN-DC03 passed test Connectivity

Doing primary tests

   Testing server: COCHIN\WIN-DC01

   Testing server: KOTTAYAM\WIN-DC02
      Starting test: CutoffServers
         Upstream topology is disconnected for DC=rakhesh,DC=local.
         Home server WIN-DC02 can't get changes from these servers:
            COCHIN/WIN-DC03
         Downstream topology is disconnected for DC=rakhesh,DC=local.
         Home server WIN-DC02 can't get changes from these servers:
            COCHIN/WIN-DC03
         ......................... WIN-DC02 failed test CutoffServers

   Testing server: COCHIN\WIN-DC03
      Starting test: CutoffServers
         Upstream topology is disconnected for DC=rakhesh,DC=local.
         Home server WIN-DC03 can't get changes from these servers:
            KOTTAYAM/WIN-DC02
         Downstream topology is disconnected for DC=rakhesh,DC=local.
         Home server WIN-DC03 can't get changes from these servers:
            KOTTAYAM/WIN-DC02
         ......................... WIN-DC03 failed test CutoffServers

   Running partition tests on : rakhesh

   Running enterprise tests on : rakhesh.local

Not only is WIN-DC01 flagged in the Connectivity tests but the CutoffServers test also fails WIN-DC02 and WIN-DC03.
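
When there are more than a handful of DCs it is easier to pull the failures out of the output with a script. Here is a small Python sketch that does so, assuming English-language DcDiag output where each result line has the form "......... SERVER passed/failed test NAME"; the function name and the sample text are just for illustration.

```python
import re

# A result line: a run of dots, the server (or partition) name, the outcome, the test name.
RESULT_LINE = re.compile(
    r"^\s*\.{5,}\s*(?P<server>\S+)\s+(?P<outcome>passed|failed)\s+test\s+(?P<test>\S+)\s*$"
)


def failed_tests(output):
    """Map each server name to the list of tests it failed, in order of appearance."""
    failures = {}
    for line in output.splitlines():
        match = RESULT_LINE.match(line)
        if match and match["outcome"] == "failed":
            failures.setdefault(match["server"], []).append(match["test"])
    return failures


if __name__ == "__main__":
    sample = """\
         ......................... WIN-DC01 failed test Connectivity
         ......................... WIN-DC02 passed test Connectivity
         ......................... WIN-DC02 failed test CutoffServers
         ......................... WIN-DC03 failed test CutoffServers
"""
    for server, tests in failed_tests(sample).items():
        print(server, "failed:", ", ".join(tests))
```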

The /v switch (verbose) is also useful with this test. It will also show which NCs are failing due to the server being cut-off.

DcPromo

Checks whether, from a DNS point of view, the target server can be made a Domain Controller. If the test fails, suggestions are given.

The test has some mandatory switches:

/dnsdomain:...

/NewForest (a new forest) or /NewTree (a new domain in the forest you specify via /ForestRoot:...) or /ChildDomain (a new child domain) or /ReplicaDC (another DC in the same domain)

Needless to say this test isn’t run by default.

DNS

Checks the DNS health of the whole enterprise. It has many sub-tests. By default all sub-tests except one are run, but you can do specific sub-tests too.

This TechNet page is a comprehensive source of info on what the DNS test does. Tests include checking for zones, network connectivity, client configuration, delegations, dynamic updates, name resolution, and so on.

This test is not run by default.

Since it is an enterprise-wide test DcDiag requires Enterprise Admin credentials to run tests.

FrsEvent

Checks for any errors with the File Replication Service (FRS).

It doesn’t seem to do an active test. It only checks the FRS Event Logs for any messages in the last 24 hours. If FRS is not used in the domain the test is silently skipped. (Specifying the /v switch will show that it’s being skipped).

Take the results with a pinch of salt. Chances are you had some errors that are now resolved, but since the last 24 hours’ worth of logs are checked the test will flag previous error messages. Also, FRS may be in use for non-SYSVOL replication and this might have errors, but that doesn’t really matter as far as the DCs are concerned.

There may also be spurious errors if a server’s Event Log is not accessible remotely and so the test fails.

DFSREvent

Checks for any errors with the Distributed File System Replication (DFSR).

Similar to the FrsEvent test. Same caveats apply.

SysVolCheck

Checks whether the SYSVOL share is ready.

In my experience this test doesn’t seem to actually check whether the SYSVOL share is accessible. For example, consider the following:

As an aside, in the case above the Netlogons test will flag the share as missing:

C:\Windows\system32>dcdiag

Directory Server Diagnosis

Performing initial setup:
   Trying to find home server...
   Home Server = WIN-DC02
   * Identified AD Forest.
   Done gathering initial info.

Doing initial required tests

   Testing server: KOTTAYAM\WIN-DC02
      Starting test: Connectivity
         ......................... WIN-DC02 passed test Connectivity
      Starting test: NetLogons
         Unable to connect to the SYSVOL share! (\\WIN-DC02\sysvol)
         [WIN-DC02] An net use or LsaPolicy operation failed with error 67,
         The network name cannot be found..
         ......................... WIN-DC02 failed test NetLogons

There is a registry key HKLM\System\CurrentControlSet\Services\Netlogon\Parameters\SysvolReady which has a value of 1 when SYSVOL is ready and a value of 0 when SYSVOL is not ready. Even if I turn this value to 0 – thus disabling SYSVOL, so the SYSVOL and NETLOGON shares stop being shared – the SysvolCheck test still passes. NetLogons flags an error though.

The first of my (hopefully!) many posts on Active Directory, based on the WorkshopPLUS sessions I attended last month. Progress is slow as I don’t have much time, plus I am going through the slides and my notes and adding more information from the Internet and such.

This one’s on the services that are critical for Domain Controllers to function properly.

DHCP Client

In Server 2003 and before the DHCP Client service registers A, AAAA, and PTR records for the DC with DNS

In Server 2008 and above this is done by the DNS Client

Note that only the A and PTR records are registered. Other records are registered by the Netlogon service.

File Replication Services (FRS)

Replicates SYSVOL amongst DCs.

Starting with Server 2008 it is in maintenance mode. DFSR replaces it.

To check whether your domain is still using FRS for SYSVOL replication, open the DFS Management console and see whether the “Domain System Volume” entry is present under “Replication” (if it is not, see whether it is available for adding to the display). If it is present then your domain is using DFSR for SYSVOL replication.

Alternatively, type the following command on your DC. If the output says “Eliminated” as below, your domain is using DFSR for SYSVOL. (Note this only works with domain functional level 2008 and above).

C:\>dfsrmig /getglobalstate

Current DFSR global state: 'Eliminated'
Succeeded.

Stopping FRS for long periods can result in Group Policy distribution errors as SYSVOL isn’t replicated. Event ID 13568 in FRS log.

DFS Replication (DFSR)

Apart from the dfsrmig command mentioned in the FRS section, the HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\DFSR\Parameters\SysVols\Migrating Sysvols\LocalState registry key can also be checked to see if DFSR is in use (a value of 3 means it is in use).
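
Both checks are easy to script. Here is a small Python sketch that interprets the two values; it does not run dfsrmig or read the registry itself, it only parses text and numbers you hand it. The state names for 0 to 3 are the SYSVOL migration states (Start, Prepared, Redirected, Eliminated); the function names are my own.

```python
import re

# SYSVOL migration states, keyed by their numeric value.
MIGRATION_STATES = {0: "Start", 1: "Prepared", 2: "Redirected", 3: "Eliminated"}


def global_state(dfsrmig_output):
    """Extract the state name from the text printed by `dfsrmig /getglobalstate`."""
    match = re.search(r"global state:\s*'([^']+)'", dfsrmig_output, re.IGNORECASE)
    return match.group(1) if match else None


def sysvol_uses_dfsr(local_state):
    """True when a LocalState value means the migration to DFSR is complete (3)."""
    return MIGRATION_STATES.get(local_state) == "Eliminated"


if __name__ == "__main__":
    output = "Current DFSR global state: 'Eliminated'\nSucceeded."
    print(global_state(output), sysvol_uses_dfsr(3))
```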

If a DC is offline/ disconnected from its peers for a long time and Content Freshness Protection is turned on, when the DC is online/ reconnected DFSR might block SYSVOL replication to & from this DC – resulting in Group Policy distribution errors.

Content Freshness Protection is off by default on Server 2008 and 2008 R2, where it needs to be manually turned on for each server. (From Server 2012 onwards it is on by default, with a 60-day limit.)

Content Freshness Protection exists because of the way deletions work.

DFSR is multi-master, like AD, which means changes can be made on any server.

When you delete an item on one server, it can’t simply be deleted because then the item won’t exist any more and there’s no way for other servers to know if that’s the case because the item was deleted or because it wasn’t replicated to that server in the first place.

So what happens is that a deleted item is “tombstoned“. The item is removed from disk but a record for it remains in the DFSR database for 60 days (this period is called the “tombstone lifetime”) indicating this item as being deleted.

During these 60 days other DFSR servers can learn that the item is marked as deleted and thus act upon their copy of the item. After 60 days the record is removed from the database too.

In such a context, say we have a DC that is offline for more than 60 days and say we have other DCs where files were removed from SYSVOL (replicated via DFSR). All the other DCs no longer have a copy of the file nor a record that it is deleted, as 60 days have passed and the file is removed for good.

When the previously offline DC replicates, it still has a copy of the file and it will pass this on to the other DCs. The other DCs don’t remember that this file was deleted (because they don’t have a record of its deletion any more as 60 days have passed) and so will happily replicate this file to their folders – resulting in a deleted file now appearing and causing corruption.

It is to avoid such situations that Content Freshness Protection was invented and is recommended to be turned on.
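
The rule itself is simple enough to sketch in a few lines of Python. This is a toy model of the idea rather than DFSR's real logic, and it assumes the freshness limit equals the 60-day tombstone lifetime mentioned above; the function names are made up.

```python
from datetime import datetime, timedelta

TOMBSTONE_LIFETIME = timedelta(days=60)  # assumed to double as the freshness limit


def is_stale(last_replicated, now, max_offline=TOMBSTONE_LIFETIME):
    """True if the member has been out of touch longer than tombstones are kept."""
    return now - last_replicated > max_offline


def may_replicate(last_replicated, now, freshness_protection=True):
    """With protection on, a stale member is blocked instead of resurrecting files."""
    return not (freshness_protection and is_stale(last_replicated, now))


if __name__ == "__main__":
    now = datetime(2014, 6, 1)
    long_offline = now - timedelta(days=61)
    print("protection on :", may_replicate(long_offline, now))
    print("protection off:", may_replicate(long_offline, now, freshness_protection=False))
```

With protection off the stale DC is allowed to replicate, which is exactly the case where deleted files come back.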

Here’s a good blog post from the Directory Services team explaining Content Freshness Protection.

DNS Client

For Server 2008 and above registers the A, AAAA, and PTR records for the DC with DNS (notice that when you change the DC IP address you do not have to update DNS manually – it is updated automatically. This is because of the DNS Client service).

Note that only the A, AAAA, and PTR records are registered. Other records are registered by the Netlogon service.

DNS Server

The glue for Active Directory. DNS is what domain controllers use to locate each other. DNS is what client computers use to find domain controllers. If this service is down both these functions fail.

Kerberos Key Distribution Center (KDC)

Required for Kerberos 5.0 authentication. AD domains use Kerberos for authentication. If the KDC service is stopped Kerberos authentication fails.

NTLM is not affected by this service.

Netlogon

Maintains the secure channel between DCs and domain members (including other DCs). This secure channel is used for authentication (NTLM and Kerberos) and DC replication.

Writes the SRV and other records to DNS. These records are what domain members use to find DCs.

The records are also written to a file %systemroot%\system32\config\Netlogon.DNS. If the DNS server doesn’t support dynamic updates then the records in this text file must be manually created on the DNS server.

Windows Time

The Windows Time service on every domain member looks to the DC that authenticates it for time updates.

DCs in the domain look to the domain PDC for time updates.

Domain PDCs look to the domain PDC of the domain above/ sibling to them. The exception is the forest root domain PDC, which gets time from an external source (hardware source, Internet, etc).

From this link: there are two registry keys HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\W32Time\Config\MaxPosPhaseCorrection and HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\W32Time\Config\MaxNegPhaseCorrection that restrict the time updates accepted by the Windows Time service to the number of seconds defined by these values (the maximum and minimum range). This can be set directly in the registry or via a GPO. The recommended value is 172800 (i.e. 48 hours).
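
In other words the two values form a window around the current time. Here is a minimal Python sketch of that check, assuming both values are set to the recommended 172800 seconds and that a correction exactly at the limit is still accepted; the function name and sign convention are mine.

```python
MAX_POS_PHASE_CORRECTION = 172800  # seconds, i.e. 48 hours forward
MAX_NEG_PHASE_CORRECTION = 172800  # seconds, i.e. 48 hours backward


def correction_allowed(offset_seconds,
                       max_pos=MAX_POS_PHASE_CORRECTION,
                       max_neg=MAX_NEG_PHASE_CORRECTION):
    """Return True if a time correction of `offset_seconds` falls inside the window.

    A positive offset means the clock would be moved forward, a negative one backward.
    """
    if offset_seconds >= 0:
        return offset_seconds <= max_pos
    return -offset_seconds <= max_neg


if __name__ == "__main__":
    for offset in (3600, -3600, 172800, 200000, -172801):
        verdict = "accepted" if correction_allowed(offset) else "rejected"
        print(f"{offset:>8} s: {verdict}")
```

So a source that is an hour off gets followed, while one that is several days off is ignored rather than dragging the whole domain along with it.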

w32tm

The w32tm command can be used to manage time. For instance:

To get an idea of the time situation in the domain (who is the master time keeper, what is the offset of each of the DCs from this time keeper):

w32tm /monitor

To ask the Windows Time service to resync as soon as possible (the command can target a remote computer too via the /computer: switch)

w32tm /resync

Same as above but before resyncing redetect any network configuration changes and rediscover the sources:

w32tm /resync /rediscover

To get the status of the local computer (use the /computer: switch to target a different computer)

w32tm /query /status

To show what time sources are being used:

w32tm /query /source

To show who the peers are:

w32tm /query /peers

To show the current time zone:

w32tm /tz

You can’t change the time zone using this command; you have to do:

tzutil /s "Time Zone Name"

On the PDC in the forest root domain you would typically run a command like this if you want it to get time from an NTP pool on the Internet:

the /manualpeerlist: switch specifies a list of peers to sync time from (in this example the NTP Pool servers on the Internet);

the /update switch tells w32tm to update the Windows Time service with this configuration change;

the /syncfromflags:MANUAL tells the Windows Time service that it must only sync from these sources (other options such as “DOMHIER” tells it to sync from the domain peers only, “NO” tells it sync from none, “ALL” tells it to sync from both the domain peers and this manual list);

the /reliable:YES switch marks this machine as special in that it is a reliable source of time for the domain (read this link on what gets set when you set a machine as RELIABLE).

Note: You must manually configure the time source on the PDC in the forest root domain and mark it as reliable. If that server were to fail and you transfer the role to another DC, be sure to repeat the step.

On other machines in the domain you would run a command like this:

w32tm /config /update /syncfromflags:DOMHIER /reliable:NO

This tells those DCs to follow the domain hierarchy (and only the domain hierarchy) and that they are not reliable time sources (this switch is not really needed if these other DCs are not PDCs).
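
If you script this across many machines, it helps to build the command line in one place. Below is a small Python sketch that only assembles the string from the switches discussed above (/manualpeerlist, /syncfromflags, /reliable, /update); it does not run w32tm. The function name is mine, and the pool host names in the example are just placeholders for whatever peers you actually use.

```python
VALID_SYNC_FLAGS = {"MANUAL", "DOMHIER", "NO", "ALL"}


def w32tm_config(sync_from, reliable, peers=()):
    """Assemble a `w32tm /config` command line as a string (nothing is executed)."""
    if sync_from not in VALID_SYNC_FLAGS:
        raise ValueError(f"unknown syncfromflags value: {sync_from}")
    parts = ["w32tm", "/config"]
    if peers:
        # The peer list is a single space-delimited value, so it is quoted.
        parts.append('/manualpeerlist:"{}"'.format(" ".join(peers)))
    parts.append(f"/syncfromflags:{sync_from}")
    parts.append("/reliable:YES" if reliable else "/reliable:NO")
    parts.append("/update")
    return " ".join(parts)


if __name__ == "__main__":
    # Forest root PDC: manual peers (placeholder host names), marked reliable.
    print(w32tm_config("MANUAL", True, ["0.pool.ntp.org", "1.pool.ntp.org"]))
    # Every other machine: follow the domain hierarchy.
    print(w32tm_config("DOMHIER", False))
```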

Active Directory Domain Services (AD DS)

Provides the DC services. If this service is stopped the DC stops acting as a DC.

Pre-Server 2008 this service could not be stopped while the OS was online. But since Server 2008 it can be stopped and started.

The Active Directory Database Mounting Tool was new to me so here’s a link to what it does. It’s a pretty cool tool. Starting from Server 2008 you can take AD DS and AD LDS snapshots via the Volume Shadow Copy Service (VSS) (I am writing a post on VSS side by side so expect to see one soon). This makes use of the NTDS VSS writer, which ensures that consistent snapshots of the AD databases can be taken. The AD snapshots can be taken manually via the ntdsutil snapshot command or via backup software or even via images of the whole system. Either way, once you have such snapshots you can mount the one(s) you want via ntdsutil and point the Active Directory Database Mounting Tool to it. As the tool name says, it “mounts” the AD database in the snapshot and exposes it as an LDAP server. You can then use tools such as ldp.exe or AD Users and Computers to go through this instance of the AD database. More info on this tool can be found at this and this link.

AD WS is what the PowerShell Active Directory module connects to.

It is also what the new Active Directory Administrative Center (which in turn uses PowerShell) connects to.

AD WS is installed automatically when the AD DS or AD LDS roles are installed. It is only activated once the server is promoted to a DC or if an AD LDS instance is created on it.